CN109214504B - FPGA-based YOLO network forward reasoning accelerator design method

Info

Publication number
CN109214504B
Authority
CN
China
Prior art keywords
layer
convolution
bram
network
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810970836.2A
Other languages
Chinese (zh)
Other versions
CN109214504A (en)
Inventor
张轶凡
陈昊
应山川
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Original Assignee
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications filed Critical Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority to CN201810970836.2A
Publication of CN109214504A
Application granted
Publication of CN109214504B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a design method for an FPGA-based YOLO network forward reasoning accelerator. The accelerator comprises an FPGA chip and a DRAM; the memory BRAM inside the FPGA chip is used as a data buffer and the DRAM is used as the main storage device. The accelerator design method comprises the following steps: (1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer; (2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network; (3) position mapping. The method addresses the technical problem that storage resources on FPGA chips grow more slowly than neural network sizes, which makes it difficult to port a general object detection network to an FPGA chip with the traditional design approach, and achieves higher speed with fewer on-chip resources.

Description

FPGA-based YOLO network forward reasoning accelerator design method
Technical Field
The invention relates to the technical field of deep learning and hardware architecture design, and in particular to a design method for accelerating the forward reasoning of an object detection network on an FPGA (field programmable gate array).
Background
In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied to computer vision tasks. However, large-scale CNNs are computation-intensive, storage-intensive and resource-hungry, which poses great challenges for these tasks. Facing such high computational pressure and large data throughput, conventional general-purpose processors struggle to meet practical requirements, so hardware accelerators based on GPUs, FPGAs or ASICs have been proposed and widely deployed.
The FPGA (Field Programmable Gate Array) is a further development of programmable devices such as PAL, GAL and EPLD. As a semi-custom circuit in the ASIC field, it both avoids the drawbacks of fully custom circuits and overcomes the limited gate count of earlier programmable devices. The FPGA adopts the logic cell array (LCA) concept and internally comprises configurable logic blocks (CLB), input/output blocks (IOB) and interconnect, and a single PROM can program multiple FPGAs. Thanks to its flexible reconfigurability and excellent performance-to-power ratio, the FPGA has become an important deep learning processor today.
The mainstream object detection network currently suited to hardware implementation is YOLO (You Only Look Once). The network is fast and structurally simple: it treats object detection as a regression problem, directly predicting the positions and class probabilities of target boxes from an input image with a convolutional neural network, achieving end-to-end detection, and this structure is well suited to hardware implementation on an FPGA. Patent CN107392309A discloses a general fixed-point neural network convolution accelerator hardware architecture based on an FPGA, comprising a general AXI4 high-speed bus interface, a highly parallel convolution kernel and feature map data cache, a segmented convolution result cache, a convolution calculator, a cache controller, a state controller and a direct memory access controller. That design uses on-chip storage as a buffer and off-chip memory as the main data storage, and relies on a general-purpose processor outside the chip to manage the memory and complete the computation of the whole convolutional network. Patent CN107463990A provides an FPGA parallel acceleration method for a convolutional neural network comprising the following steps: (1) establishing a CNN model; (2) configuring a hardware architecture; (3) configuring a convolution operation unit. That design loads the temporary computation results of the whole network into on-chip storage, so the scale of the network that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store all intermediate results of the network layers in on-chip static memory and the weights required by the network in off-chip dynamic memory, so the on-chip memory capacity limits the scale of the network that can be accelerated. As the requirements for task complexity and accuracy grow, convolutional neural networks keep getting larger and their total parameter counts keep increasing, while FPGA process technology and the storage resources that fit on chip do not grow as quickly; with this design approach, the FPGA can no longer accommodate networks of such scale.
If the on-chip static memory BRAM is instead used as a data buffer and the off-chip dynamic memory DRAM as the main data storage of the network, the huge capacity of the dynamic memory can accommodate networks with large parameter counts, and parallel computation of the convolution modules is achieved by allocating the memory bandwidth appropriately. The performance of this approach depends on memory bandwidth, but raising communication bandwidth is easier than stacking on-chip storage resources. The network used by the invention is a version of YOLO-tiny: the input size is 416 × 416 × 3, the network has 9 convolution layers, the final output is a set of candidate frames carrying class, position and confidence information, and the computation result is mapped back onto the original image by a region mapping (region operation) algorithm.
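For reference, the following is a minimal sketch of the standard YOLOv2-tiny (VOC) topology that matches this description (nine convolution layers, 416 × 416 × 3 input, 13 × 13 output grid with 5 anchor boxes). The filter counts and pooling placement are taken from the publicly available tiny-yolo-voc configuration and are an assumption here; the patent itself does not list them.

```python
# Assumed YOLOv2-tiny (VOC) layer list; each entry is
# (kernel_size, output_channels, followed_by_2x2_max_pool).
YOLO_TINY_LAYERS = [
    (3,   16, True),   # layer 1: 416 -> 208 after pooling
    (3,   32, True),   # layer 2: 208 -> 104
    (3,   64, True),   # layer 3: 104 -> 52
    (3,  128, True),   # layer 4:  52 -> 26
    (3,  256, True),   # layer 5:  26 -> 13
    (3,  512, True),   # layer 6: pool has stride 1, resolution stays 13
    (3, 1024, False),  # layer 7
    (3, 1024, False),  # layer 8
    (1,  125, False),  # layer 9: 5 anchors x (4 box coords + confidence + 20 classes)
]
```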
Disclosure of Invention
To solve the technical problems that storage resources on FPGA chips grow more slowly than neural network sizes and that a general object detection network is hard to port to an FPGA chip following the traditional design approach, the invention provides an FPGA-based YOLO network forward reasoning accelerator targeting the YOLO-tiny network and the KC705 development platform. The specific technical scheme is as follows:
a design method of a FPGA-based YOLO network forward reasoning accelerator comprises an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as a main storage device, and a ping-pong structure is used in the DRAM; the design method of the accelerator is characterized by comprising the following steps of:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: for a given decimal point position of the 8-bit fixed-point format, establish the 256 decimal values it can represent, including positive zero and negative zero; quantize the original data by rounding each value to the nearest entry; represent the quantized values as 32-bit floating point for convenient calculation and obtain the detection accuracy of this quantization scheme; traverse the 8 candidate decimal point positions to find the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) normalizing the input feature map to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers.
Specifically, the calculation process of each layer of convolutional network in the step (2) is as follows:
a) reading weight data required by the calculation of the current round from the DRAM, and placing the weight data into the BRAM;
b) reading feature map data (FM) to be convolved in the layer to complete all input data preparation;
c) performing the convolution calculation; after one round of convolution calculation is completed, uploading the data in the BRAM to the DRAM, emptying the temporary result data, and then starting the next round of calculation.
Specifically, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation; the convolution result obtained is accumulated into the convolution calculation after the input channel is switched, and each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
Specifically, step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated. The specific process is as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM. In this way, once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time.
The BRAM in steps (2) a) and b) is set to a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. The BRAM in step c) is set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
Specifically, the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the storage scheme of the feature map data in step (2) is as follows: for the buffers of a) and b), layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for c), one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames; the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame. The relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and the relative width and height are mapped to absolute values through an e exponent.
The output candidate frames of the convolutional network carry confidence information used for the NMS operation, and the specific calculation steps are as follows:
a) sequentially extracting the center point coordinates of each candidate frame, and setting a flag bit for each candidate frame to indicate whether the candidate frame is retained;
b) selecting the first candidate frame as the comparison object and calculating the center point distance to each compared candidate frame after it; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that the candidate frame needs to be retained, otherwise the flag bit becomes invalid and the frame does not participate in subsequent distance comparisons; when the compared object traverses to the last frame of the queue, the comparison object is replaced, namely by the next candidate frame after the previous comparison object whose flag bit is still valid;
c) extracting all candidate frames with valid flag bits from the result memory, and generating marked frames drawn on the original image as the final detection result.
The invention has the following beneficial effects:
Firstly, the on-chip memory of the FPGA is used as the data buffer for convolution calculation, the off-chip memory is used as the main storage device, and all convolution layers are coupled together through the off-chip memory.
The resource allocation for each layer's convolution calculation exploits the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution calculation structure, it uses fewer on-chip resources and achieves a higher forward reasoning speed.
On the FPGA chip there is no direct data interaction between layers; the layers are loosely coupled, which helps ensure the stability of the system.
The invention uses a simplified NMS to accelerate the calculation of the whole network: instead of computing overlap areas, it uses the distance between the center points of two frames, which greatly increases the speed of the NMS step.
Drawings
FIG. 1 is a schematic diagram of the computing structure and the storage structure of each layer of the present invention
FIG. 2 is a flow chart of single-layer network computation of the present invention
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
Example 1
A design method of an FPGA-based YOLO network forward reasoning accelerator, the accelerator comprising an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as the main storage device, and a ping-pong structure is used in the DRAM; the accelerator design method is characterized by comprising the following steps:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: when quantizing at a given decimal point position of the 8-bit fixed-point format, first establish the table of decimal values at that position, i.e. 256 values including positive zero and negative zero, then quantize the original data by rounding to the nearest entry. The quantized values change, but the data is still kept as 32-bit floating point so that it can later be computed on a GPU to obtain the detection accuracy of this quantization scheme. The 8 decimal point positions are then traversed to find the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) normalizing all input feature maps in the test set to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers; a sketch of this procedure is given below.
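As a concrete illustration of steps a) to d), the following is a minimal Python sketch of how one layer's weights could be swept over the eight candidate decimal point positions. The rounding scheme (two's-complement codes clipped to [-128, 127]) and the `evaluate_map` callback that runs the detection benchmark are assumptions for illustration; they are not specified by the patent.

```python
import numpy as np

def quantize_fixed8(data, frac_bits):
    """Round `data` to the nearest 8-bit fixed-point code with `frac_bits`
    fractional bits, then return it as float32 so it can still be run on a
    GPU to measure detection accuracy (step a))."""
    scale = 2.0 ** frac_bits
    codes = np.clip(np.round(data * scale), -128, 127)
    return (codes / scale).astype(np.float32)

def best_decimal_position(weights, evaluate_map):
    """Traverse the 8 candidate decimal point positions and keep the one whose
    quantized weights hurt detection accuracy the least.  `evaluate_map(w)` is
    a hypothetical callback that runs the validation set with weights `w` and
    returns an accuracy figure such as mAP."""
    best_bits, best_score = None, -np.inf
    for frac_bits in range(8):
        score = evaluate_map(quantize_fixed8(weights, frac_bits))
        if score > best_score:
            best_bits, best_score = frac_bits, score
    return best_bits
```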
Specifically, the calculation process of each layer of the convolutional network in step (2) is as follows: first, the weight data required for the current round of calculation is read from the DRAM (dynamic random access memory) and placed into the weight buffer BRAM; then the feature map (FM) data to be convolved in this layer is read, and once all input data is ready the convolution calculation starts; after one round of convolution calculation is finished, the data in the result buffer BRAM is uploaded to the DRAM, the temporary result data is cleared, and the next round of calculation begins. Because the calculation of the next layer depends on the result of the previous layer, a ping-pong structure is used in the DRAM so that all layers can compute at the same time instead of waiting for each other, exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between layers; the layers are loosely coupled, which helps ensure the stability of the system.
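A minimal software model of this per-layer flow is sketched below. The DRAM is modelled as a dictionary of numpy arrays, the convolution is a naive single-channel 3 × 3 stand-in for the PE array, and the buffer names (`w3`, `fm3_ping`, ...) are purely illustrative; only the read, compute, write-back ordering and the ping-pong bank swap reflect the scheme described above.

```python
import numpy as np

def conv3x3_same(fm, kernel):
    """Naive single-channel 3x3 'same' convolution, standing in for the PE array."""
    h, w = fm.shape
    padded = np.pad(fm, 1)
    out = np.zeros_like(fm)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def run_layer_round(dram, layer, round_idx):
    """One round of a layer: a) load weights into the weight buffer BRAM,
    b) load the feature-map tile, c) convolve and upload the result to the
    other ping-pong bank so the next layer can read it while this layer
    continues with the following round."""
    bank, other = ('ping', 'pong') if round_idx % 2 == 0 else ('pong', 'ping')
    weight_bram = dram[f'w{layer}']                    # step a)
    fm_bram = dram[f'fm{layer}_{bank}']                # step b)
    result_bram = conv3x3_same(fm_bram, weight_bram)   # step c)
    dram[f'fm{layer + 1}_{other}'] = result_bram       # upload, then clear temp data
```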
Specifically, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation. Because the BRAM resources on the FPGA chip are limited and the picture in this layer is large, only several consecutive rows of the picture are loaded at a time. By the principle of convolution, the convolution results of these rows are only temporary results for the corresponding region (the same rows) of a finally output channel; when the convolution at the same position is calculated after switching the input channel, it must be accumulated with the previous temporary result. Therefore, before this layer module performs the convolution calculation, the temporary convolution result at the same position of the corresponding output channel is first fetched back from the DDR, so that each time the convolution module produces a result it can be added to the value in the result memory BRAM and stored back into the result memory BRAM. Each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
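The row-tiled, channel-by-channel accumulation described above can be modelled as follows. The tile height, the use of scipy for the 3 × 3 convolution, and the shapes are illustrative assumptions; the point is that each row region of each input channel is convolved with every kernel, and the partial sums are accumulated into the temporary result held in DDR.

```python
import numpy as np
from scipy.signal import correlate2d

def first_layer_tiled(dram_image, kernels, rows_per_tile=16):
    """dram_image: (C, H, W) input picture held in DRAM, e.g. (3, 416, 416).
    kernels: (K, C, 3, 3) first-layer weights.  Returns the (K, H, W) output
    accumulated tile by tile, mirroring the fetch-accumulate-store loop above."""
    C, H, W = dram_image.shape
    K = kernels.shape[0]
    assert H % rows_per_tile == 0
    ddr_out = np.zeros((K, H, W), dtype=np.float32)          # temporary results in DDR
    padded = np.pad(dram_image, ((0, 0), (1, 1), (1, 1)))
    for r0 in range(0, H, rows_per_tile):                    # one row region at a time
        rows = slice(r0, r0 + rows_per_tile)
        for c in range(C):                                   # switch input channel
            tile = padded[c, r0:r0 + rows_per_tile + 2, :]   # rows loaded into BRAM
            for k in range(K):                               # all kernels before moving on
                partial = correlate2d(tile, kernels[k, c], mode='valid')
                ddr_out[k, rows, :] += partial               # accumulate with prior temp result
    return ddr_out
```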
Specifically, step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated. The specific process is as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM. In this way, once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time.
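A behavioral sketch of this fused 2 × 2 max-pooling and ReLU path is given below, assuming the convolution results arrive one row at a time and that the ReLU threshold is zero; the variable holding the previous row's pairwise maxima models the on-chip logic resource mentioned above.

```python
import numpy as np

def fused_pool_relu(conv_rows, relu_threshold=0.0):
    """conv_rows: iterable of 1-D numpy arrays, the convolution output rows of
    one channel in order (even length and even row count assumed).  Returns
    the pooled-and-activated rows that would be written to the result BRAM."""
    pooled_rows = []
    prev_pair_max = None                                    # held in on-chip registers
    for r, row in enumerate(conv_rows):
        pair_max = np.maximum(row[0::2], row[1::2])         # split in twos, keep max
        if r % 2 == 0:
            prev_pair_max = pair_max                        # first row of the 2x2 window
        else:
            window_max = np.maximum(prev_pair_max, pair_max)        # 2x2 region maximum
            pooled_rows.append(np.maximum(window_max, relu_threshold))  # ReLU, to BRAM
    return np.stack(pooled_rows)
```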
In step (2), the BRAM serving as a data buffer receives the data read from the DRAM. To exploit the maximum bandwidth of the DRAM, the write port of this BRAM is set to a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, which is used as the input width of the convolution operation. The BRAM of the result buffer not only reads data from the DRAM but also writes data to the DRAM, and is therefore set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
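One possible reading of the "7.5 RAMB36E1" figure is sketched below: on Xilinx 7-series devices a RAMB36E1 can be configured as 512 × 72 and a half block (RAMB18E1) as 512 × 36, so a 512-bit-wide, 512-entry buffer takes 15 half blocks, i.e. 7.5 RAMB36E1. This derivation is an assumption of the editor, not stated in the patent.

```python
import math

def ramb36_for_buffer(width_bits=512, depth=512):
    """Estimate the RAMB36E1 count for a wide BRAM buffer, assuming each
    RAMB36E1 offers a 512 x 72 configuration and each half block (RAMB18E1)
    a 512 x 36 configuration, so allocation happens in half-block steps."""
    assert depth <= 512, "one 512-deep block per bit slice is assumed"
    half_blocks = math.ceil(width_bits / 36)   # 36-bit half-block granularity
    return half_blocks / 2.0

print(ramb36_for_buffer())   # -> 7.5, matching the figure quoted above
```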
Specifically, the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the storage scheme of the feature map data in step (2) is as follows: for the input data buffers, layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for the output data buffers, one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1, and the whole feature map data buffer requires 337.5 RAMB36E1. Because BRAM resources are limited, the ping-pong operation is applied only to the output buffers, and each layer does not start its convolution calculation until the data in its input buffer is ready. The parallel channel counts of each layer, allocated in proportion to the multiply-accumulate workload of each layer, are shown in Table 1.
TABLE 1. Computation load ratio of each layer and the number of parallel channels (PEs) allocated to each layer

Layer           1     2     3     4     5     6     7     8     9
Ratio           1     2.5   2.5   2.5   2.5   2.5   10    20    1
Number of PEs   1     2     2     2     2     2     8     16    1
Specifically, the convolution part is followed by the region layer operation for position mapping. The output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames, and the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame. These four values must be processed before they can be mapped to actual picture positions: the relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and because the output result is represented as an 8-bit fixed-point number, the corresponding outputs can be precomputed as a lookup table to speed up the mapping; the relative width and height are mapped to absolute values through an e exponent, whose results are likewise obtained from a lookup table.
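A sketch of this lookup-table mapping follows. The number of fractional bits in the 8-bit output, the grid-cell offset and the anchor handling follow the standard YOLOv2 region formula and are assumptions here; the patent only states that the sigmoid and the e exponent are applied via lookup tables.

```python
import numpy as np

FRAC_BITS = 4                                   # assumed fixed-point format of the output
codes = np.arange(-128, 128, dtype=np.int16)
values = codes / float(1 << FRAC_BITS)          # decode the 256 possible 8-bit codes
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-values))     # for the x, y offsets
EXP_LUT = np.exp(values)                        # for the w, h scales

def map_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, grid=13):
    """Map one candidate frame from the network's relative outputs (8-bit codes
    tx, ty, tw, th) to a normalized absolute position using the lookup tables."""
    bx = (col + SIGMOID_LUT[tx + 128]) / grid
    by = (row + SIGMOID_LUT[ty + 128]) / grid
    bw = anchor_w * EXP_LUT[tw + 128] / grid
    bh = anchor_h * EXP_LUT[th + 128] / grid
    return bx, by, bw, bh
```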
The output candidate frames of the convolutional network carry confidence information used for the NMS operation; the specific calculation steps are as follows. First, the center point coordinates of each candidate frame are extracted in turn, and a flag bit is set for each candidate frame to indicate whether it is retained. Because the center point distance is used as the criterion and, according to prior information, frames close to each other in the output order of the network are the relevant comparison objects, comparisons between frames far apart in order are ignored. Then the first candidate frame is selected as the comparison object, and the center point distance to each compared candidate frame after it is calculated; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that it should be retained, otherwise its flag bit becomes invalid and it no longer participates in subsequent distance comparisons; when the compared object has traversed to the last frame of the queue, the comparison object is replaced by the next candidate frame after it whose flag bit is still valid. Finally, all candidate frames with valid flag bits are extracted from the result memory, and marked frames are generated and drawn on the original image as the final detection result.
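The simplified NMS can be modelled as below. The box ordering and the distance threshold are assumptions; the control flow (flag bits, replacing the comparison object with the next still-valid frame) follows the description above.

```python
import numpy as np

def center_distance_nms(centers, dist_thresh):
    """centers: (N, 2) array of candidate-frame center points in the order the
    network emits them.  A compared frame whose center lies within
    `dist_thresh` of the current comparison object has its flag bit cleared;
    frames farther away keep their flag bit and are retained."""
    n = len(centers)
    keep = np.ones(n, dtype=bool)              # one flag bit per candidate frame
    i = 0                                      # index of the comparison object
    while i < n:
        for j in range(i + 1, n):              # compared objects after it
            if keep[j]:
                d2 = np.sum((centers[j] - centers[i]) ** 2)
                if d2 <= dist_thresh ** 2:
                    keep[j] = False            # too close: flag bit becomes invalid
        # the next comparison object is the next frame whose flag bit is still valid
        i += 1
        while i < n and not keep[i]:
            i += 1
    return np.flatnonzero(keep)                # indices of frames to draw on the image
```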
The foregoing embodiments and description have been provided merely to illustrate the principles of the invention and various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A design method of an FPGA-based YOLO network forward reasoning accelerator, the accelerator comprising an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as the main storage device, and a ping-pong structure is used in the DRAM; the accelerator design method is characterized by comprising the following steps:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network; the calculation process of each layer of the convolutional network in step (2) is as follows:
a) reading weight data required by the calculation of the current round from the DRAM, and placing the weight data into the BRAM;
b) reading feature map data (FM) to be convolved in the layer to complete all input data preparation;
c) performing convolution calculation, uploading the data in the BRAM to the DRAM after one round of convolution calculation is finished, emptying temporary result data, and then starting the next round of calculation;
step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated, the specific process being as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM, so that once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time;
(3) position mapping.
2. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the quantization process of a certain layer in the step (1) is as follows:
a) quantizing the weight data of the original network: for a given decimal point position of the 8-bit fixed-point format, establishing the 256 decimal values it can represent, including positive zero and negative zero, quantizing the original data by rounding each value to the nearest entry, representing the quantized values as 32-bit floating point for convenient calculation, obtaining the detection accuracy of this quantization scheme, traversing the 8 candidate decimal point positions to obtain the decimal point position with the least impact on detection accuracy, and finally forming the weight quantization scheme of this layer;
b) normalizing the input feature map to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers.
3. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation, the convolution result obtained is accumulated into the convolution calculation after the input channel is switched, and each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
4. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the BRAM in steps (2) a) and b) is set to a 512-bit data width and a depth of 512 entries, one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits; the BRAM in step c) is set to true dual-port mode with a port width of 16 bits; the data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
5. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
6. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the storage scheme of the feature map data in step (2) is as follows: for the buffers of a) and b), layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses one BRAM; for c), one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1.
7. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames, and the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame; the relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and the relative width and height are mapped to absolute values through an e exponent.
8. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the output candidate frames of the convolutional network carry confidence information used for the NMS operation, and the specific calculation steps are as follows:
a) sequentially extracting the center point coordinates of each candidate frame, and setting a flag bit for each candidate frame to indicate whether the candidate frame is retained;
b) selecting the first candidate frame as the comparison object and calculating the center point distance to each compared candidate frame after it; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that the candidate frame needs to be retained, otherwise the flag bit becomes invalid and the frame does not participate in subsequent distance comparisons; when the compared object traverses to the last frame of the queue, the comparison object is replaced, namely by the next candidate frame after the previous comparison object whose flag bit is still valid;
c) extracting all candidate frames with valid flag bits from the result memory, and generating marked frames drawn on the original image as the final detection result.
CN201810970836.2A 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method Active CN109214504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Publications (2)

Publication Number Publication Date
CN109214504A CN109214504A (en) 2019-01-15
CN109214504B true CN109214504B (en) 2020-09-04

Family

ID=64989693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810970836.2A Active CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Country Status (1)

Country Link
CN (1) CN109214504B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110033086B (en) * 2019-04-15 2022-03-22 广州异构智能科技有限公司 Hardware accelerator for neural network convolution operations
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN112052935B (en) * 2019-06-06 2024-06-14 奇景光电股份有限公司 Convolutional neural network system
CN112085191B (en) * 2019-06-12 2024-04-02 上海寒武纪信息科技有限公司 Method for determining quantization parameter of neural network and related product
CN110555516B (en) * 2019-08-27 2023-10-27 合肥辉羲智能科技有限公司 Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
WO2021102946A1 (en) * 2019-11-29 2021-06-03 深圳市大疆创新科技有限公司 Computing apparatus and method, processor, and movable device
CN113297128B (en) * 2020-02-24 2023-10-31 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN111752713B (en) 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111814675B (en) * 2020-07-08 2023-09-29 上海雪湖科技有限公司 Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN113065303B (en) * 2021-03-06 2024-02-02 杭州电子科技大学 DSCNN accelerator layering verification method based on FPGA
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN116737382B (en) * 2023-06-20 2024-01-02 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for accelerating convolutional neural network inference and its working method
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540588B2 (en) * 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for accelerating convolutional neural network inference and its working method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hardware Implementation and Optimization of Tiny-YOLO Network; Jing Ma et al.; International Forum on Digital TV and Wireless Multimedia Communications; 2018-02-03; sections 1, 2, 3.1 and 6.2, figures 1, 2 and 6, table 3 *
Improving the speed of neural networks on CPUs; Vincent Vanhoucke et al.; Deep Learning and Unsupervised Feature Learning Workshop; 2011-12-31; 1-8 *
Research on parallel architecture of convolutional neural networks based on FPGA; 陆志坚; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2014-04-15; vol. 2014, no. 4; I140-12 *

Also Published As

Publication number Publication date
CN109214504A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109214504B (en) FPGA-based YOLO network forward reasoning accelerator design method
US11907760B2 (en) Systems and methods of memory allocation for neural networks
US11580367B2 (en) Method and system for processing neural network
US20190236049A1 (en) Performing concurrent operations in a processing element
CN107203807B (en) On-chip cache bandwidth balancing method, system and device of neural network accelerator
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN111199273A (en) Convolution calculation method, device, equipment and storage medium
CN110175670B (en) Method and system for realizing YOLOv2 detection network based on FPGA
CN105739951B (en) A kind of L1 minimization problem fast solution methods based on GPU
CN113361695B (en) Convolutional neural network accelerator
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111768458A (en) Sparse image processing method based on convolutional neural network
WO2021147276A1 (en) Data processing method and apparatus, and chip, electronic device and storage medium
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
CN115668222A (en) Data processing method and device of neural network
CN111767243A (en) Data processing method, related device and computer readable medium
CN115394336A (en) Storage and computation FPGA (field programmable Gate array) framework
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115982418B (en) Method for improving super-division operation performance of AI (advanced technology attachment) computing chip
CN110490312B (en) Pooling calculation method and circuit
CN112200310A (en) Intelligent processor, data processing method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant