CN109214504B - FPGA-based YOLO network forward reasoning accelerator design method

Info

Publication number
CN109214504B
Authority
CN
China
Prior art keywords
layer
convolution
bram
network
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810970836.2A
Other languages
Chinese (zh)
Other versions
CN109214504A (en)
Inventor
张轶凡
陈昊
应山川
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Original Assignee
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications filed Critical Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority to CN201810970836.2A
Publication of CN109214504A
Application granted
Publication of CN109214504B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a design method for an FPGA-based YOLO network forward reasoning accelerator. The accelerator comprises an FPGA chip and a DRAM; the memory BRAM inside the FPGA chip is used as a data buffer and the DRAM is used as the main storage device. The accelerator design method comprises the following steps: (1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer; (2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network; (3) position mapping. The method addresses the technical problem that storage resources on FPGA chips grow more slowly than neural network sizes, which makes it difficult to port a general object detection network to an FPGA chip with the traditional design approach, and achieves higher speed with fewer on-chip resources.

Description

FPGA-based YOLO network forward reasoning accelerator design method
Technical Field
The invention relates to the technical field of deep learning and hardware architecture design, and in particular to a design method for accelerating the forward reasoning of an object detection network on an FPGA (field programmable gate array).
Background
In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied to computer vision tasks. However, large-scale CNNs are computation-intensive, storage-intensive and resource-hungry, which poses great challenges for these tasks. Facing such high computational pressure and large data throughput, conventional general-purpose processors struggle to meet practical requirements, so hardware accelerators based on GPUs, FPGAs or ASICs have been proposed and widely deployed.
The FPGA (Field Programmable Gate Array) is a further development of programmable devices such as PAL, GAL and EPLD. As a semi-custom circuit in the ASIC field, it both avoids the drawbacks of fully custom circuits and overcomes the limited gate count of earlier programmable devices. The FPGA adopts the logic cell array (LCA) concept and internally comprises configurable logic blocks (CLB), input/output blocks (IOB) and interconnect, and a single PROM can program multiple FPGAs. Thanks to its flexible reconfigurability and excellent performance-to-power ratio, the FPGA has become an important deep learning processor today.
The mainstream object detection network currently suited to hardware implementation is YOLO (You Only Look Once). The network is fast and structurally simple: it treats object detection as a regression problem, directly predicting the positions and class probabilities of target boxes from an input image with a convolutional neural network, achieving end-to-end detection, and this structure is well suited to hardware implementation on an FPGA. Patent CN107392309A discloses a general fixed-point neural network convolution accelerator hardware architecture based on an FPGA, comprising a general AXI4 high-speed bus interface, a highly parallel convolution kernel and feature map data cache, a segmented convolution result cache, a convolution calculator, a cache controller, a state controller and a direct memory access controller. That design uses on-chip storage as a buffer and off-chip memory as the main data storage, and relies on a general-purpose processor outside the chip to manage the memory and complete the computation of the whole convolutional network. Patent CN107463990A provides an FPGA parallel acceleration method for a convolutional neural network comprising the following steps: (1) establishing a CNN model; (2) configuring a hardware architecture; (3) configuring a convolution operation unit. That design loads the temporary computation results of the whole network into on-chip storage, so the scale of the network that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store all intermediate results of the network layers in on-chip static memory and the weights required by the network in off-chip dynamic memory, so the on-chip memory capacity limits the scale of the network that can be accelerated. As the requirements for task complexity and accuracy grow, convolutional neural networks keep getting larger and their total parameter counts keep increasing, while FPGA process technology and the storage resources that fit on chip do not grow as quickly; with this design approach, the FPGA can no longer accommodate networks of such scale.
If the on-chip static memory BRAM is instead used as a data buffer and the off-chip dynamic memory DRAM as the main data storage of the network, the huge capacity of the dynamic memory can accommodate networks with large parameter counts, and parallel computation of the convolution modules is achieved by allocating the memory bandwidth appropriately. The performance of this approach depends on memory bandwidth, but raising communication bandwidth is easier than stacking on-chip storage resources. The network used by the invention is a version of YOLO-tiny: the input size is 416 × 416 × 3, the network has 9 convolution layers, the final output is a set of candidate frames carrying class, position and confidence information, and the computation result is mapped back onto the original image by a region mapping (region operation) algorithm.
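For reference, the following is a minimal sketch of the standard YOLOv2-tiny (VOC) topology that matches this description (nine convolution layers, 416 × 416 × 3 input, 13 × 13 output grid with 5 anchor boxes). The filter counts and pooling placement are taken from the publicly available tiny-yolo-voc configuration and are an assumption here; the patent itself does not list them.

```python
# Assumed YOLOv2-tiny (VOC) layer list; each entry is
# (kernel_size, output_channels, followed_by_2x2_max_pool).
YOLO_TINY_LAYERS = [
    (3,   16, True),   # layer 1: 416 -> 208 after pooling
    (3,   32, True),   # layer 2: 208 -> 104
    (3,   64, True),   # layer 3: 104 -> 52
    (3,  128, True),   # layer 4:  52 -> 26
    (3,  256, True),   # layer 5:  26 -> 13
    (3,  512, True),   # layer 6: pool has stride 1, resolution stays 13
    (3, 1024, False),  # layer 7
    (3, 1024, False),  # layer 8
    (1,  125, False),  # layer 9: 5 anchors x (4 box coords + confidence + 20 classes)
]
```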
Disclosure of Invention
To solve the technical problems that storage resources on FPGA chips grow more slowly than neural network sizes and that a general object detection network is hard to port to an FPGA chip following the traditional design approach, the invention provides an FPGA-based YOLO network forward reasoning accelerator targeting the YOLO-tiny network and the KC705 development platform. The specific technical scheme is as follows:
a design method of a FPGA-based YOLO network forward reasoning accelerator comprises an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as a main storage device, and a ping-pong structure is used in the DRAM; the design method of the accelerator is characterized by comprising the following steps of:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: for a given decimal point position of the 8-bit fixed-point format, establish the 256 decimal values it can represent, including positive zero and negative zero; quantize the original data by rounding each value to the nearest entry; represent the quantized values as 32-bit floating point for convenient calculation and obtain the detection accuracy of this quantization scheme; traverse the 8 candidate decimal point positions to find the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) normalizing the input feature map to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers.
Specifically, the calculation process of each layer of convolutional network in the step (2) is as follows:
a) reading weight data required by the calculation of the current round from the DRAM, and placing the weight data into the BRAM;
b) reading feature map data (FM) to be convolved in the layer to complete all input data preparation;
c) performing the convolution calculation; after one round of convolution calculation is completed, uploading the data in the BRAM to the DRAM, emptying the temporary result data, and then starting the next round of calculation.
Specifically, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation; the convolution result obtained is accumulated into the convolution calculation after the input channel is switched, and each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
Specifically, step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated. The specific process is as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM. In this way, once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time.
The BRAM in steps (2) a) and b) is set to a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. The BRAM in step c) is set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
Specifically, the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the storage scheme of the feature map data in step (2) is as follows: for the buffers of a) and b), layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for c), one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames; the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame. The relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and the relative width and height are mapped to absolute values through an e exponent.
The output candidate frames of the convolutional network carry confidence information used for the NMS operation, and the specific calculation steps are as follows:
a) sequentially extracting the center point coordinates of each candidate frame, and setting a flag bit for each candidate frame to indicate whether the candidate frame is retained;
b) selecting the first candidate frame as the comparison object and calculating the center point distance to each compared candidate frame after it; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that the candidate frame needs to be retained, otherwise the flag bit becomes invalid and the frame does not participate in subsequent distance comparisons; when the compared object traverses to the last frame of the queue, the comparison object is replaced, namely by the next candidate frame after the previous comparison object whose flag bit is still valid;
c) extracting all candidate frames with valid flag bits from the result memory, and generating marked frames drawn on the original image as the final detection result.
The invention has the following beneficial effects:
Firstly, the on-chip memory of the FPGA is used as the data buffer for convolution calculation, the off-chip memory is used as the main storage device, and all convolution layers are coupled together through the off-chip memory.
The resource allocation for each layer's convolution calculation exploits the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution calculation structure, it uses fewer on-chip resources and achieves a higher forward reasoning speed.
On the FPGA chip there is no direct data interaction between layers; the layers are loosely coupled, which helps ensure the stability of the system.
The invention uses a simplified NMS to accelerate the calculation of the whole network: instead of computing overlap areas, it uses the distance between the center points of two frames, which greatly increases the speed of the NMS step.
Drawings
FIG. 1 is a schematic diagram of the computing structure and the storage structure of each layer of the present invention
FIG. 2 is a flow chart of single-layer network computation of the present invention
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
Example 1
A design method of an FPGA-based YOLO network forward reasoning accelerator, the accelerator comprising an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as the main storage device, and a ping-pong structure is used in the DRAM; the accelerator design method is characterized by comprising the following steps:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network;
(3) position mapping.
Specifically, the quantization process of a certain layer in step (1) is as follows:
a) quantizing the weight data of the original network: when quantizing at a given decimal point position of the 8-bit fixed-point format, first establish the table of decimal values at that position, i.e. 256 values including positive zero and negative zero, then quantize the original data by rounding to the nearest entry. The quantized values change, but the data is still kept as 32-bit floating point so that it can later be computed on a GPU to obtain the detection accuracy of this quantization scheme. The 8 decimal point positions are then traversed to find the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) normalizing all input feature maps in the test set to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers; a sketch of this procedure is given below.
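As a concrete illustration of steps a) to d), the following is a minimal Python sketch of how one layer's weights could be swept over the eight candidate decimal point positions. The rounding scheme (two's-complement codes clipped to [-128, 127]) and the `evaluate_map` callback that runs the detection benchmark are assumptions for illustration; they are not specified by the patent.

```python
import numpy as np

def quantize_fixed8(data, frac_bits):
    """Round `data` to the nearest 8-bit fixed-point code with `frac_bits`
    fractional bits, then return it as float32 so it can still be run on a
    GPU to measure detection accuracy (step a))."""
    scale = 2.0 ** frac_bits
    codes = np.clip(np.round(data * scale), -128, 127)
    return (codes / scale).astype(np.float32)

def best_decimal_position(weights, evaluate_map):
    """Traverse the 8 candidate decimal point positions and keep the one whose
    quantized weights hurt detection accuracy the least.  `evaluate_map(w)` is
    a hypothetical callback that runs the validation set with weights `w` and
    returns an accuracy figure such as mAP."""
    best_bits, best_score = None, -np.inf
    for frac_bits in range(8):
        score = evaluate_map(quantize_fixed8(weights, frac_bits))
        if score > best_score:
            best_bits, best_score = frac_bits, score
    return best_bits
```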
Specifically, the calculation process of each layer of the convolutional network in step (2) is as follows: first, the weight data required for the current round of calculation is read from the DRAM (dynamic random access memory) and placed into the weight buffer BRAM; then the feature map (FM) data to be convolved in this layer is read, and once all input data is ready the convolution calculation starts; after one round of convolution calculation is finished, the data in the result buffer BRAM is uploaded to the DRAM, the temporary result data is cleared, and the next round of calculation begins. Because the calculation of the next layer depends on the result of the previous layer, a ping-pong structure is used in the DRAM so that all layers can compute at the same time instead of waiting for each other, exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between layers; the layers are loosely coupled, which helps ensure the stability of the system.
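A minimal software model of this per-layer flow is sketched below. The DRAM is modelled as a dictionary of numpy arrays, the convolution is a naive single-channel 3 × 3 stand-in for the PE array, and the buffer names (`w3`, `fm3_ping`, ...) are purely illustrative; only the read, compute, write-back ordering and the ping-pong bank swap reflect the scheme described above.

```python
import numpy as np

def conv3x3_same(fm, kernel):
    """Naive single-channel 3x3 'same' convolution, standing in for the PE array."""
    h, w = fm.shape
    padded = np.pad(fm, 1)
    out = np.zeros_like(fm)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def run_layer_round(dram, layer, round_idx):
    """One round of a layer: a) load weights into the weight buffer BRAM,
    b) load the feature-map tile, c) convolve and upload the result to the
    other ping-pong bank so the next layer can read it while this layer
    continues with the following round."""
    bank, other = ('ping', 'pong') if round_idx % 2 == 0 else ('pong', 'ping')
    weight_bram = dram[f'w{layer}']                    # step a)
    fm_bram = dram[f'fm{layer}_{bank}']                # step b)
    result_bram = conv3x3_same(fm_bram, weight_bram)   # step c)
    dram[f'fm{layer + 1}_{other}'] = result_bram       # upload, then clear temp data
```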
Specifically, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation. Because the BRAM resources on the FPGA chip are limited and the picture in this layer is large, only several consecutive rows of the picture are loaded at a time. By the principle of convolution, the convolution results of these rows are only temporary results for the corresponding region (the same rows) of a finally output channel; when the convolution at the same position is calculated after switching the input channel, it must be accumulated with the previous temporary result. Therefore, before this layer module performs the convolution calculation, the temporary convolution result at the same position of the corresponding output channel is first fetched back from the DDR, so that each time the convolution module produces a result it can be added to the value in the result memory BRAM and stored back into the result memory BRAM. Each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
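The row-tiled, channel-by-channel accumulation described above can be modelled as follows. The tile height, the use of scipy for the 3 × 3 convolution, and the shapes are illustrative assumptions; the point is that each row region of each input channel is convolved with every kernel, and the partial sums are accumulated into the temporary result held in DDR.

```python
import numpy as np
from scipy.signal import correlate2d

def first_layer_tiled(dram_image, kernels, rows_per_tile=16):
    """dram_image: (C, H, W) input picture held in DRAM, e.g. (3, 416, 416).
    kernels: (K, C, 3, 3) first-layer weights.  Returns the (K, H, W) output
    accumulated tile by tile, mirroring the fetch-accumulate-store loop above."""
    C, H, W = dram_image.shape
    K = kernels.shape[0]
    assert H % rows_per_tile == 0
    ddr_out = np.zeros((K, H, W), dtype=np.float32)          # temporary results in DDR
    padded = np.pad(dram_image, ((0, 0), (1, 1), (1, 1)))
    for r0 in range(0, H, rows_per_tile):                    # one row region at a time
        rows = slice(r0, r0 + rows_per_tile)
        for c in range(C):                                   # switch input channel
            tile = padded[c, r0:r0 + rows_per_tile + 2, :]   # rows loaded into BRAM
            for k in range(K):                               # all kernels before moving on
                partial = correlate2d(tile, kernels[k, c], mode='valid')
                ddr_out[k, rows, :] += partial               # accumulate with prior temp result
    return ddr_out
```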
Specifically, step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated. The specific process is as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM. In this way, once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time.
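A behavioral sketch of this fused 2 × 2 max-pooling and ReLU path is given below, assuming the convolution results arrive one row at a time and that the ReLU threshold is zero; the variable holding the previous row's pairwise maxima models the on-chip logic resource mentioned above.

```python
import numpy as np

def fused_pool_relu(conv_rows, relu_threshold=0.0):
    """conv_rows: iterable of 1-D numpy arrays, the convolution output rows of
    one channel in order (even length and even row count assumed).  Returns
    the pooled-and-activated rows that would be written to the result BRAM."""
    pooled_rows = []
    prev_pair_max = None                                    # held in on-chip registers
    for r, row in enumerate(conv_rows):
        pair_max = np.maximum(row[0::2], row[1::2])         # split in twos, keep max
        if r % 2 == 0:
            prev_pair_max = pair_max                        # first row of the 2x2 window
        else:
            window_max = np.maximum(prev_pair_max, pair_max)        # 2x2 region maximum
            pooled_rows.append(np.maximum(window_max, relu_threshold))  # ReLU, to BRAM
    return np.stack(pooled_rows)
```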
In step (2), the BRAM serving as a data buffer receives the data read from the DRAM. To exploit the maximum bandwidth of the DRAM, the write port of this BRAM is set to a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, which is used as the input width of the convolution operation. The BRAM of the result buffer not only reads data from the DRAM but also writes data to the DRAM, and is therefore set to true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
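One possible reading of the "7.5 RAMB36E1" figure is sketched below: on Xilinx 7-series devices a RAMB36E1 can be configured as 512 × 72 and a half block (RAMB18E1) as 512 × 36, so a 512-bit-wide, 512-entry buffer takes 15 half blocks, i.e. 7.5 RAMB36E1. This derivation is an assumption of the editor, not stated in the patent.

```python
import math

def ramb36_for_buffer(width_bits=512, depth=512):
    """Estimate the RAMB36E1 count for a wide BRAM buffer, assuming each
    RAMB36E1 offers a 512 x 72 configuration and each half block (RAMB18E1)
    a 512 x 36 configuration, so allocation happens in half-block steps."""
    assert depth <= 512, "one 512-deep block per bit slice is assumed"
    half_blocks = math.ceil(width_bits / 36)   # 36-bit half-block granularity
    return half_blocks / 2.0

print(ramb36_for_buffer())   # -> 7.5, matching the figure quoted above
```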
Specifically, the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the storage scheme of the feature map data in step (2) is as follows: for the input data buffers, layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for the output data buffers, one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1, and the whole feature map data buffer requires 337.5 RAMB36E1. Because BRAM resources are limited, the ping-pong operation is applied only to the output buffers, and each layer does not start its convolution calculation until the data in its input buffer is ready. The parallel channel counts of each layer, allocated in proportion to the multiply-accumulate workload of each layer, are shown in Table 1.
TABLE 1. Computation load ratio of each layer and the number of parallel channels (PEs) allocated to each layer

Layer           1     2     3     4     5     6     7     8     9
Ratio           1     2.5   2.5   2.5   2.5   2.5   10    20    1
Number of PEs   1     2     2     2     2     2     8     16    1
Specifically, the convolution part is followed by the region layer operation for position mapping. The output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames, and the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame. These four values must be processed before they can be mapped to actual picture positions: the relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and because the output result is represented as an 8-bit fixed-point number, the corresponding outputs can be precomputed as a lookup table to speed up the mapping; the relative width and height are mapped to absolute values through an e exponent, whose results are likewise obtained from a lookup table.
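A sketch of this lookup-table mapping follows. The number of fractional bits in the 8-bit output, the grid-cell offset and the anchor handling follow the standard YOLOv2 region formula and are assumptions here; the patent only states that the sigmoid and the e exponent are applied via lookup tables.

```python
import numpy as np

FRAC_BITS = 4                                   # assumed fixed-point format of the output
codes = np.arange(-128, 128, dtype=np.int16)
values = codes / float(1 << FRAC_BITS)          # decode the 256 possible 8-bit codes
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-values))     # for the x, y offsets
EXP_LUT = np.exp(values)                        # for the w, h scales

def map_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, grid=13):
    """Map one candidate frame from the network's relative outputs (8-bit codes
    tx, ty, tw, th) to a normalized absolute position using the lookup tables."""
    bx = (col + SIGMOID_LUT[tx + 128]) / grid
    by = (row + SIGMOID_LUT[ty + 128]) / grid
    bw = anchor_w * EXP_LUT[tw + 128] / grid
    bh = anchor_h * EXP_LUT[th + 128] / grid
    return bx, by, bw, bh
```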
The output candidate frames of the convolutional network carry confidence information used for the NMS operation; the specific calculation steps are as follows. First, the center point coordinates of each candidate frame are extracted in turn, and a flag bit is set for each candidate frame to indicate whether it is retained. Because the center point distance is used as the criterion and, according to prior information, frames close to each other in the output order of the network are the relevant comparison objects, comparisons between frames far apart in order are ignored. Then the first candidate frame is selected as the comparison object, and the center point distance to each compared candidate frame after it is calculated; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that it should be retained, otherwise its flag bit becomes invalid and it no longer participates in subsequent distance comparisons; when the compared object has traversed to the last frame of the queue, the comparison object is replaced by the next candidate frame after it whose flag bit is still valid. Finally, all candidate frames with valid flag bits are extracted from the result memory, and marked frames are generated and drawn on the original image as the final detection result.
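The simplified NMS can be modelled as below. The box ordering and the distance threshold are assumptions; the control flow (flag bits, replacing the comparison object with the next still-valid frame) follows the description above.

```python
import numpy as np

def center_distance_nms(centers, dist_thresh):
    """centers: (N, 2) array of candidate-frame center points in the order the
    network emits them.  A compared frame whose center lies within
    `dist_thresh` of the current comparison object has its flag bit cleared;
    frames farther away keep their flag bit and are retained."""
    n = len(centers)
    keep = np.ones(n, dtype=bool)              # one flag bit per candidate frame
    i = 0                                      # index of the comparison object
    while i < n:
        for j in range(i + 1, n):              # compared objects after it
            if keep[j]:
                d2 = np.sum((centers[j] - centers[i]) ** 2)
                if d2 <= dist_thresh ** 2:
                    keep[j] = False            # too close: flag bit becomes invalid
        # the next comparison object is the next frame whose flag bit is still valid
        i += 1
        while i < n and not keep[i]:
            i += 1
    return np.flatnonzero(keep)                # indices of frames to draw on the image
```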
The foregoing embodiments and description have been provided merely to illustrate the principles of the invention and various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A design method of an FPGA-based YOLO network forward reasoning accelerator, the accelerator comprising an FPGA chip and a DRAM, wherein a memory BRAM in the FPGA chip is used as a data buffer, the DRAM is used as the main storage device, and a ping-pong structure is used in the DRAM; the accelerator design method is characterized by comprising the following steps:
(1) quantizing the original network data to 8-bit fixed-point numbers to obtain the decimal point position with the least impact on detection accuracy and form a quantization scheme, the quantization being carried out layer by layer;
(2) the FPGA chip performs parallel computation on the nine convolution layers of the YOLO network; the calculation process of each layer of the convolutional network in step (2) is as follows:
a) reading weight data required by the calculation of the current round from the DRAM, and placing the weight data into the BRAM;
b) reading feature map data (FM) to be convolved in the layer to complete all input data preparation;
c) performing convolution calculation, uploading the data in the BRAM to the DRAM after one round of convolution calculation is finished, emptying temporary result data, and then starting the next round of calculation;
step (2) further comprises performing the pooling operation and the activation operation while the final result of an output channel is being calculated, the specific process being as follows: as the convolution results of a row are produced one by one, the row is split into pairs and the maximum of each pair is recorded in on-chip logic resources; when the next row is calculated, its outputs are likewise split into pairs, the larger value of each pair is taken and compared with the maximum selected from the previous row, and the larger of these two maxima is taken as the maximum of the corresponding 2 × 2 region; this maximum is then compared with the threshold of the RELU activation function and the result is stored in the BRAM, so that once the convolution of the final result of an output channel is finished, the pooling and activation operations of that channel are completed at the same time;
(3) position mapping.
2. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the quantization process of a certain layer in the step (1) is as follows:
a) quantizing the weight data of the original network: for a given decimal point position of the 8-bit fixed-point format, establishing the 256 decimal values it can represent, including positive zero and negative zero, quantizing the original data by rounding each value to the nearest entry, representing the quantized values as 32-bit floating point for convenient calculation, obtaining the detection accuracy of this quantization scheme, traversing the 8 candidate decimal point positions to obtain the decimal point position with the least impact on detection accuracy, and finally forming the weight quantization scheme of this layer;
b) normalizing the input feature map to a 0-1 distribution, and then quantizing the input feature map of this layer with the method of step a);
c) taking the feature map quantized in step b) as input, performing forward propagation of all pictures through this convolution layer only, loading the parameters as the quantized 32-bit values, and taking the resulting output as the input of the next layer of the network;
d) quantizing the weights and the input feature map of each layer alternately according to steps a) to c), finally obtaining the quantization schemes of all layers.
3. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein, when the first layer of convolution is performed in step (2), one of the three channels of the input feature map is loaded from the DRAM for convolution calculation, the convolution result obtained is accumulated into the convolution calculation after the input channel is switched, and each loaded portion of the input feature map must be calculated with all convolution kernels before switching to the next input feature region.
4. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the BRAM in steps (2) a) and b) is set to a 512-bit data width and a depth of 512 entries, one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits; the BRAM in step c) is set to true dual-port mode with a port width of 16 bits; the data storage overhead of the entire convolutional network consists of two parts, feature maps and weights, totaling 425 RAMB36E1.
5. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the storage scheme of the weight data in step (2) is as follows: convolution layers 1 to 3 share one BRAM, consuming 7.5 RAMB36E1; layers 4 to 8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
6. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the storage scheme of the feature map data in step (2) is as follows: for the buffers of a) and b), layer 1 of the convolutional network uses one BRAM, layers 2 to 6 use two BRAMs each, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses one BRAM; for c), one BRAM is used per layer; each BRAM consumes 7.5 RAMB36E1.
7. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the output of the convolutional network contains the position information of 13 × 13 × 5 candidate frames, and the position information of each candidate frame consists of x, y, w and h values, representing respectively the relative abscissa, relative ordinate, relative width and relative height of the center point of the candidate frame; the relative abscissa and ordinate are mapped to absolute coordinates through a sigmoid function, and the relative width and height are mapped to absolute values through an e exponent.
8. The design method of the FPGA-based YOLO network forward reasoning accelerator as claimed in claim 1, wherein the output candidate frames of the convolutional network carry confidence information used for the NMS operation, and the specific calculation steps are as follows:
a) sequentially extracting the center point coordinates of each candidate frame, and setting a flag bit for each candidate frame to indicate whether the candidate frame is retained;
b) selecting the first candidate frame as the comparison object and calculating the center point distance to each compared candidate frame after it; when the distance exceeds a threshold, the flag bit of the compared candidate frame stays valid, indicating that the candidate frame needs to be retained, otherwise the flag bit becomes invalid and the frame does not participate in subsequent distance comparisons; when the compared object traverses to the last frame of the queue, the comparison object is replaced, namely by the next candidate frame after the previous comparison object whose flag bit is still valid;
c) extracting all candidate frames with valid flag bits from the result memory, and generating marked frames drawn on the original image as the final detection result.
CN201810970836.2A 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method Active CN109214504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Publications (2)

Publication Number Publication Date
CN109214504A CN109214504A (en) 2019-01-15
CN109214504B true CN109214504B (en) 2020-09-04

Family

ID=64989693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810970836.2A Active CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Country Status (1)

Country Link
CN (1) CN109214504B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110033086B (en) * 2019-04-15 2022-03-22 广州异构智能科技有限公司 Hardware accelerator for neural network convolution operations
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN112052935B (en) * 2019-06-06 2024-06-14 奇景光电股份有限公司 Convolutional neural network system
CN112085191B (en) * 2019-06-12 2024-04-02 上海寒武纪信息科技有限公司 Method for determining quantization parameter of neural network and related product
CN110555516B (en) * 2019-08-27 2023-10-27 合肥辉羲智能科技有限公司 Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
WO2021102946A1 (en) * 2019-11-29 2021-06-03 深圳市大疆创新科技有限公司 Computing apparatus and method, processor, and movable device
CN113297128B (en) * 2020-02-24 2023-10-31 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN111752713B (en) 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111814675B (en) * 2020-07-08 2023-09-29 上海雪湖科技有限公司 Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN113065303B (en) * 2021-03-06 2024-02-02 杭州电子科技大学 DSCNN accelerator layering verification method based on FPGA
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN116737382B (en) * 2023-06-20 2024-01-02 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for accelerating convolutional neural network inference and its working method
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10540588B2 (en) * 2015-06-29 2020-01-21 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 A kind of convolutional neural networks reasoning accelerator and method
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for accelerating convolutional neural network inference and its working method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Hardware Implementation and Optimization of Tiny-YOLO Network; Jing Ma et al.; International Forum on Digital TV and Wireless Multimedia Communications; 2018-02-03; sections 1, 2, 3.1 and 6.2, figures 1, 2 and 6, table 3 *
Improving the speed of neural networks on CPUs; Vincent Vanhoucke et al.; Deep Learning and Unsupervised Feature Learning Workshop; 2011-12-31; 1-8 *
Research on parallel architecture of convolutional neural networks based on FPGA; 陆志坚; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2014-04-15; vol. 2014, no. 4; I140-12 *

Also Published As

Publication number Publication date
CN109214504A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109214504B (en) FPGA-based YOLO network forward reasoning accelerator design method
US11907760B2 (en) Systems and methods of memory allocation for neural networks
US11580367B2 (en) Method and system for processing neural network
US20190236049A1 (en) Performing concurrent operations in a processing element
CN107203807B (en) On-chip cache bandwidth balancing method, system and device of neural network accelerator
CN108229671B (en) System and method for reducing storage bandwidth requirement of external data of accelerator
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN111199273A (en) Convolution calculation method, device, equipment and storage medium
CN110175670B (en) Method and system for realizing YOLOv2 detection network based on FPGA
CN105739951B (en) A kind of L1 minimization problem fast solution methods based on GPU
CN113361695B (en) Convolutional neural network accelerator
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111768458A (en) Sparse image processing method based on convolutional neural network
WO2021147276A1 (en) Data processing method and apparatus, and chip, electronic device and storage medium
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
CN115668222A (en) Data processing method and device of neural network
CN111767243A (en) Data processing method, related device and computer readable medium
CN115394336A (en) Storage and computation FPGA (field programmable Gate array) framework
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115982418B (en) Method for improving super-division operation performance of AI (advanced technology attachment) computing chip
CN110490312B (en) Pooling calculation method and circuit
CN112200310A (en) Intelligent processor, data processing method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant