CN115049907B - FPGA-based YOLOV4 target detection network implementation method - Google Patents
- Publication number
- CN115049907B (application CN202210983908.3A)
- Authority
- CN
- China
- Prior art keywords
- conv
- fpga
- calculation unit
- calculation
- yolov4
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Complex Calculations (AREA)
- Radar Systems Or Details Thereof (AREA)
Abstract
The invention discloses a method for implementing a YOLOV4 target detection network on an FPGA, comprising: training the YOLOV4 target detection network; after training is completed, building a YOLOV4 target detection model on the FPGA; and identifying a preset target. Building the YOLOV4 target detection model comprises the following substeps: decomposing the YOLOV4 network structure layer into computing units; constructing a Mish function calculation unit on the FPGA; constructing a Leaky Relu function calculation unit on the FPGA; constructing a BN batch normalization calculation unit on the FPGA; constructing a CONV calculation unit on the FPGA; and constructing a storage area on the FPGA. By exploiting the hardware resources of the FPGA and the convenience of implementing logic on it, the invention improves the accuracy and efficiency of the YOLOV4 target detection model while guaranteeing its computing power.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for realizing a YOLOV4 target detection network based on an FPGA.
Background
At present, most target detection networks based on the YOLO architecture are implemented on embedded platforms, and the YOLOV3 architecture is mainly deployed on embedded platforms such as the HiSilicon Hi3559 and the Rockchip RK1808 and RK3399. Implementing the YOLOV3 architecture on an embedded platform has the following drawbacks: compared with YOLOV4, the YOLOV3 network structure has more layers and is more complex, and its target recognition accuracy is lower; the computing efficiency of the embedded platform is low, so it is difficult to reach the desired target recognition accuracy while running at full frame rate.
The YOLOV4 network structure includes an input layer, a backbone layer, a neck layer, and a prediction layer, of which the backbone layer, the neck layer, and the prediction layer are the parts that have to be designed and implemented. The backbone layer comprises CSPn calculation units. A CSPn calculation unit is a cross-stage partial connection calculation unit, where n is the number of residual units; it comprises a CBM calculation unit and n residual units. A CBM calculation unit consists of a CONV calculation unit, a BN batch normalization calculation unit and a Mish function calculation unit (Mish activation function); the CONV calculation unit performs the convolution operation, and the BN batch normalization calculation unit performs the BN (Batch Normalization) operation. The neck layer comprises a plurality of CBL calculation units, each consisting of a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit (Leaky Relu activation function). The prediction layer includes a plurality of CBL calculation units and a plurality of CONV calculation units.
The FPGA platform has abundant DSP computing resources and can provide good computing power. In addition, the FPGA platform also has rich DDR storage resources, and can realize the rapid storage and reading of image data, so that the realization of the YOLOV4 target detection network on the FPGA has feasibility, and therefore, the research on the realization of the YOLOV4 target detection network based on the FPGA is necessary.
Disclosure of Invention
The invention aims to overcome one or more defects in the prior art and provides a method for realizing a Yolov4 target detection network based on an FPGA.
The purpose of the invention is realized by the following technical scheme:
a method for realizing a YOLOV4 target detection network based on an FPGA (field programmable gate array), wherein the FPGA comprises a DDR (double data rate) storage module, an LUT (look-up table) module and a DSP (digital signal processor) module, and the method comprises the following steps:
s1, training a YOLOV4 target detection network on a server based on a training sample;
s2, after the training of the YOLOV4 target detection network is completed, a YOLOV4 target detection model for identifying a preset target in a prediction sample is built on the FPGA, and the method comprises the following substeps:
s21, decomposing a computing unit in the YOLOV4 network structure layer: decomposing a YOLOV4 network structure layer to obtain a plurality of computing units, wherein the computing units comprise a Mish function computing unit, a Leaky Relu function computing unit, a BN batch normalization computing unit and a CONV computing unit;
s22, constructing a Mish function calculation unit on the FPGA: pre-storing an output value of a Mish function in the LUT module, and searching an output value corresponding to the input value in the LUT module by taking the input value of the Mish function as an address;
s23, constructing a Leaky Relu function calculation unit on the FPGA: a Leaky Relu logic module is constructed on the FPGA,
the Leaky Relu logic module is used for receiving a first input value of a Leaky Relu function, generating a first output value of the Leaky Relu function corresponding to the first input value according to the first input value, wherein the first input value is larger than zero, and the first output value corresponding to the first input value is equal to the first input value; pre-storing a second output value of the Leaky Relu function in the LUT module, and using a second input value of the Leaky Relu function as an address to search a second output value corresponding to the second input value in the LUT module, wherein the second input value is less than or equal to zero;
s24, constructing a BN batch normalization calculation unit on the FPGA: constructing a BN batch normalization logic module on an FPGA, and pre-storing variances and mean values for batch normalization processing in the BN batch normalization logic module, wherein the variances for batch normalization processing are mean values of all the variances of the training samples, and the mean values for batch normalization processing are mean values of all the means of the training samples;
s25, constructing a CONV calculation unit on the FPGA: according to the successive time of convolution operation of each CONV calculation unit, constructing a CONV calculation unit at a first time on the DSP module, and after the calculation of the CONV calculation unit at the first time is completed, reconstructing a CONV calculation unit at the next time on the DSP module;
s26, constructing a storage area on the FPGA: constructing a storage area on the DDR storage module, wherein the storage area is used for storing the image matrix input to the CONV calculation unit at the first moment, and the image matrix is generated from a prediction sample; before the image matrix is stored, grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, and writing each transposed array element group into one storage row of the storage area in row order, wherein different array element groups are written into different storage rows; during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting the read array elements into the CONV calculation unit at the first moment;
and S3, inputting the prediction sample into the built YOLOV4 target detection model to identify a preset target.
Preferably, decomposing the YOLOV4 network structure layer to obtain a plurality of computing units specifically includes:
decomposing the backbone layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit;
decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit;
decomposing a prediction layer of a YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit;
decomposing the CSPn calculation unit to obtain the CONV calculation unit, the BN batch normalization calculation unit and the Mish function calculation unit;
and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
Preferably, the input value of the Mish function is a plurality of discrete numerical values in an interval [ -6,6], each discrete numerical value is arranged in a size sequence, and the absolute value of the difference value between two adjacent discrete numerical values is 0.1.
Preferably, the second input value of the Leaky Relu function is a plurality of discrete values in the interval [-10, 0], the discrete values are arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
Preferably, the DSP module includes N DSP operation submodules;
the constructing a CONV calculating unit at a first moment on the DSP module specifically includes:
according to the convolution kernel size n1 × n1 of the CONV calculation unit at the first moment, constructing a first sub-CONV calculation unit on the 1st to the (n1 × n1)-th DSP operation submodules, wherein each sub-CONV calculation unit performs the single-period convolution operation of an n1 × n1 convolution kernel; then constructing a second sub-CONV calculation unit on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP operation submodules, and so on, until M1 sub-CONV calculation units have been constructed on the DSP module; if the first calculation parameter $A_1$ is a positive integer, M1 equals $A_1$; otherwise M1 is the largest positive integer less than $A_1$, wherein $A_1 = N / (n1 \times n1)$.
Preferably, the grouping is performed on all rows of the image matrix to obtain a plurality of array element groups, then 90-degree matrix transposition is performed on each array element group, then the array element groups after transposition are written into one storage row of the storage area in a row sequence, the storage rows written into each array element group are different, during convolution operation, the array elements of the array element groups in the storage rows are read, and the read array elements are input into the CONV calculation unit at the first time, which specifically includes:
grouping all rows of the image matrix to obtain P array element groups;
calculating a second calculation parameter $A_2 = L \cdot Bit_0 / Bit_1$, where L denotes the total number of rows of the image matrix, $Bit_1$ denotes the bus bit width of the storage area, and $Bit_0$ denotes the pixel bit width of the image matrix; if the second calculation parameter is a positive integer, the number P of array element groups equals the second calculation parameter, and each array element group has $L_1 = Bit_1 / Bit_0$ rows; otherwise, P is the smallest positive integer greater than the second calculation parameter, the first to the (P-1)-th array element groups each have $L_1$ rows, and the P-th array element group has $L - (P - 1) \cdot L_1$ rows;
Performing 90-degree matrix transposition on each array element group;
writing each transposed array element group into one storage row of the storage area in row order, wherein different array element groups are written into different storage rows;
when a first round of single-period convolution operation is carried out, M1 array elements are read from a storage area and output to a CONV computing unit at a first moment, and each sub-CONV computing unit receives one array element of the M1 array elements;
and when the next round of single-period convolution operation is carried out, if the number of the residual array elements in the storage area is less than M1, reading all the residual array elements from the storage area and outputting the residual array elements to the CONV calculating unit at the first moment, otherwise, reading M1 array elements from the storage area and outputting the M1 array elements to the CONV calculating unit at the first moment, until all the array elements of P storage lines are output to the CONV calculating unit at the first moment from the storage area.
The invention has the beneficial effects that:
(1) Decomposing the YOLOV4 network structure layer yields four types of basic computing units: the Mish function calculation unit, the Leaky Relu function calculation unit, the BN batch normalization calculation unit and the CONV calculation unit. These four types of basic units perform all computations of the YOLOV4 target detection network. The Mish function calculation unit, and the part of the Leaky Relu function calculation unit whose input value is less than or equal to 0, are built as lookup tables, which makes full use of the LUT module resources and improves the computing efficiency of both units. Because the variance and mean are fixed in advance in the BN batch normalization logic module, its logic reduces to a division by a fixed coefficient, which improves the computing efficiency of the BN batch normalization calculation unit. The CONV calculation units of successive moments are constructed on the same DSP module by time-division multiplexing, so that the DSP computing resources of the FPGA meet the computing power requirement of the YOLOV4 network. In addition, a storage area for the image matrix is constructed on the DDR storage module: the image matrix is grouped, each array element group is transposed before storage, and the transposed groups are written in row order to different storage rows of the storage area. Without transposition, one round of single-period convolution operation could use only one DSP operation submodule to perform one convolution operation; with this reordering of the array elements within a frame, the ports connected to the storage rows can read a plurality of array elements simultaneously and load them into the corresponding CONV calculation units, which improves the utilization of the DSP operation resources.
The building process makes full use of the hardware resources of the FPGA and the convenience of implementing logic on the FPGA to build the YOLOV4 target detection model. It guarantees the computing power of the model, improves target recognition accuracy on that basis, and improves target recognition efficiency through efficient pipelined reuse of the DSP computing resources.
(2) When the CONV calculation unit of each moment is constructed on the DSP module, the number of single-period convolution operations that can run simultaneously is determined from the convolution kernel size, following the principle that as many DSP operation submodules as possible are used at once and as few as possible sit idle. A corresponding number of sub-CONV calculation units is then constructed, so that the DSP operation resources of the FPGA are fully utilized and efficiency is maximized on top of the time-division multiplexing.
(3) Taking advantage of the efficient streaming read characteristics of DDR, during each single-period convolution operation as many array elements as there are sub-CONV calculation units are read from the storage area and delivered to the sub-CONV calculation units, each of which receives one array element. This supplies the data needed for the sub-CONV calculation units to perform single-period convolution operations simultaneously and maximizes the target recognition efficiency of the YOLOV4 target detection network model.
Drawings
Fig. 1 is a schematic diagram of decomposition of a YOLOV4 network structure layer;
FIG. 2 is a schematic diagram of an image matrix;
FIG. 3 is a schematic diagram of how the image matrix is stored in the storage area after transposition.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Referring to fig. 1 to fig. 3, the present embodiment provides a method for implementing a YOLOV4 target detection network based on an FPGA, where the FPGA includes a DDR memory module, an LUT module, and a DSP module. The LUT module represents a Look-Up-Table lookup Table module. The method specifically comprises the following steps:
S1, training a YOLOV4 target detection network on a server based on training samples.
S2, after the training of the YOLOV4 target detection network is completed, a YOLOV4 target detection model for identifying a preset target in a prediction sample is built on the FPGA, and the method comprises the following substeps:
S21, decomposing a computing unit in the YOLOV4 network structure layer: decomposing the YOLOV4 network structure layer to obtain a plurality of computing units, wherein the computing units comprise a Mish function calculation unit (Mish activation function), a Leaky Relu function calculation unit (Leaky Relu activation function), a BN batch normalization calculation unit and a CONV calculation unit.
Optionally, decomposing the YOLOV4 network structure layer to obtain a plurality of computing units specifically includes the following sub-steps: decomposing the backbone layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit; decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit; decomposing the prediction layer of the YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit; decomposing the CSPn calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Mish function calculation unit; and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
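The decomposition can be pictured as a small tree that is expanded until only the four basic unit types remain. The sketch below is illustrative only: the dictionary `COMPOSITION` and the function `basic_units` are hypothetical names, and the table simply restates the sub-steps listed above.

```python
# Hypothetical composition table restating the decomposition sub-steps.
COMPOSITION = {
    "backbone": ["CSPn"],
    "neck": ["CBL"],
    "prediction": ["CBL", "CONV"],
    "CSPn": ["CONV", "BN", "Mish"],
    "CBL": ["CONV", "BN", "LeakyRelu"],
}


def basic_units(name):
    """Recursively expand a layer or composite unit into basic unit types."""
    if name not in COMPOSITION:
        return {name}  # already a basic unit
    result = set()
    for child in COMPOSITION[name]:
        result |= basic_units(child)
    return result


all_units = set()
for layer in ("backbone", "neck", "prediction"):
    all_units |= basic_units(layer)

print(sorted(all_units))  # ['BN', 'CONV', 'LeakyRelu', 'Mish']
```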
S22, constructing a Mish function calculation unit on the FPGA: the output values of the Mish function are pre-stored in the LUT module, and the input value of the Mish function is used as an address to look up the corresponding output value in the LUT module. The expression of the Mish function is: $f(x) = x \cdot \tanh(\ln(1 + e^{x}))$.
Optionally, the input values of the Mish function lie in the interval [-6, 6] and are discrete values arranged in order of magnitude, and the absolute value of the difference between two adjacent input values of the Mish function is 0.1. The mapping between some of the input values and output values of the Mish function calculation unit is shown in Table 1 below. Compared with building the Mish function calculation unit from other logic modules on the FPGA, implementing it as an LUT lookup table is very convenient and also improves computing efficiency.
Table 1
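A software model of the lookup-table approach may help make the addressing concrete. The following is a minimal sketch, assuming the table covers [-6, 6] in steps of 0.1 as stated above; the clamping of out-of-range inputs and the round-to-nearest address mapping in `mish_address` are assumptions, since the source does not say how inputs between or outside the grid points are handled.

```python
import math

STEP = 0.1
LO, HI = -6.0, 6.0
N_ENTRIES = round((HI - LO) / STEP) + 1  # 121 table entries


def mish(x):
    """Reference Mish: x * tanh(ln(1 + e^x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


# Pre-computed outputs, playing the role of the values stored in the LUT module.
MISH_LUT = [mish(LO + i * STEP) for i in range(N_ENTRIES)]


def mish_address(x):
    """Map an input to a table address (assumed: clamp, then round to the grid)."""
    x = min(max(x, LO), HI)
    return round((x - LO) / STEP)


def mish_lookup(x):
    """Evaluate Mish by table lookup instead of arithmetic."""
    return MISH_LUT[mish_address(x)]
```

With this layout the lookup replaces the exponential, logarithm and hyperbolic tangent with a single indexed read, at the cost of quantizing the input to the 0.1 grid.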
S23, constructing a Leaky Relu function calculation unit on the FPGA: a Leaky Relu logic module is constructed on the FPGA; it receives a first input value of the Leaky Relu function, which is greater than zero, and generates the corresponding first output value, which is equal to the first input value. Because the input value and the output value of the Leaky Relu logic module are the same, the module is easy to implement in FPGA logic. A second output value of the Leaky Relu function is pre-stored in the LUT module, and the second input value of the Leaky Relu function, which is less than or equal to zero, is used as an address to look up the corresponding second output value in the LUT module. The expression of the Leaky Relu function is: $f(x) = x$ for $x > 0$ and $f(x) = a x$ for $x \le 0$, where $a$ is the small positive slope coefficient of the negative part.
Optionally, the second input values of the Leaky Relu function lie in the interval [-10, 0] and are discrete values arranged in order of magnitude, and the absolute value of the difference between two adjacent second input values of the Leaky Relu function is 0.1. The mapping between some of the input values and output values of the Leaky Relu function calculation unit is shown in Table 2 below. Computing the part of the Leaky Relu function whose input value is less than or equal to 0 with the LUT module of the FPGA is more convenient than building other logic modules on the FPGA, and it also improves computing efficiency.
Table 2
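The split between pass-through logic for positive inputs and a lookup table for non-positive inputs can be modelled as follows. This is a sketch under stated assumptions: the negative-side slope `SLOPE = 0.1` is not given in the source and is chosen only for illustration, and clamping inputs below -10 to the first table entry is likewise an assumption.

```python
SLOPE = 0.1   # assumed negative-side slope; the source does not state a value
STEP = 0.1
LO = -10.0
N_ENTRIES = round((0.0 - LO) / STEP) + 1  # 101 table entries for [-10, 0]

# Pre-computed outputs for the non-positive inputs (the LUT part).
NEG_LUT = [SLOPE * (LO + i * STEP) for i in range(N_ENTRIES)]


def leaky_relu(x):
    """Positive inputs pass straight through; the rest are looked up."""
    if x > 0:
        return x  # logic-module path: output equals input
    x = max(x, LO)  # assumed: clamp inputs below the table range
    return NEG_LUT[round((x - LO) / STEP)]
```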
S24, constructing a BN batch normalization calculation unit on the FPGA: a BN batch normalization logic module is constructed on the FPGA, and the variance and the mean used for batch normalization are pre-stored in it, wherein the variance used for batch normalization is the mean of the variances of all the training samples, and the mean used for batch normalization is the mean of the means of all the training samples. After the YOLOV4 target detection model has been built, the batch normalization applied to a sample during target identification is: $\hat{x} = \dfrac{x - \mu}{\sqrt{\sigma^{2} + \varepsilon}}$, where $x$ denotes the sample, $\hat{x}$ denotes the sample after batch normalization, $\mu$ denotes the mean used for batch normalization, $\sigma^{2}$ denotes the variance used for batch normalization, and $\varepsilon$ is a constant. Because $\mu$ and $\sigma^{2}$ are fixed, the BN batch normalization logic module only has to divide by a fixed coefficient when normalizing a sample, which improves the efficiency of batch normalization and makes the module easy to implement in FPGA logic.
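Because the statistics are fixed after training, the square root can be evaluated once offline and the per-sample work reduces to a subtraction and a scaling by a constant. The sketch below illustrates this; `make_bn` is a hypothetical helper, and the default `eps=1e-5` and the example statistics are assumptions, not values from the source.

```python
import math


def make_bn(mean, variance, eps=1e-5):
    """Return a batch-normalization function with fixed statistics.

    The reciprocal standard deviation is computed once, so each call
    only subtracts the stored mean and multiplies by a fixed coefficient.
    """
    inv_std = 1.0 / math.sqrt(variance + eps)

    def bn(x):
        return (x - mean) * inv_std

    return bn


# Example with made-up statistics.
bn = make_bn(mean=0.5, variance=4.0)
```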
S25, constructing a CONV calculation unit on the FPGA: following the order in time in which the CONV calculation units perform their convolution operations, the CONV calculation unit of the first moment is constructed on the DSP module, and after its calculation is completed, the CONV calculation unit of the next moment is reconstructed on the DSP module. Based on the decomposition of the YOLOV4 network structure layer in S21, the whole YOLOV4 network structure layer includes a plurality of CONV calculation units, which are distributed in sequence through the YOLOV4 network structure layer. The CONV calculation unit of the first moment is built on the DSP module of the FPGA by time-division multiplexing. After its calculation is completed, the DSP module is released, where releasing means clearing the convolution kernel values loaded in the CONV calculation unit of the first moment and similar operations. The CONV calculation unit of the next moment is then built on the DSP module and loaded with its own convolution kernel values, and so on, until the calculation of the second-to-last CONV calculation unit is completed, the DSP module is released, and the CONV calculation unit of the last moment is built on the DSP module. Because the CONV calculation units account for the largest share of computation in the whole YOLOV4 network structure layer, time-division multiplexing of the DSP module secures the high computing power that the YOLOV4 network structure layer requires.
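The load, compute, release cycle described above can be sketched in software. The class `DspModule` and the function `run_conv_sequence` are hypothetical names for illustration, and feeding the same pixel window to every CONV unit is a simplification: in the network each unit would consume the output of the preceding stage.

```python
class DspModule:
    """Toy stand-in for the shared DSP module (illustrative only)."""

    def __init__(self):
        self.kernel = None

    def load(self, kernel):
        # A new CONV unit may only be built once the module has been released.
        assert self.kernel is None, "DSP module must be released first"
        self.kernel = kernel

    def convolve(self, window):
        # One single-period convolution: element-wise multiply and accumulate.
        return sum(k * p
                   for krow, prow in zip(self.kernel, window)
                   for k, p in zip(krow, prow))

    def release(self):
        # Clear the loaded convolution kernel values.
        self.kernel = None


def run_conv_sequence(dsp, kernels, window):
    """Build, run and release one CONV unit after another on the same module."""
    outputs = []
    for kernel in kernels:
        dsp.load(kernel)
        outputs.append(dsp.convolve(window))
        dsp.release()
    return outputs


dsp = DspModule()
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
window = [[1, 2], [3, 4]]
results = run_conv_sequence(dsp, kernels, window)
```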
Optionally, the DSP module includes N DSP operation submodules, and the constructing of the CONV calculation unit at the first time on the DSP module specifically includes the following substeps:
according to the convolution kernel size n1 × n1 of the CONV calculation unit at the first moment, a first sub-CONV calculation unit is constructed on the 1st to the (n1 × n1)-th DSP operation submodules, each sub-CONV calculation unit performing the single-period convolution operation of an n1 × n1 convolution kernel; a second sub-CONV calculation unit is then constructed on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP operation submodules, and so on, until M1 sub-CONV calculation units have been constructed on the DSP module; if the first calculation parameter $A_1$ is a positive integer, M1 equals $A_1$; otherwise M1 is the largest positive integer less than $A_1$, wherein $A_1 = N / (n1 \times n1)$.
Similarly, reconstructing the CONV calculation unit of the next moment on the DSP module specifically includes the following sub-steps:
according to the convolution kernel size n2 × n2 of the CONV calculation unit at the next moment, a first sub-CONV calculation unit is constructed on the 1st to the (n2 × n2)-th DSP operation submodules, each sub-CONV calculation unit performing the single-period convolution operation of an n2 × n2 convolution kernel; a second sub-CONV calculation unit is then constructed on the (n2 × n2 + 1)-th to the (2 × n2 × n2)-th DSP operation submodules, and so on, until M2 sub-CONV calculation units have been constructed on the DSP module; if the third calculation parameter $A_3$ is a positive integer, M2 equals $A_3$; otherwise M2 is the largest positive integer less than $A_3$, wherein $A_3 = N / (n2 \times n2)$.
The single-period convolution operation refers to the convolution of one convolution kernel at one pixel position. For example, with a convolution kernel of size n1 × n1, one single-period convolution operation comprises n1 × n1 multiplications and n1 × n1 - 1 additions.
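The sizing rule for the sub-CONV calculation units amounts to an integer division. The sketch below assumes that the calculation parameter is the number N of DSP operation submodules divided by the kernel area, which matches the figures of the embodiment (3600 submodules, a 3 × 3 kernel, 400 units); the function names are hypothetical.

```python
def sub_conv_units(n_dsp, kernel_size):
    """Number of sub-CONV units that fit on n_dsp DSP operation submodules.

    Assumed rule: floor(N / (n * n)), i.e. the quotient itself when it is
    an integer, otherwise the largest integer below it.
    """
    return n_dsp // (kernel_size * kernel_size)


def single_period_ops(kernel_size):
    """Multiplications and additions in one single-period convolution."""
    taps = kernel_size * kernel_size
    return taps, taps - 1


print(sub_conv_units(3600, 3))  # 400
print(single_period_ops(3))     # (9, 8)
```

When the quotient is not an integer, the leftover DSP operation submodules stay idle for that CONV unit; for example, 3600 submodules and a 7 × 7 kernel give 73 units and leave 23 submodules unused.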
S26, constructing a storage area on the FPGA: a storage area is constructed on the DDR storage module for storing the image matrix input to the CONV calculation unit at the first moment, the image matrix being generated from a prediction sample. Before the image matrix is stored, all rows of the image matrix are grouped to obtain a plurality of array element groups, a 90-degree matrix transposition is performed on each array element group, and each transposed array element group is written into one storage row of the storage area in row order, different array element groups being written into different storage rows. During the convolution operation, the array elements of the array element groups are read from the storage rows and input into the CONV calculation unit at the first moment.
Optionally, the grouping, transposition, storage and reading described above specifically include the following sub-steps:
grouping all rows of the image matrix to obtain P array element groups, wherein the number P of array element groups is determined as follows: calculating a second calculation parameter $A_2 = L \cdot Bit_0 / Bit_1$, where L denotes the total number of rows of the image matrix, $Bit_1$ denotes the bus bit width of the storage area, and $Bit_0$ denotes the pixel bit width of the image matrix; if the second calculation parameter is a positive integer, the number P of array element groups equals the second calculation parameter, and each array element group has $L_1 = Bit_1 / Bit_0$ rows; otherwise, P is the smallest positive integer greater than the second calculation parameter, the first to the (P-1)-th array element groups each have $L_1$ rows, and the P-th array element group has $L - (P - 1) \cdot L_1$ rows;
performing a 90-degree matrix transposition on each array element group;
writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row;
when the first round of single-cycle convolution operations is carried out, M1 array elements are read from the storage area and output to the CONV calculation unit at the first moment, each sub-CONV calculation unit receiving a different one of the M1 array elements;
when the next round of single-cycle convolution operations is carried out, if fewer than M1 array elements remain in the storage area, all remaining array elements are read from the storage area and output to the CONV calculation unit at the first moment; otherwise, M1 array elements are read from the storage area and output to the CONV calculation unit at the first moment; this continues until all array elements of the P storage rows have been output from the storage area to the CONV calculation unit at the first moment.
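The grouping and transposition sub-steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented hardware design: the function name `group_and_transpose`, the list-of-lists image, and the reading of "90-degree matrix transposition" as an ordinary matrix transpose are all assumptions made for the example.

```python
import math

def group_and_transpose(image, bus_bits, pixel_bits):
    """Split image rows into array element groups and transpose each group.

    Assumption: "90-degree matrix transposition" is treated as a plain
    transpose; a true rotation would additionally reverse one axis.
    """
    rows_per_group = bus_bits // pixel_bits            # pixels per bus word
    num_groups = math.ceil(len(image) / rows_per_group)
    storage_rows = []
    for g in range(num_groups):
        group = image[g * rows_per_group:(g + 1) * rows_per_group]
        # zip(*group) turns the columns of the group into rows.
        transposed = [list(column) for column in zip(*group)]
        storage_rows.append(transposed)                # one storage row per group
    return storage_rows

# Toy example: 5x3 image, 16-bit bus, 8-bit pixels, so 2 image rows per group.
image = [[r * 10 + c for c in range(3)] for r in range(5)]
storage = group_and_transpose(image, bus_bits=16, pixel_bits=8)
```

In this toy case the last group holds only one image row, mirroring the shorter final group described in the sub-steps. After transposition, each word of a group holds one pixel from every image row in that group, which is what lets a single bus read feed several sub-CONV calculation units at once.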
S3, inputting the prediction sample into the built YOLOV4 target detection model to identify the preset target.
In this embodiment, the image matrix is 608 × 608, the FPGA is a model V5 device containing 3600 DSP operation sub-modules, the convolution kernel size of the CONV calculation unit at the first moment is 3 × 3, the bus bit width of the storage area is 512 bits, and the pixel bit width of the image matrix is 8 bits. A single-cycle convolution operation with a 3 × 3 convolution kernel requires 9 DSP operation sub-modules, so M1 is 400. The number of array element groups is 10: the first to ninth array element groups each have 64 rows, and the tenth array element group has 32 rows. The number of storage rows is likewise 10, and 10 ports for outputting the array elements in the storage rows are designated in the storage area: the first storage row corresponds to port0, the second to port1, the third to port2, and so on through the tenth storage row, which corresponds to port9.
During a single-cycle convolution operation, port0 to port6 of the storage area output a total of 400 array elements to the CONV calculation unit at the first moment; each sub-CONV calculation unit receives one array element, and the single-cycle convolution operations on the 400 array elements are performed simultaneously. Port0 to port5 each output 64 array elements from their corresponding storage rows, and port6 outputs 16 array elements from its storage row. After multiple rounds of single-cycle convolution operations, the convolution of 400 rows of the image matrix (as it was before transposition) is complete; the remaining 208 rows are then processed in the same way, which completes the single-cycle convolution operations for one frame of prediction sample with a 608 × 608 image matrix.
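The numbers in this embodiment can be checked with a short Python sketch. The helper `rows_per_port` is hypothetical, and the port split for the remaining 208 rows is an inference from the stated group sizes rather than something the text spells out.

```python
import math

def rows_per_port(start_row, count, rows_per_group):
    """Count how many image rows each port supplies (one port per group)."""
    counts = {}
    for row in range(start_row, start_row + count):
        port = row // rows_per_group
        counts[port] = counts.get(port, 0) + 1
    return counts

dsp_count, kernel = 3600, 3
m1 = dsp_count // (kernel * kernel)           # sub-CONV units available
rows_per_group = 512 // 8                     # bus width / pixel width
num_groups = math.ceil(608 / rows_per_group)  # array element groups
last_group_rows = 608 - (num_groups - 1) * rows_per_group

first_pass = rows_per_port(0, m1, rows_per_group)
# Inferred, not stated in the text: ports used for the remaining rows.
second_pass = rows_per_port(m1, 608 - m1, rows_per_group)
```

With these inputs, `first_pass` reproduces the 64-per-port split over port0 to port5 plus 16 from port6, and the two passes together cover all 608 rows.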
This embodiment has the following notable advantages: the YOLOV4 network structure layer is decomposed into calculation units, and time-division multiplexing makes full use of the FPGA's computing resources to meet the high computing power that the YOLOV4 network requires; in addition, both the arrangement of the image matrix as it is written into DDR storage and the way data are combined as they are read out of DDR storage are optimized. Together these yield a YOLOV4 target detection model built on an FPGA platform which, compared with the embedded-platform YOLOV3 application mentioned in the background, can identify preset targets with higher accuracy and better real-time performance.
The foregoing describes preferred embodiments of the present invention. It is to be understood that the invention is not limited to the forms disclosed herein, that this description is not to be construed as excluding other embodiments, and that the invention may be used in various other combinations, modifications and environments and may be modified within the scope of the concepts described herein, through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
1. A method for implementing a YOLOV4 target detection network based on an FPGA (field programmable gate array), wherein the FPGA comprises a DDR (double data rate) storage module, an LUT (look-up table) module and a DSP (digital signal processor) module, the method being characterized by comprising the following steps:
S1, training a YOLOV4 target detection network on a server based on training samples;
S2, after the training of the YOLOV4 target detection network is completed, building on the FPGA a YOLOV4 target detection model for identifying a preset target in a prediction sample, which comprises the following sub-steps:
S21, decomposing the YOLOV4 network structure layer into calculation units: decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units, wherein the calculation units comprise a Mish function calculation unit, a Leaky Relu function calculation unit, a BN batch normalization calculation unit and a CONV calculation unit;
S22, constructing a Mish function calculation unit on the FPGA: pre-storing output values of the Mish function in the LUT module, and looking up the output value corresponding to an input value in the LUT module by using the input value of the Mish function as an address;
S23, constructing a Leaky Relu function calculation unit on the FPGA: constructing a Leaky Relu logic module on the FPGA, wherein the Leaky Relu logic module is used for receiving a first input value of the Leaky Relu function and generating, according to the first input value, the corresponding first output value of the Leaky Relu function, the first input value being greater than zero and equal to its corresponding first output value; pre-storing second output values of the Leaky Relu function in the LUT module, and looking up the second output value corresponding to a second input value in the LUT module by using the second input value of the Leaky Relu function as an address, wherein the second input value is less than or equal to zero;
S24, constructing a BN batch normalization calculation unit on the FPGA: constructing a BN batch normalization logic module on the FPGA, and pre-storing in the BN batch normalization logic module the variance and the mean used for batch normalization, wherein the variance used for batch normalization is the mean of all variances of the training samples, and the mean used for batch normalization is the mean of all means of the training samples;
S25, constructing a CONV calculation unit on the FPGA: according to the chronological order in which the CONV calculation units perform their convolution operations, constructing the CONV calculation unit at a first moment on the DSP module, and after the calculation of the CONV calculation unit at the first moment is completed, reconstructing the CONV calculation unit at the next moment on the DSP module;
S26, constructing a storage area on the FPGA: constructing a storage area on the DDR storage module, wherein the storage area is used for storing the image matrix that is input to the CONV calculation unit at the first moment, and the image matrix is generated from a prediction sample; before the image matrix is stored, grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, and writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row; during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting the read array elements to the CONV calculation unit at the first moment;
S3, inputting the prediction sample into the built YOLOV4 target detection model to identify the preset target.
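As an illustration of step S24, the following Python sketch averages per-batch statistics into the single mean and variance that would be pre-stored, then normalizes one value with them. It is a software sketch under assumptions, not the FPGA logic: the per-batch interpretation of "all the means of the training samples", the scale `gamma`, shift `beta`, stabilizer `eps`, and the sample statistics are all hypothetical.

```python
import math

def mean_of(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x with fixed statistics (gamma, beta and eps are assumed)."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

# Hypothetical per-batch statistics collected during training.
batch_means = [0.25, 0.75]
batch_vars = [0.5, 1.5]

stored_mean = mean_of(batch_means)  # value pre-stored in the BN logic module
stored_var = mean_of(batch_vars)    # value pre-stored in the BN logic module

y = batch_norm(1.5, stored_mean, stored_var, eps=0.0)
```

Because the statistics are fixed before inference, the normalization reduces to one subtraction, one multiplication and one addition per value, which suits a fixed logic module.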
2. The method for implementing the FPGA-based YOLOV4 target detection network according to claim 1, wherein decomposing the YOLOV4 network structure layer to obtain a plurality of computing units specifically includes:
decomposing the backbone layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit;
decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit;
decomposing a prediction layer of a YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit;
decomposing the CSPn calculating unit to obtain a CONV calculating unit, a BN batch normalization calculating unit and a Mish function calculating unit;
and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
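The decomposition in claim 2 can be pictured as a small tree that is expanded until only primitive calculation units remain. The Python sketch below is illustrative only; the dictionary keys and the helper `primitive_units` are hypothetical names, with "backbone" standing for the skeleton (backbone) layer of the claim.

```python
# Each entry maps a layer or composite unit to the units it decomposes into.
DECOMPOSITION = {
    "backbone": ["CSPn"],            # skeleton (backbone) layer
    "neck": ["CBL"],
    "prediction": ["CBL", "CONV"],
    "CSPn": ["CONV", "BN", "Mish"],
    "CBL": ["CONV", "BN", "LeakyRelu"],
}

def primitive_units(name):
    """Expand a layer recursively into its distinct primitive units."""
    children = DECOMPOSITION.get(name)
    if children is None:             # not decomposable: already primitive
        return [name]
    units = []
    for child in children:
        for unit in primitive_units(child):
            if unit not in units:
                units.append(unit)
    return units

backbone_units = primitive_units("backbone")
prediction_units = primitive_units("prediction")
```

Expanding all three layers yields only four primitive unit types, which is why four hardware building blocks are enough to cover the network.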
3. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein the input values of the Mish function are a plurality of discrete values in the interval [-6, 6], the discrete values are arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
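A software model of such a lookup table might look like the Python sketch below. The table covers the interval and step stated in claim 3; how an input is mapped to a table address (rounding to the nearest grid point and clamping out-of-range inputs) is an assumption, as are the names `MISH_LUT` and `mish_lookup`.

```python
import math

def mish(x):
    """Reference Mish function: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))

# 121 pre-computed outputs for inputs -6.0, -5.9, ..., 6.0.
MISH_LUT = [mish(k / 10) for k in range(-60, 61)]

def mish_lookup(x):
    """Look up Mish(x) in the table.

    Assumptions: inputs outside [-6, 6] are clamped, and the address is
    the index of the nearest grid point.
    """
    x = min(max(x, -6.0), 6.0)
    address = round(x * 10) + 60
    return MISH_LUT[address]

approx = mish_lookup(1.0)
```

The table trades accuracy for speed: inputs between grid points return the value of the nearest stored point, so the error depends on the 0.1 step.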
4. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein the second input values of the Leaky Relu function are a plurality of discrete values in the interval [-10, 0], the discrete values are arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
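The split between pass-through logic for positive inputs and a lookup table for non-positive inputs can be sketched in Python as follows. The negative slope of 0.1, the clamping of inputs below -10, and the names `NEG_LUT` and `leaky_relu` are assumptions for illustration; the claims do not state the slope.

```python
NEG_SLOPE = 0.1  # assumed slope; not given in the claims

# 101 pre-computed outputs for inputs -10.0, -9.9, ..., 0.0.
NEG_LUT = [NEG_SLOPE * (k / 10) for k in range(-100, 1)]

def leaky_relu(x):
    """Positive inputs pass through; other inputs use the lookup table."""
    if x > 0:
        return x                    # first input value equals first output value
    x = max(x, -10.0)               # assumption: clamp to the table range
    address = round(x * 10) + 100   # nearest grid point as table address
    return NEG_LUT[address]

positive = leaky_relu(3.5)
negative = leaky_relu(-2.0)
```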
5. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein N DSP operation sub-modules are included in the DSP module;
constructing the CONV calculation unit at the first moment on the DSP module specifically includes:
according to the convolution kernel size n1 × n1 of the CONV calculation unit at the first moment, constructing a first sub-CONV calculation unit on the first to the (n1 × n1)-th DSP operation sub-modules, each sub-CONV calculation unit being used for performing a single-cycle convolution operation with the n1 × n1 convolution kernel; then constructing a second sub-CONV calculation unit on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP operation sub-modules, and so on, until M1 sub-CONV calculation units have been constructed on the DSP module; if the first calculation parameter $A_1$ is a positive integer, M1 equals $A_1$; otherwise, M1 is the largest positive integer less than $A_1$, wherein $A_1 = N / (n1 \times n1)$.
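The allocation of DSP operation sub-modules to sub-CONV calculation units in claim 5 amounts to integer division plus consecutive index ranges. The Python sketch below is a hedged illustration; the function name `plan_sub_conv_units` and the 1-based inclusive ranges are choices made for the example.

```python
def plan_sub_conv_units(num_dsp, kernel_size):
    """Return M1 and the 1-based inclusive DSP index range of each unit."""
    per_unit = kernel_size * kernel_size   # DSP sub-modules per sub-CONV unit
    m1 = num_dsp // per_unit               # floor of N / (n1 * n1)
    ranges = [(i * per_unit + 1, (i + 1) * per_unit) for i in range(m1)]
    return m1, ranges

# Values from the described embodiment: 3600 DSP sub-modules, 3 x 3 kernel.
m1, ranges = plan_sub_conv_units(3600, 3)

# A case where N is not a multiple of n1 * n1: one DSP sub-module stays unused.
m1_uneven, ranges_uneven = plan_sub_conv_units(10, 3)
```

Floor division covers both branches of the claim: it returns the quotient itself when it is an integer and the largest smaller integer otherwise.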
6. The FPGA-based YOLOV4 target detection network implementation method according to claim 5, wherein grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, writing the transposed array element groups into storage rows of the storage area in row order with each array element group written into a different storage row, and, during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting the read array elements to the CONV calculation unit at the first moment, specifically comprises:
grouping all rows of the image matrix to obtain P array element groups;
calculating a second calculation parameter $L / (Bit_1 / Bit_0)$, where $L$ denotes the total number of rows of the image matrix, $Bit_1$ denotes the bus bit width of the storage area, and $Bit_0$ denotes the pixel bit width of the image matrix; if the second calculation parameter is a positive integer, the number P of array element groups equals the second calculation parameter and each array element group has $Bit_1 / Bit_0$ rows; otherwise, P is the smallest positive integer greater than the second calculation parameter, the first to the (P-1)-th array element groups each have $L_1 = Bit_1 / Bit_0$ rows, and the P-th array element group has $L - (P-1) \cdot L_1$ rows;
performing a 90-degree matrix transposition on each array element group;
writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row;
when the first round of single-cycle convolution operations is carried out, M1 array elements are read from the storage area and output to the CONV calculation unit at the first moment, and each sub-CONV calculation unit receives one of the M1 array elements;
when the next round of single-cycle convolution operations is carried out, if fewer than M1 array elements remain in the storage area, all remaining array elements are read from the storage area and output to the CONV calculation unit at the first moment; otherwise, M1 array elements are read from the storage area and output to the CONV calculation unit at the first moment; this continues until all array elements of the P storage rows have been output from the storage area to the CONV calculation unit at the first moment.
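The round-by-round reading rule at the end of claim 6 can be expressed as a short loop. In this Python sketch the function name `read_rounds` is hypothetical, and the example counts one array element per image row, following the 400-then-208 split described in the embodiment.

```python
def read_rounds(total_elements, m1):
    """List how many array elements are read in each round."""
    rounds = []
    remaining = total_elements
    while remaining > 0:
        batch = min(m1, remaining)  # read M1, or everything that is left
        rounds.append(batch)
        remaining -= batch
    return rounds

# 608 array elements with M1 = 400 sub-CONV calculation units.
schedule = read_rounds(608, 400)
```

Only the final round can be smaller than M1, so at most one round leaves some sub-CONV calculation units idle.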
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210983908.3A CN115049907B (en) | 2022-08-17 | 2022-08-17 | FPGA-based YOLOV4 target detection network implementation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115049907A (en) | 2022-09-13 |
CN115049907B (en) | 2022-10-28 |
Family
ID=83168081
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210983908.3A Active CN115049907B (en) | 2022-08-17 | 2022-08-17 | FPGA-based YOLOV4 target detection network implementation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115049907B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |