CN115049907B - FPGA-based YOLOV4 target detection network implementation method - Google Patents

Publication number
CN115049907B
Authority
CN
China
Prior art keywords
conv
fpga
calculation unit
calculation
yolov4
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210983908.3A
Other languages
Chinese (zh)
Other versions
CN115049907A (en)
Inventor
褚俊波
李非桃
冉欢欢
李和伦
陈益
王丹
陈春
李毅捷
赵瑞欣
莫桥波
王逸凡
李东晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Desheng Xinda Brain Intelligence Technology Co ltd
Original Assignee
Sichuan Desheng Xinda Brain Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Desheng Xinda Brain Intelligence Technology Co ltd
Priority to CN202210983908.3A
Publication of CN115049907A
Application granted
Publication of CN115049907B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Complex Calculations (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses an FPGA-based method for implementing a YOLOV4 target detection network. The method comprises: training the YOLOV4 target detection network; after training is completed, building a YOLOV4 target detection model on the FPGA; and identifying a preset target. Building the YOLOV4 target detection model comprises the following substeps: decomposing the YOLOV4 network structure layer into calculation units; constructing a Mish function calculation unit on the FPGA; constructing a Leaky Relu function calculation unit on the FPGA; constructing a BN batch normalization calculation unit on the FPGA; constructing a CONV calculation unit on the FPGA; and constructing a storage area on the FPGA. By exploiting the hardware resources of the FPGA and the convenience of implementing logic on it, the invention improves the accuracy and efficiency of the YOLOV4 target detection model while guaranteeing the required computing power.

Description

FPGA-based YOLOV4 target detection network implementation method
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a method for realizing a YOLOV4 target detection network based on an FPGA.
Background
At present, most target detection networks based on the YOLO architecture are implemented on embedded platforms, and the YOLOV3 architecture is mainly deployed on embedded platforms such as the HiSilicon Hi3559, the Rockchip RK1808 and the Rockchip RK3399. Implementing the YOLOV3 architecture on an embedded platform has the following drawbacks: compared with YOLOV4, the network structure layer of YOLOV3 has more layers and is more complex, and its target identification accuracy is lower; in addition, the computing efficiency of embedded platforms is low, so it is difficult to reach the desired target identification accuracy while running at full frame rate.
The YOLOV4 network structure includes an input layer, a backbone layer, a neck layer and a prediction layer, of which the backbone layer, the neck layer and the prediction layer are the parts that must be designed and implemented. The backbone layer comprises CSPn calculation units. A CSPn calculation unit is a cross-stage partial connection calculation unit, where n is the number of residual units; it comprises a CBM calculation unit and n residual units. A CBM calculation unit consists of a CONV calculation unit, a BN batch normalization calculation unit and a Mish function calculation unit (Mish activation function); the CONV calculation unit performs the convolution operation, and the BN batch normalization calculation unit performs the BN (Batch Normalization) operation. The neck layer comprises a plurality of CBL calculation units, each consisting of a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit (Leaky Relu activation function). The prediction layer comprises a plurality of CBL calculation units and a plurality of CONV calculation units.
The FPGA platform has abundant DSP computing resources and can provide good computing power. It also has abundant DDR storage resources and can store and read image data quickly. Implementing the YOLOV4 target detection network on an FPGA is therefore feasible, and research into an FPGA-based implementation of the YOLOV4 target detection network is necessary.
Disclosure of Invention
The invention aims to overcome one or more defects in the prior art and provides a method for implementing a YOLOV4 target detection network based on an FPGA.
The purpose of the invention is achieved by the following technical scheme:
A method for implementing a YOLOV4 target detection network based on an FPGA (field programmable gate array), wherein the FPGA comprises a DDR (double data rate) storage module, an LUT (look-up table) module and a DSP (digital signal processing) module, and the method comprises the following steps:
S1, training a YOLOV4 target detection network on a server based on training samples;
S2, after the training of the YOLOV4 target detection network is completed, building on the FPGA a YOLOV4 target detection model for identifying a preset target in a prediction sample, which comprises the following substeps:
S21, decomposing the YOLOV4 network structure layer into calculation units: decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units, wherein the calculation units comprise a Mish function calculation unit, a Leaky Relu function calculation unit, a BN batch normalization calculation unit and a CONV calculation unit;
S22, constructing a Mish function calculation unit on the FPGA: pre-storing the output values of the Mish function in the LUT module, and looking up the output value corresponding to an input value in the LUT module by using the input value of the Mish function as the address;
S23, constructing a Leaky Relu function calculation unit on the FPGA: constructing a Leaky Relu logic module on the FPGA, wherein the Leaky Relu logic module receives a first input value of the Leaky Relu function and generates the corresponding first output value, the first input value being greater than zero and the first output value being equal to the first input value; pre-storing second output values of the Leaky Relu function in the LUT module, and looking up the second output value corresponding to a second input value in the LUT module by using the second input value of the Leaky Relu function as the address, the second input value being less than or equal to zero;
S24, constructing a BN batch normalization calculation unit on the FPGA: constructing a BN batch normalization logic module on the FPGA, and pre-storing in it the variance and the mean used for batch normalization, wherein the variance used for batch normalization is the mean of the variances of all training samples, and the mean used for batch normalization is the mean of the means of all training samples;
S25, constructing a CONV calculation unit on the FPGA: according to the order in time in which the CONV calculation units perform their convolution operations, constructing the CONV calculation unit of the first moment on the DSP module, and after the calculation of the CONV calculation unit of the first moment is completed, reconstructing the CONV calculation unit of the next moment on the DSP module;
S26, constructing a storage area on the FPGA: constructing a storage area on the DDR storage module, wherein the storage area stores the image matrix that is input to the CONV calculation unit of the first moment, and the image matrix is generated from the prediction sample; before the image matrix is stored, grouping all rows of the image matrix to obtain a plurality of array element groups, performing 90-degree matrix transposition on each array element group, and writing each transposed array element group into one storage row of the storage area in row order, each array element group being written to a different storage row; and, during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting them into the CONV calculation unit of the first moment;
S3, inputting the prediction sample into the built YOLOV4 target detection model to identify the preset target.
Preferably, decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units specifically includes:
decomposing the backbone layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit;
decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit;
decomposing the prediction layer of the YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit;
decomposing the CSPn calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Mish function calculation unit;
and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
Preferably, the input values of the Mish function are a plurality of discrete values in the interval [-6, 6], arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
Preferably, the second input values of the Leaky Relu function are a plurality of discrete values in the interval [-10, 0], arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
Preferably, the DSP module includes N DSP operation submodules;
constructing the CONV calculation unit of the first moment on the DSP module specifically includes:
according to the convolution kernel size n1 × n1 of the CONV calculation unit of the first moment, constructing a first sub-CONV calculation unit on the 1st to the (n1 × n1)-th DSP operation submodules, wherein each sub-CONV calculation unit performs a single-period convolution operation with the n1 × n1 convolution kernel; then constructing a second sub-CONV calculation unit on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP operation submodules; and so on, until M1 sub-CONV calculation units have been constructed on the DSP module; if the first calculation parameter A_1 is a positive integer, M1 is equal to A_1; otherwise, M1 is the largest positive integer smaller than A_1, wherein
$$A_1 = \frac{N}{n1 \times n1}$$
Preferably, grouping all rows of the image matrix to obtain a plurality of array element groups, performing 90-degree matrix transposition on each array element group, writing each transposed array element group into one storage row of the storage area in row order, with each array element group written to a different storage row, and, during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting them into the CONV calculation unit of the first moment specifically includes:
grouping all rows of the image matrix to obtain P array element groups;
calculating a second calculation parameter
$$A_2 = \frac{L}{Bit_1 / Bit_0}$$
wherein L denotes the total number of rows of the image matrix, Bit_1 denotes the bus bit width of the storage area, and Bit_0 denotes the pixel bit width of the image matrix; if the second calculation parameter is a positive integer, the number P of array element groups is equal to the second calculation parameter, and the number of rows of each array element group is
$$L_1 = \frac{Bit_1}{Bit_0}$$
otherwise, the number P of array element groups is the smallest positive integer larger than the second calculation parameter, the number of rows of each of the first to the (P-1)-th array element groups is L_1, and the number of rows of the P-th array element group is
$$L - (P - 1) \times L_1$$
performing 90-degree matrix transposition on each array element group;
writing each transposed array element group into one storage row of the storage area in row order, each array element group being written to a different storage row;
when the first round of single-period convolution operation is carried out, reading M1 array elements from the storage area and outputting them to the CONV calculation unit of the first moment, each sub-CONV calculation unit receiving one of the M1 array elements;
and when the next round of single-period convolution operation is carried out, if the number of array elements remaining in the storage area is less than M1, reading all remaining array elements from the storage area and outputting them to the CONV calculation unit of the first moment; otherwise, reading M1 array elements from the storage area and outputting them to the CONV calculation unit of the first moment, until all array elements of the P storage rows have been output from the storage area to the CONV calculation unit of the first moment.
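The grouping, transposition and read-out steps above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the FPGA logic itself: it assumes that the number of rows per group equals the bus bit width divided by the pixel bit width, that a transposed group is flattened row by row into one storage row, and that an "array element" is a single pixel value. All function names are hypothetical.

```python
import math

def group_rows(image, bus_bits, pixel_bits):
    """Split the image rows into P groups (assumed rows per group: bus_bits // pixel_bits)."""
    rows_per_group = bus_bits // pixel_bits
    p = math.ceil(len(image) / rows_per_group)  # last group may hold fewer rows
    return [image[g * rows_per_group:(g + 1) * rows_per_group] for g in range(p)]

def transpose(group):
    """Matrix transposition of one group: columns become rows."""
    return [list(col) for col in zip(*group)]

def to_storage_rows(image, bus_bits, pixel_bits):
    """Write each transposed group, in row order, into its own storage row."""
    storage = []
    for group in group_rows(image, bus_bits, pixel_bits):
        storage.append([px for row in transpose(group) for px in row])
    return storage

def read_rounds(storage_rows, m1):
    """Yield up to m1 array elements per round; the final round may be shorter."""
    elements = [px for row in storage_rows for px in row]
    for start in range(0, len(elements), m1):
        yield elements[start:start + m1]

# Example: 5 rows of 3 pixels, 16-bit bus, 8-bit pixels -> groups of 2, 2 and 1 rows.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
rows = to_storage_rows(image, bus_bits=16, pixel_bits=8)
rounds = list(read_rounds(rows, m1=4))
```

In this toy example the first storage row interleaves the pixels of image rows 1 and 2, so that pixels from the same column sit next to each other in storage.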
The invention has the following beneficial effects:
(1) The YOLOV4 network structure layer is decomposed into four types of basic calculation units: the Mish function calculation unit, the Leaky Relu function calculation unit, the BN batch normalization calculation unit and the CONV calculation unit, which together perform all the calculations of the YOLOV4 target detection network. The Mish function calculation unit, and the part of the Leaky Relu function calculation unit whose input value is less than or equal to 0, are built as look-up tables, which makes full use of LUT module resources and improves the calculation efficiency of both units. Because the variance and the mean are fixed in advance in the BN batch normalization logic module, its logic reduces to a division by a fixed coefficient, which improves the calculation efficiency of the BN batch normalization calculation unit. The CONV calculation units of successive moments are built on the same DSP module by time-division multiplexing, so that the DSP computing resources of the FPGA meet the computing power requirement of the YOLOV4 network. In addition, a storage area for the image matrix is constructed on the DDR storage module: the image matrix is grouped, each array element group is transposed before storage, and the transposed array element groups are stored in row order in different storage rows of the storage area. Without transposition, one round of single-period convolution operation could use only one DSP operation submodule to perform one convolution operation; with the array elements within a frame reordered in this way, the port connected to a storage row can read a plurality of array elements at once and load them into the corresponding CONV calculation units, which improves the utilization of DSP operation resources.
The building process makes full use of the hardware resources of the FPGA and of the convenience of implementing logic on it. It builds the YOLOV4 target detection model on the FPGA, guarantees the computing power of the model, improves target identification accuracy on the basis of that computing power, and improves target identification efficiency through efficient pipelined reuse of DSP computing resources.
(2) When the CONV calculation unit of each moment is constructed on the DSP module, the number of single-period convolution operations that can be performed simultaneously is determined from the convolution kernel size, on the principle that as many DSP operation submodules as possible are used at the same time and as few as possible are left idle; the corresponding number of sub-CONV calculation units for single-period convolution operations is then constructed. This makes full use of the DSP operation resources in the FPGA and further maximizes efficiency on top of time-division multiplexing.
(3) Taking advantage of the efficient read flow of the DDR, during a single-period convolution operation the same number of array elements as there are sub-CONV calculation units is read from the storage area and passed to the sub-CONV calculation units, each of which receives one array element. This supplies the data the sub-CONV calculation units need to perform single-period convolution operations simultaneously and maximizes the target identification efficiency of the YOLOV4 target detection network model.
Drawings
FIG. 1 is a schematic diagram of the decomposition of the YOLOV4 network structure layer;
FIG. 2 is a schematic diagram of an image matrix;
FIG. 3 is a schematic diagram of how the image matrix is stored in the storage area after transposition.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Referring to FIG. 1 to FIG. 3, the present embodiment provides a method for implementing a YOLOV4 target detection network based on an FPGA, where the FPGA includes a DDR storage module, an LUT module and a DSP module. The LUT module is a look-up table module. The method specifically comprises the following steps:
S1, training a YOLOV4 target detection network on a server based on training samples.
S2, after the training of the YOLOV4 target detection network is completed, building on the FPGA a YOLOV4 target detection model for identifying a preset target in a prediction sample, which comprises the following substeps:
S21, decomposing the YOLOV4 network structure layer into calculation units: decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units, wherein the calculation units comprise a Mish function calculation unit (Mish activation function), a Leaky Relu function calculation unit (Leaky Relu activation function), a BN batch normalization calculation unit and a CONV calculation unit.
Optionally, decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units specifically includes the following substeps: decomposing the backbone layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit; decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit; decomposing the prediction layer of the YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit; decomposing the CSPn calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Mish function calculation unit; and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
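The decomposition described above can be pictured as a small tree whose leaves are the four basic calculation units. The following Python sketch is illustrative only; the dictionary layout and the names `DECOMPOSITION` and `basic_units` are hypothetical and simply restate the substeps listed in the text.

```python
# Each key decomposes into the listed units; names absent from the keys are basic units.
DECOMPOSITION = {
    "backbone": ["CSPn"],
    "neck": ["CBL"],
    "prediction": ["CBL", "CONV"],
    "CSPn": ["CONV", "BN", "Mish"],
    "CBL": ["CONV", "BN", "LeakyRelu"],
}

def basic_units(name):
    """Recursively expand a layer or composite unit into its basic calculation units."""
    children = DECOMPOSITION.get(name)
    if children is None:
        return {name}  # already a basic unit
    result = set()
    for child in children:
        result |= basic_units(child)
    return result

# Union over the three layers that have to be implemented.
all_units = basic_units("backbone") | basic_units("neck") | basic_units("prediction")
```

Expanding all three layers yields exactly the four basic unit types named in S21.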
S22, constructing a Mish function calculation unit on the FPGA: the output values of the Mish function are pre-stored in the LUT module, and the output value corresponding to an input value is looked up in the LUT module by using the input value of the Mish function as the address. The expression of the Mish function is:
$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$
Optionally, the input values of the Mish function are discrete values in the interval [-6, 6], arranged in order of magnitude, and the absolute value of the difference between two adjacent input values is 0.1. The mapping between some of the input values and output values of the Mish function calculation unit is shown in Table 1 below. Compared with implementing the Mish function calculation unit by constructing other logic modules on the FPGA, implementing it as an LUT look-up table is very convenient and also improves calculation efficiency.
Table 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
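The look-up scheme of S22 can be modeled in software as follows. This is a minimal sketch, not the FPGA implementation: the interval [-6, 6] and the step 0.1 come from the text, while the clamping of out-of-range inputs, the rounding to the nearest table entry and the floating-point storage are assumptions made for illustration (a hardware LUT would typically hold fixed-point values). The names `MISH_LUT` and `mish_lookup` are hypothetical.

```python
import math

X_MIN, X_MAX = -6.0, 6.0   # input interval stated in the text
STEP = 0.1                 # spacing between adjacent table inputs stated in the text

def mish(x):
    """Reference Mish function: x * tanh(ln(1 + e^x))."""
    return x * math.tanh(math.log1p(math.exp(x)))

# Pre-store the outputs: entry k holds Mish(X_MIN + k * STEP).
N_ENTRIES = round((X_MAX - X_MIN) / STEP) + 1
MISH_LUT = [mish(X_MIN + k * STEP) for k in range(N_ENTRIES)]

def mish_lookup(x):
    """Use the input as an address into the table (assumed: clamp, then round to nearest)."""
    x = min(max(x, X_MIN), X_MAX)
    address = round((x - X_MIN) / STEP)
    return MISH_LUT[address]
```

With these parameters the table has 121 entries, and for an input that lies on the 0.1 grid the lookup returns the same value as evaluating the function directly, up to floating-point rounding.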
S23, constructing a Leaky Relu function calculation unit on the FPGA: a Leaky Relu logic module is constructed on the FPGA, wherein the Leaky Relu logic module receives a first input value of the Leaky Relu function and generates the corresponding first output value, the first input value being greater than zero and the first output value being equal to the first input value. Because the input value and the output value of the Leaky Relu logic module are the same, the module is easy to implement as logic on the FPGA. Second output values of the Leaky Relu function are pre-stored in the LUT module, and the second output value corresponding to a second input value is looked up in the LUT module by using the second input value of the Leaky Relu function as the address, the second input value being less than or equal to zero. The expression of the Leaky Relu function is:
$$f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$$
where a is the fixed slope coefficient applied to non-positive inputs.
Optionally, the second input values of the Leaky Relu function are discrete values in the interval [-10, 0], arranged in order of magnitude, and the absolute value of the difference between two adjacent second input values is 0.1. The mapping between some of the input values and output values of the Leaky Relu function calculation unit is shown in Table 2 below. Computing the part of the Leaky Relu function whose input value is less than or equal to 0 with the LUT module of the FPGA is more convenient than constructing other logic modules on the FPGA, and also improves calculation efficiency.
Table 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
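The split between a pass-through logic path for positive inputs and a look-up table for non-positive inputs can be sketched as follows. This is an illustration under assumptions: the interval [-10, 0] and the step 0.1 come from the text, but the slope value `ALPHA = 0.1` is assumed here because the text does not state it, and clamping inputs below -10 is likewise an assumption. The names `NEG_LUT` and `leaky_relu_unit` are hypothetical.

```python
ALPHA = 0.1     # ASSUMED negative slope; not given in the text
X_MIN = -10.0   # lower end of the table interval stated in the text
STEP = 0.1      # spacing between adjacent table inputs stated in the text

# Pre-store the outputs for inputs -10.0, -9.9, ..., 0.0 (101 entries).
NEG_LUT = [ALPHA * (X_MIN + k * STEP) for k in range(101)]

def leaky_relu_unit(x):
    """Positive inputs pass straight through; other inputs are looked up in the table."""
    if x > 0:
        return x                         # logic path: output equals input
    x = max(x, X_MIN)                    # assumed clamp for inputs below the table range
    address = round((x - X_MIN) / STEP)  # input value used as the table address
    return NEG_LUT[address]
```

The positive branch needs no arithmetic at all, which matches the remark in the text that this part is easy to realize as logic.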
S24, constructing a BN batch normalization calculation unit on the FPGA: a BN batch normalization logic module is constructed on the FPGA, and the variance and the mean used for batch normalization are pre-stored in it, wherein the variance used for batch normalization is the mean of the variances of all training samples, and the mean used for batch normalization is the mean of the means of all training samples. After the YOLOV4 target detection model has been built, the batch normalization of a sample during target identification is calculated as:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}$$
where $x$ denotes the sample, $\hat{x}$ denotes the sample after batch normalization, $\mu$ denotes the mean used for batch normalization, $\sigma^{2}$ denotes the variance used for batch normalization, and $\epsilon$ is a constant. Because the mean and the variance are fixed, the BN batch normalization logic module only performs a division by a fixed coefficient when batch-normalizing a sample, which improves the efficiency of batch normalization; this logic is also easy to implement on the FPGA.
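Because the mean and the variance are fixed after training, the divisor can be computed once and reused for every sample. The Python sketch below illustrates this idea only; the default value of `eps` and the name `make_bn_unit` are assumptions for the example, not values taken from the text.

```python
import math

def make_bn_unit(mean, variance, eps=1e-5):
    """Build a batch normalization function with a fixed, precomputed divisor.

    eps is an assumed small constant; the text only says it is a constant.
    """
    divisor = math.sqrt(variance + eps)  # computed once, then fixed

    def bn(x):
        # Per sample: one subtraction and one division by the fixed coefficient.
        return (x - mean) / divisor

    return bn

# Example with hypothetical statistics: mean 2.0, variance 4.0, eps set to 0 for clarity.
bn = make_bn_unit(mean=2.0, variance=4.0, eps=0.0)
normalized = bn(6.0)
```

In hardware, a division by a fixed coefficient can equally be realized as a multiplication by its precomputed reciprocal; that variant is a design choice and is not stated in the text.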
S25, constructing a CONV calculation unit on the FPGA: according to the order in time in which the CONV calculation units perform their convolution operations, the CONV calculation unit of the first moment is constructed on the DSP module, and after its calculation is completed, the CONV calculation unit of the next moment is reconstructed on the DSP module. Following the decomposition of the YOLOV4 network structure layer in S21, the whole YOLOV4 network structure layer contains a plurality of CONV calculation units arranged in sequence. The CONV calculation units are built on the DSP module of the FPGA by time-division multiplexing: the CONV calculation unit of the first moment is built on the DSP module; after its calculation is completed, the DSP module is released, where releasing means, among other operations, clearing the convolution kernel values loaded into the CONV calculation unit of the first moment; the CONV calculation unit of the next moment is then built on the DSP module and its convolution kernel values are loaded; and so on, until the CONV calculation unit of the penultimate moment has completed its calculation, the DSP module is released, and the CONV calculation unit of the last moment is built on the DSP module. Because the CONV calculation units account for the largest share of the computation in the whole YOLOV4 network structure layer, time-division multiplexing of the DSP module provides the high computing power that the YOLOV4 network structure layer requires.
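The build, compute and release cycle described in S25 can be modeled as a simple loop over the CONV calculation units in their order of execution. The sketch below is purely illustrative: the class `DspModule`, the function `run_conv_units` and the stand-in `compute` method (which merely scales a number by the sum of the kernel weights instead of performing a real convolution) are all hypothetical.

```python
class DspModule:
    """Toy model of one DSP module that hosts a single CONV unit at a time."""

    def __init__(self):
        self.kernel = None   # kernel values of the CONV unit currently built
        self.history = []    # names of the CONV units built so far, in order

    def build(self, name, kernel):
        assert self.kernel is None, "the module must be released before rebuilding"
        self.kernel = list(kernel)   # load the kernel values
        self.history.append(name)

    def compute(self, value):
        # Stand-in for the convolution: scale by the sum of the kernel weights.
        return value * sum(self.kernel)

    def release(self):
        self.kernel = None           # clear the loaded kernel values

def run_conv_units(dsp, conv_units, x):
    """Run the CONV units one after another on the same DSP module."""
    for name, kernel in conv_units:
        dsp.build(name, kernel)
        x = dsp.compute(x)
        dsp.release()
    return x

dsp = DspModule()
out = run_conv_units(dsp, [("conv_t1", [1, 1]), ("conv_t2", [2])], 3)
```

The point of the sketch is the ordering constraint: a new CONV unit is built only after the previous one has finished and the module has been released.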
Optionally, the DSP module includes N DSP operation submodules, and constructing the CONV calculation unit of the first moment on the DSP module specifically includes the following substeps:
according to the convolution kernel size n1 × n1 of the CONV calculation unit of the first moment, a first sub-CONV calculation unit is constructed on the 1st to the (n1 × n1)-th DSP operation submodules, each sub-CONV calculation unit performing a single-period convolution operation with the n1 × n1 convolution kernel; a second sub-CONV calculation unit is then constructed on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP operation submodules; and so on, until M1 sub-CONV calculation units have been constructed on the DSP module. If the first calculation parameter A_1 is a positive integer, M1 is equal to A_1; otherwise, M1 is the largest positive integer smaller than A_1, where
$$A_1 = \frac{N}{n1 \times n1}$$
Similarly, reconstructing the CONV calculation unit of the next moment on the DSP module specifically includes the following substeps:
according to the convolution kernel size n2 × n2 of the CONV calculation unit of the next moment, a first sub-CONV calculation unit is constructed on the 1st to the (n2 × n2)-th DSP operation submodules, each sub-CONV calculation unit performing a single-period convolution operation with the n2 × n2 convolution kernel; a second sub-CONV calculation unit is then constructed on the (n2 × n2 + 1)-th to the (2 × n2 × n2)-th DSP operation submodules; and so on, until M2 sub-CONV calculation units have been constructed on the DSP module. If the third calculation parameter A_3 is a positive integer, M2 is equal to A_3; otherwise, M2 is the largest positive integer smaller than A_3, where
$$A_3 = \frac{N}{n2 \times n2}$$
A single-period convolution operation is one convolution operation between a convolution kernel and the pixels it covers; for example, for a convolution kernel of size n1 × n1, a single-period convolution operation comprises n1 × n1 multiplications and n1 × n1 - 1 additions.
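The sizing rule for the sub-CONV calculation units amounts to an integer division of the number of DSP operation submodules by the number of kernel taps. The following Python sketch works through that arithmetic; it assumes that the calculation parameter is N divided by the kernel area, which is how the surrounding description reads, and the function names are hypothetical.

```python
def sub_conv_count(n_dsp, kernel_size):
    """Number of sub-CONV units that fit on n_dsp DSP operation submodules."""
    taps = kernel_size * kernel_size   # submodules used by one sub-CONV unit
    return n_dsp // taps               # floor of n_dsp / taps

def dsp_range(unit_index, kernel_size):
    """1-based first and last submodule used by the unit_index-th sub-CONV unit."""
    taps = kernel_size * kernel_size
    return (unit_index - 1) * taps + 1, unit_index * taps

def single_period_ops(kernel_size):
    """Multiplications and additions in one single-period convolution operation."""
    taps = kernel_size * kernel_size
    return taps, taps - 1

# Example with a hypothetical N = 100 submodules.
m_3x3 = sub_conv_count(100, 3)   # 100 / 9 is not an integer, so the floor is taken
m_1x1 = sub_conv_count(100, 1)   # 100 / 1 is an integer, so it is used directly
```

With 100 submodules and a 3 × 3 kernel, 11 sub-CONV units use 99 submodules and one submodule stays idle; when the next CONV calculation unit has a 1 × 1 kernel, all 100 submodules are used.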
S26, constructing a storage area on the FPGA: constructing a storage area on the DDR storage module, wherein the storage area is used for storing an image matrix of the CONV calculation unit input at a first moment, and the image matrix is generated according to a prediction sample; before the image matrix is stored, all rows of the image matrix are grouped to obtain a plurality of array elements, then 90-degree matrix transposition is carried out on each array element group, the array element groups after transposition are written into a storage row of a storage area according to the row sequence, the storage rows written into by each array element group are different, during convolution operation, the array elements of the array elements in the storage rows are read, and the read array elements are input into a CONV computing unit at a first moment.
Optionally, grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, writing the transposed array element groups into storage rows of the storage area in row order with each array element group written into a different storage row, and, during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting them to the CONV calculation unit at the first moment, specifically includes the following sub-steps:
grouping all rows of the image matrix to obtain P array element groups, wherein the number P of array element groups is determined as follows: a second calculation parameter is calculated as
$$A_2 = \frac{L \times Bit_0}{Bit_1}$$
where L denotes the total number of rows of the image matrix, Bit_1 denotes the bus bit width of the storage area, and Bit_0 denotes the pixel bit width of the image matrix. If the second calculation parameter is a positive integer, the number P of array element groups equals the second calculation parameter, and the number of rows of each array element group is
$$L_1 = \frac{Bit_1}{Bit_0};$$
otherwise, P is the smallest positive integer larger than the second calculation parameter, the number of rows of each of the first to the (P − 1)-th array element groups is L_1, and the number of rows of the P-th array element group is
$$L - (P - 1) \times L_1;$$
performing a 90-degree matrix transposition on each array element group;
writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row;
when the first round of single-period convolution operation is carried out, reading M1 array elements from the storage area and outputting them to the CONV calculation unit at the first moment, wherein each sub-CONV calculation unit receives one of the M1 array elements and different sub-CONV calculation units receive different array elements;
and when the next round of single-period convolution operation is carried out, if the number of array elements remaining in the storage area is less than M1, reading all remaining array elements from the storage area and outputting them to the CONV calculation unit at the first moment; otherwise, reading M1 array elements from the storage area and outputting them to the CONV calculation unit at the first moment, until all array elements of the P storage rows have been output from the storage area to the CONV calculation unit at the first moment.
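The sub-steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names are hypothetical, the image is a plain list of pixel rows, the bus bit width is assumed to be an exact multiple of the pixel bit width, and the "90-degree matrix transposition" is assumed to be an ordinary transpose in which rows become columns. The order in which array elements are picked from the storage rows in each round is not modelled; only the number read per round is.

```python
def group_rows(image, bus_bits, pixel_bits):
    """Split the image rows into P groups of L1 rows (the last group may be shorter)."""
    rows_per_group = bus_bits // pixel_bits                             # L1 = Bit1 / Bit0
    num_groups = (len(image) + rows_per_group - 1) // rows_per_group    # P, rounded up
    return [image[g * rows_per_group:(g + 1) * rows_per_group]
            for g in range(num_groups)]


def to_storage_row(group):
    """Transpose one group, then flatten it in row order into one storage row."""
    transposed = zip(*group)          # columns of the group become rows
    return [pixel for row in transposed for pixel in row]


def round_sizes(total_elements, m1):
    """Number of array elements read in each round of single-period convolution."""
    sizes = []
    remaining = total_elements
    while remaining > 0:
        take = min(m1, remaining)     # read everything left if fewer than M1 remain
        sizes.append(take)
        remaining -= take
    return sizes


# Toy example: 5 rows x 3 columns, 16-bit bus, 8-bit pixels, so L1 = 2 and P = 3.
image = [[r * 3 + c for c in range(3)] for r in range(5)]
storage = [to_storage_row(g) for g in group_rows(image, 16, 8)]
print(storage)              # [[0, 3, 1, 4, 2, 5], [6, 9, 7, 10, 8, 11], [12, 13, 14]]
print(round_sizes(15, 4))   # [4, 4, 4, 3]
```

After transposition, consecutive entries of a storage row come from different rows of the original image at the same column, so a single bus-wide read can supply one pixel to each of several row-wise sub-CONV calculation units.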
S3, inputting the prediction sample into the constructed YOLOV4 target detection model to identify the preset target.
In this embodiment, the image matrix is 608 × 608 and the FPGA model is V5, which contains 3600 DSP calculation sub-modules. The convolution kernel size of the CONV calculation unit at the first moment is 3 × 3, the bus bit width of the storage area is 512 bits, and the pixel bit width of the image matrix is 8 bits. A single-period convolution operation with a 3 × 3 convolution kernel requires 9 DSP calculation sub-modules, so M1 is 3600 / 9 = 400. The number of array element groups is 10: the first to ninth array element groups each contain 64 rows, and the tenth array element group contains 32 rows. The number of storage rows is therefore also 10, and 10 ports are designated on the storage area for outputting the array elements in the storage rows: the first to tenth storage rows correspond to port0 to port9, respectively.
When a round of single-period convolution operation is carried out, a total of 400 array elements are output from port0 to port6 of the storage area to the CONV calculation unit at the first moment; each sub-CONV calculation unit receives one array element, and the single-period convolution operations of the 400 array elements are performed simultaneously. Port0 to port5 each output 64 array elements from their corresponding storage rows, and port6 outputs 16 array elements from its corresponding storage row (6 × 64 + 16 = 400). After multiple rounds of single-period convolution operation, the single-period convolution operations for 400 rows of the image matrix before transposition are completed; the single-period convolution operations for the remaining 208 rows are then performed, which completes the single-period convolution operations for one frame of prediction sample with a 608 × 608 image matrix.
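The port figures in this embodiment can be checked with a short calculation. The sketch below assumes that each port delivers at most one bus word per round (bus bit width divided by pixel bit width array elements) and that ports are filled in order starting from port0; `port_outputs` is a hypothetical helper name, not a term from the patent.

```python
def port_outputs(m1, bus_bits, pixel_bits, num_ports):
    """Array elements taken from each port in one round, filling ports in order."""
    per_port = bus_bits // pixel_bits     # elements in one bus word
    outputs = []
    remaining = m1
    for _ in range(num_ports):
        take = min(per_port, remaining)
        outputs.append(take)
        remaining -= take
    return outputs


m1 = 3600 // (3 * 3)                      # sub-CONV units for a 3 x 3 kernel
outs = port_outputs(m1, 512, 8, 10)
print(m1)          # 400
print(outs)        # [64, 64, 64, 64, 64, 64, 16, 0, 0, 0]
print(608 - m1)    # 208 rows left for the second pass
```

Under these assumptions the result matches the embodiment: port0 to port5 supply 64 array elements each, port6 supplies the remaining 16, and port7 to port9 are idle during the first pass.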
This embodiment has the following notable advantages: the calculation units of the YOLOV4 network structure layer are decomposed, and the computing resources of the FPGA are fully utilized through time-division multiplexing to meet the high computing power required by the YOLOV4 network; in addition, the arrangement of the image matrix when it is written into the DDR storage and the combination of the data read out from the DDR storage are both optimized, so that a YOLOV4 target detection model is constructed on the FPGA platform. Compared with the embedded-platform YOLOV3 application mentioned in the background, this YOLOV4 target detection model can identify a preset target with higher accuracy and better real-time performance.
The foregoing describes preferred embodiments of the present invention. It is to be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the concepts described herein in accordance with the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (6)

1. A method for realizing a YOLOV4 target detection network based on an FPGA (field programmable gate array), wherein the FPGA comprises a DDR (double data rate) storage module, an LUT (look-up table) module and a DSP (digital signal processor) module, and is characterized by comprising the following steps:
S1, training a YOLOV4 target detection network on a server based on training samples;
S2, after the training of the YOLOV4 target detection network is completed, constructing on the FPGA a YOLOV4 target detection model for identifying a preset target in a prediction sample, comprising the following sub-steps:
S21, decomposing the calculation units in the YOLOV4 network structure layer: decomposing the YOLOV4 network structure layer to obtain a plurality of calculation units, the calculation units comprising a Mish function calculation unit, a Leaky Relu function calculation unit, a BN batch normalization calculation unit and a CONV calculation unit;
S22, constructing the Mish function calculation unit on the FPGA: pre-storing output values of the Mish function in the LUT module, and looking up, in the LUT module, the output value corresponding to an input value of the Mish function by using the input value as an address;
S23, constructing the Leaky Relu function calculation unit on the FPGA: constructing a Leaky Relu logic module on the FPGA, the Leaky Relu logic module being used for receiving a first input value of the Leaky Relu function and generating the first output value of the Leaky Relu function corresponding to the first input value, wherein the first input value is greater than zero and is equal to its corresponding first output value; pre-storing second output values of the Leaky Relu function in the LUT module, and looking up, in the LUT module, the second output value corresponding to a second input value of the Leaky Relu function by using the second input value as an address, wherein the second input value is less than or equal to zero;
S24, constructing the BN batch normalization calculation unit on the FPGA: constructing a BN batch normalization logic module on the FPGA, and pre-storing in the BN batch normalization logic module the variance and the mean used for batch normalization, wherein the variance used for batch normalization is the mean of the variances of all training samples, and the mean used for batch normalization is the mean of the means of all training samples;
S25, constructing the CONV calculation unit on the FPGA: according to the chronological order of the convolution operations of the CONV calculation units, constructing the CONV calculation unit at the first moment on the DSP module, and after the calculation of the CONV calculation unit at the first moment is completed, reconstructing the CONV calculation unit at the next moment on the DSP module;
S26, constructing a storage area on the FPGA: constructing a storage area on the DDR storage module, the storage area being used for storing the image matrix input to the CONV calculation unit at the first moment, the image matrix being generated from the prediction sample; before the image matrix is stored, grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, and writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row; during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting the read array elements to the CONV calculation unit at the first moment;
S3, inputting the prediction sample into the constructed YOLOV4 target detection model to identify the preset target.
2. The method for implementing the FPGA-based YOLOV4 target detection network according to claim 1, wherein decomposing the YOLOV4 network structure layer to obtain a plurality of computing units specifically includes:
decomposing a skeleton layer of the YOLOV4 network structure layer to obtain a CSPn calculation unit;
decomposing the neck layer of the YOLOV4 network structure layer to obtain a CBL calculation unit;
decomposing a prediction layer of a YOLOV4 network structure layer to obtain a CBL calculation unit and a CONV calculation unit;
decomposing the CSPn calculating unit to obtain a CONV calculating unit, a BN batch normalization calculating unit and a Mish function calculating unit;
and decomposing the CBL calculation unit to obtain a CONV calculation unit, a BN batch normalization calculation unit and a Leaky Relu function calculation unit.
3. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein the input values of the Mish function are a plurality of discrete values in the interval [-6, 6], the discrete values are arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
4. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein the second input values of the Leaky Relu function are a plurality of discrete values in the interval [-10, 0], the discrete values are arranged in order of magnitude, and the absolute value of the difference between two adjacent discrete values is 0.1.
5. The FPGA-based YOLOV4 target detection network implementation method of claim 1, wherein the DSP module comprises N DSP calculation sub-modules;
the constructing of the CONV calculation unit at the first moment on the DSP module specifically includes:
according to the convolution kernel size n1 × n1 of the CONV calculation unit at the first moment, constructing a first sub-CONV calculation unit on the first to the (n1 × n1)-th DSP calculation sub-modules, the sub-CONV calculation unit being used for performing the single-period convolution operation of an n1 × n1 convolution kernel, then constructing a second sub-CONV calculation unit on the (n1 × n1 + 1)-th to the (2 × n1 × n1)-th DSP calculation sub-modules, and so on, until M1 sub-CONV calculation units have been constructed on the DSP module; if the first calculation parameter A_1 is a positive integer, M1 equals A_1; otherwise, M1 is the largest positive integer smaller than A_1, where
$$A_1 = \frac{N}{n1 \times n1}.$$
6. The FPGA-based YOLOV4 target detection network implementation method according to claim 5, wherein grouping all rows of the image matrix to obtain a plurality of array element groups, performing a 90-degree matrix transposition on each array element group, writing the transposed array element groups into storage rows of the storage area in row order with each array element group written into a different storage row, and, during the convolution operation, reading the array elements of the array element groups from the storage rows and inputting them to the CONV calculation unit at the first moment, specifically comprises:
grouping all rows of the image matrix to obtain P array element groups;
calculating a second calculation parameter
$$A_2 = \frac{L \times Bit_0}{Bit_1}$$
where L denotes the total number of rows of the image matrix, Bit_1 denotes the bus bit width of the storage area, and Bit_0 denotes the pixel bit width of the image matrix; if the second calculation parameter is a positive integer, the number P of array element groups equals the second calculation parameter, and the number of rows of each array element group is
$$L_1 = \frac{Bit_1}{Bit_0};$$
otherwise, P is the smallest positive integer larger than the second calculation parameter, the number of rows of each of the first to the (P − 1)-th array element groups is L_1, and the number of rows of the P-th array element group is
$$L - (P - 1) \times L_1;$$
Performing 90-degree matrix transposition on each array element group;
writing the transposed array element groups into storage rows of the storage area in row order, each array element group being written into a different storage row;
when the first round of single-period convolution operation is carried out, reading M1 array elements from the storage area and outputting them to the CONV calculation unit at the first moment, wherein each sub-CONV calculation unit receives one of the M1 array elements;
and when the next round of single-period convolution operation is carried out, if the number of array elements remaining in the storage area is less than M1, reading all remaining array elements from the storage area and outputting them to the CONV calculation unit at the first moment; otherwise, reading M1 array elements from the storage area and outputting them to the CONV calculation unit at the first moment, until all array elements of the P storage rows have been output from the storage area to the CONV calculation unit at the first moment.
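For illustration only, the look-up-table approach of steps S22 and S23, with the input grids of claims 3 and 4, might be modelled in Python as below. Several details are assumptions that the claims do not state: out-of-range inputs are clamped to the table limits, an input is rounded to the nearest grid point to form the address, and the negative-side slope of the Leaky Relu function is taken as 0.1. The Mish definition x · tanh(ln(1 + e^x)) is the standard one; all function and constant names are hypothetical.

```python
import math

STEP = 0.1  # spacing between adjacent table inputs (claims 3 and 4)


def build_lut(func, low, high):
    """Pre-compute func on the grid low, low + STEP, ..., high."""
    size = round((high - low) / STEP) + 1
    return [func(low + i * STEP) for i in range(size)]


def lookup(lut, low, high, x):
    """Use the clamped, rounded input value as the table address."""
    x = min(max(x, low), high)             # assumption: clamp out-of-range inputs
    return lut[round((x - low) / STEP)]    # assumption: nearest grid point


def mish(x):
    """Reference Mish function: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


NEG_SLOPE = 0.1  # assumption: the slope is not given in the claims
MISH_LUT = build_lut(mish, -6.0, 6.0)                        # 121 entries
LEAKY_LUT = build_lut(lambda x: NEG_SLOPE * x, -10.0, 0.0)   # 101 entries


def mish_fpga(x):
    return lookup(MISH_LUT, -6.0, 6.0, x)


def leaky_relu_fpga(x):
    # Inputs greater than zero pass straight through; others are looked up.
    return x if x > 0 else lookup(LEAKY_LUT, -10.0, 0.0, x)
```

With a step of 0.1 the two tables hold 121 and 101 entries respectively, which is why a table look-up can replace the exponential and hyperbolic tangent evaluations at the cost of quantising the input.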

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210983908.3A CN115049907B (en) 2022-08-17 2022-08-17 FPGA-based YOLOV4 target detection network implementation method

Publications (2)

Publication Number Publication Date
CN115049907A CN115049907A (en) 2022-09-13
CN115049907B true CN115049907B (en) 2022-10-28

Family

ID=83168081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210983908.3A Active CN115049907B (en) 2022-08-17 2022-08-17 FPGA-based YOLOV4 target detection network implementation method

Country Status (1)

Country Link
CN (1) CN115049907B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214504A (en) * 2018-08-24 2019-01-15 北京邮电大学深圳研究院 A kind of YOLO network forward inference accelerator design method based on FPGA
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN112911171A (en) * 2021-02-04 2021-06-04 上海航天控制技术研究所 Intelligent photoelectric information processing system and method based on accelerated processing
CN112926410A (en) * 2021-02-03 2021-06-08 深圳市维海德技术股份有限公司 Target tracking method and device, storage medium and intelligent video system
CN112989924A (en) * 2021-01-26 2021-06-18 深圳市优必选科技股份有限公司 Target detection method, target detection device and terminal equipment
CN114154621A (en) * 2021-11-30 2022-03-08 长沙行深智能科技有限公司 Convolutional neural network image processing method and device based on FPGA
CN114359662A (en) * 2021-12-24 2022-04-15 江苏大学 Implementation method of convolutional neural network based on heterogeneous FPGA and fusion multiresolution
CN114694002A (en) * 2022-03-11 2022-07-01 中国电子科技集团公司第五十四研究所 Infrared target detection method based on feature fusion and attention mechanism
CN114743273A (en) * 2022-04-28 2022-07-12 西安交通大学 Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN114863258A (en) * 2022-07-06 2022-08-05 四川迪晟新达类脑智能技术有限公司 Method for detecting small target based on visual angle conversion in sea-sky-line scene

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200348662A1 (en) * 2016-05-09 2020-11-05 Strong Force Iot Portfolio 2016, Llc Platform for facilitating development of intelligence in an industrial internet of things system
US11112784B2 (en) * 2016-05-09 2021-09-07 Strong Force Iot Portfolio 2016, Llc Methods and systems for communications in an industrial internet of things data collection environment with large data sets

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
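Research on software and hardware acceleration of the Acc-YOLOv4 target detection algorithm;Zhang Chunye;《China Master's Theses Full-text Database, Information Science and Technology》;20220315;I138-2096 *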
FPGA Overlay Processor for Deep Neural Networks;Yunxuan Yu;《UNIVERSITY OF CALIFORNIA Los Angeles DOCTOR STUDENT THESES》;20201231;1-186 *
Real-Time YOLOv4 FPGA Design with Catapult High-Level Synthesis;Heinsius, L.R.;《UNIVERSITY OF TWENTE MASTER STUDENT THESES》;20211231;1-99 *
Resource- and Power-Efficient High-Performance Object Detection Inference Acceleration Using FPGA;Solomon Negussie Tesema et al.;《Electronics》;20220608;1-29 *
Resource-constrained FPGA implementation of YOLOv2;Zhichao Zhang et al.;《Springer:Neural Computing and Applications》;20220529;1-19 *
FPGA acceleration method based on improved YOLOv4-Tiny;Cao Yuanjie et al.;《Radio Engineering》;20220107;604-611 *
Design and implementation of a deep-learning-based target detection algorithm for aerial images;Wang Xuechun;《China Master's Theses Full-text Database, Information Science and Technology》;20220115;I138-1387 *

Also Published As

Publication number Publication date
CN115049907A (en) 2022-09-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant