CN110175670A - A kind of method and system for realizing YOLOv2 detection network based on FPGA - Google Patents
A kind of method and system for realizing YOLOv2 detection network based on FPGA
- Publication number: CN110175670A
- Application number: CN201910280748.4A
- Authority
- CN
- China
- Prior art keywords: module, block, result, buffer, network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a method and system for implementing the YOLOv2 detection network on an FPGA, belonging to the field of intelligent hardware. The invention partitions the input feature map and weight parameters of each layer of the detection network into blocks, selecting the block size according to the FPGA's computing resources and on-chip storage capacity. Parameters are read and computed in batches; intermediate results are buffered in on-chip storage and written back to DRAM only after the final result of the layer has been computed, overcoming the limitation that on-chip resources and memory bandwidth cannot accommodate computing a whole layer at once. Because on-chip memory is insufficient to store the model parameters, the invention uses double buffering and introduces pipelining between different layers, reducing the latency incurred each time model parameters are read from DRAM. While greatly reducing the required cache space, it also improves the algorithm's forward-inference speed, realizes seamless buffering and processing of the input data stream, and makes maximal use of both memory space and logic resources.
Description
Technical field
The invention belongs to the field of intelligent hardware, and more particularly relates to a method and system for implementing the YOLOv2 detection network on an FPGA.
Background art
Object detection is one of the foundational problems in computer vision and is widely used in scenarios such as scene understanding, autonomous driving and wearable devices. Object detection refers to locating and classifying each target in an image. With the arrival of the deep-learning era, object-detection algorithms based on convolutional neural networks have made remarkable progress; however, commercial deployment of these algorithms remains an urgent open problem. The YOLOv2 detection network is a high-performance object-detection algorithm that can meet real-time requirements; its structure is simple and its network has relatively few layers, making it a good choice for bringing object detection to industrial deployment.
At present, both the training stage and the forward-inference stage of deep-learning-based object-detection algorithms are completed in GPU or CPU environments. Although some algorithms achieve real-time processing on a GPU, the GPU's high power consumption makes a GPU-based scheme unsuitable for deploying these algorithms in low-power embedded systems; a CPU-based deployment, on the other hand, can hardly meet the real-time requirement. A platform that compromises between power consumption and real-time performance is therefore needed.
Current deep-learning-based object-detection algorithms are built on convolutional neural networks, so when porting these algorithms to an FPGA platform, researchers at home and abroad have focused mainly on how to run convolutional neural networks at high speed on the FPGA. Each layer of the YOLOv2 detection network has a large number of parameters and involves a large amount of computation; the on-chip storage of an FPGA is insufficient to cache all the parameters of a whole layer, and the limited logic resources likewise cannot support all the operations of a whole layer.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to solve the problem that, when porting YOLOv2 onto an FPGA, the FPGA's on-chip storage is insufficient to cache all the parameters of a whole layer, while its limited logic resources are likewise insufficient to support all the computation of a whole layer.
To achieve the above object, in a first aspect, an embodiment of the invention provides a method for implementing the YOLOv2 detection network on an FPGA, the method comprising the following steps:
S1. Partition the YOLOv2 detection network into 23 layers and 2 operations according to four kinds of network blocks — CBL, CBLM, CL and M — and two kinds of operations. A CBL block is the series combination of a two-dimensional convolution layer, a BN layer and the activation function Leaky ReLU; a CBLM block is the series combination of 1 CBL block and 1 max-pooling layer; a CL block is the series combination of a two-dimensional convolution layer and a linear activation function; an M block is a standalone max-pooling layer; the first operation is Reorg and the second operation is Concat. Construct the processing unit as follows: connect 1 BN module, a Leaky ReLU module and a max-pooling module in series; place this cascade in parallel with 1 Reorg module; and connect N_CI two-dimensional convolution modules of kernel size K*K in series before the parallel structure.
S2. Divide the original input image into multiple first-layer input image blocks of size Tci*Tr*Tc, and divide the weight parameters of every layer except the M blocks into multiple weight blocks of size Tco*Tci*K*K; store all image blocks and weight blocks in external memory.
S3. Successively load the current layer's input image blocks and corresponding weight blocks from external memory into the input buffer module, and load the image blocks and weight blocks from the input buffer module into m mutually independent processing units.
S4. All processing units compute simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in the output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the enables of the BN, Leaky ReLU and max-pooling modules are deasserted and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, judge whether a Reorg operation is required: if so, the enables of the BN, Leaky ReLU, max-pooling and Reorg modules are asserted, the two result streams are stored separately in the output buffer module, and the data of the output buffer module are finally written back to external memory, such that the write-back position of the Reorg module's output in external memory immediately follows the write-back position of layer 21's output; otherwise, the enables of the BN, Leaky ReLU and max-pooling modules are asserted, the result is stored in the output buffer module, and the data of the output buffer module are finally written back to external memory.
S5. Use the final result written back to external memory as the input image blocks of the next layer.
S6. Repeat steps S3–S5 until all 23 layers have been computed.
Specifically, the two-dimensional convolution module is implemented with a sliding window and matrix dot products; the max-pooling module is implemented with a sliding window and comparators; the Leaky ReLU module is implemented with a fixed-point multiplier; and the Reorg operation is implemented with a single-port RAM according to the positional relationship between input data and output data.
Specifically, the block size is selected according to the FPGA's computing resources and the size of its on-chip storage.
Specifically, the input buffer module and the output buffer module both use double buffering: while one buffer block reads parameters from DRAM, the other buffer block passes data to the computing units for processing.
Specifically, in the input buffer module, while one buffer delivers data to the computing modules for calculation, the other buffer fetches data from external memory via DMA, the two being used alternately. In the output buffer module, while one buffer stores intermediate results, the other, holding final results, writes them back to external memory via DMA, the two again being used alternately. Each buffer module contains two buffer blocks; the size of an input buffer block is Tci*Tco*K*K+4*Tco+Tci*Tr*Tc, and the size of an output buffer block is Tco*Tr*Tc.
Specifically, the method further comprises: after step S6, applying non-maximum suppression to the final output result to obtain the best prediction box for each target.
In a second aspect, an embodiment of the invention provides a system for implementing the YOLOv2 detection network on an FPGA, the system using the method for implementing the YOLOv2 detection network on an FPGA described in the first aspect above.
In a third aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for implementing the YOLOv2 detection network on an FPGA described in the first aspect above.
In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:
1. The invention partitions the input feature map and weight parameters of each layer of the YOLOv2 detection network into blocks, selecting the block size according to the FPGA's computing resources and on-chip storage capacity. By reading parameters in batches and computing on them, buffering intermediate results in on-chip storage, and writing back to DRAM only after the final result of the layer has been computed, it overcomes the limitation that on-chip resources and memory bandwidth cannot accommodate computing a whole layer at once.
2. Because on-chip memory is insufficient to store the algorithm's model parameters, the invention uses double buffering and introduces pipelining between different layers: while one buffer block reads parameters from DRAM, the other passes data to the computing units for processing. This reduces the latency incurred each time model parameters are read from DRAM and, while greatly reducing the required cache space, also improves the algorithm's forward-inference speed, realizes seamless buffering and processing of the input data stream, and makes maximal use of memory space and logic resources.
Detailed description of the invention
Fig. 1 is a flowchart of a method for implementing the YOLOv2 detection network on an FPGA according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the YOLOv2 network layer partitioning according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the processing-unit structure according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the implementation of the two-dimensional convolution module according to an embodiment of the invention;
Fig. 5 is a schematic diagram of a system for implementing the YOLOv2 detection network on an FPGA according to an embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions and advantages of the invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
As shown in Fig. 1, the invention proposes a method for implementing the YOLOv2 detection network on an FPGA, the method comprising the following steps:
S1. Partition the YOLOv2 detection network into 23 layers and 2 operations according to four kinds of network blocks — CBL, CBLM, CL and M — and two kinds of operations. A CBL block is the series combination of a two-dimensional convolution layer, a BN layer and the activation function Leaky ReLU; a CBLM block is the series combination of 1 CBL block and 1 max-pooling layer; a CL block is the series combination of a two-dimensional convolution layer and a linear activation function; an M block is a standalone max-pooling layer; the first operation is Reorg and the second operation is Concat. Construct the processing unit as follows: connect 1 BN module, a Leaky ReLU module and a max-pooling module in series; place this cascade in parallel with 1 Reorg module; and connect N_CI two-dimensional convolution modules of kernel size K*K in series before the parallel structure.
S2. Divide the original input image into multiple first-layer input image blocks of size Tci*Tr*Tc, and divide the weight parameters of every layer except the M blocks into multiple weight blocks of size Tco*Tci*K*K; store all image blocks and weight blocks in external memory.
S3. Successively load the current layer's input image blocks and corresponding weight blocks from external memory into the input buffer module, and load the image blocks and weight blocks from the input buffer module into m mutually independent processing units.
S4. All processing units compute simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in the output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the enables of the BN, Leaky ReLU and max-pooling modules are deasserted and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, judge whether a Reorg operation is required: if so, the enables of the BN, Leaky ReLU, max-pooling and Reorg modules are asserted, the two result streams are stored separately in the output buffer module, and the data of the output buffer module are finally written back to external memory, such that the write-back position of the Reorg module's output in external memory immediately follows the write-back position of layer 21's output; otherwise, the enables of the BN, Leaky ReLU and max-pooling modules are asserted, the result is stored in the output buffer module, and the data of the output buffer module are finally written back to external memory.
S5. Use the final result written back to external memory as the input image blocks of the next layer.
S6. Repeat steps S3–S5 until all 23 layers have been computed.
Step S1. Partition the YOLOv2 detection network into 23 layers and 2 operations according to four kinds of network blocks — CBL, CBLM, CL and M — and two kinds of operations. A CBL block is the series combination of a two-dimensional convolution layer, a BN layer and the activation function Leaky ReLU; a CBLM block is the series combination of 1 CBL block and 1 max-pooling layer; a CL block is the series combination of a two-dimensional convolution layer and a linear activation function; an M block is a standalone max-pooling layer; the first operation is Reorg and the second operation is Concat. Construct the processing unit as follows: connect 1 BN module, a Leaky ReLU module and a max-pooling module in series; place this cascade in parallel with 1 Reorg module; and connect N_CI two-dimensional convolution modules of kernel size K*K in series before the parallel structure.
A convolutional neural network consists mainly of convolution layers, BN layers, pooling layers and activation functions. Convolution layers are compute-intensive and weight-shared: compute-intensive means that essentially all of the network's computation lies in the convolution layers; weight-shared means that different positions of an input feature map share the weights of the convolution kernel. The input data of a convolution layer is three-dimensional, its convolution parameters are four-dimensional, and its output data is three-dimensional; its core module is the two-dimensional convolution.
As shown in Fig. 2, in order to introduce a pipeline among the convolution layer, BN layer, activation function and pooling layer, the invention divides the YOLOv2 network layers into four types. The first is the combination of a convolution layer, a BN layer and the activation function Leaky ReLU (abbreviated CBL); the second is a convolution layer, a BN layer, Leaky ReLU and max pooling (abbreviated CBLM); the third is the combination of a convolution layer and a linear activation function (abbreviated CL) — the computation of these three layer types all lies in the convolution layer. The fourth is a standalone max-pooling layer (abbreviated M). After YOLOv2 is divided according to these layer types, its number of network layers is 23. In addition, two operation types are distinguished: the rearrangement of feature maps (Reorg) and the stacking of feature maps (Concat).
The computation of the YOLOv2 detection network lies entirely in the convolution layers. The computation of a convolution layer exhibits four kinds of independence: the individual weights within one two-dimensional convolution kernel are mutually independent; the two-dimensional convolutions of different input channels are mutually independent; different convolution windows within the same feature map are mutually independent; and the two-dimensional convolutions of different output channels are mutually independent. From these four kinds of independence, four kinds of parallelism in the convolution process can be obtained.
The parallelism of the pooling layer closely resembles that of the convolution layer: parallelism can be introduced across input channels, within a pooling window, and across different pooling windows of the same feature map. In the pooling layer, the parallelism within a pooling window is 4, the parallelism across different pooling windows of the same feature map is M, and the parallelism introduced across input channels is N, so the total parallelism is 4 × M × N.
In order to introduce parallelism across input channels, each processing unit is composed of multiple two-dimensional convolvers. To increase the design's processing speed, the activation-function module and the max-pooling module can be pipelined once the final convolution result is obtained. As shown in Fig. 3, a processing unit includes N_CI two-dimensional convolution modules, 1 BN module, a Leaky ReLU module, a max-pooling module and a Reorg module: the BN module, Leaky ReLU module and max-pooling module are connected in series; this cascade is placed in parallel with the Reorg module; and the N_CI two-dimensional convolution modules are connected in series before the parallel structure, yielding the processing unit. N_CI is preferably 4.
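As an illustrative software sketch (all names hypothetical, and the max-pooling stage omitted for brevity), the control behaviour of one processing unit — N_CI parallel two-dimensional convolutions whose sum is accumulated into the output buffer, with the BN/Leaky ReLU cascade enabled only when the accumulation is a final result — might look like:

```python
def conv2d(window, kernel):
    """Dot product of a K x K input window with a K x K kernel."""
    K = len(kernel)
    return sum(window[i][j] * kernel[i][j] for i in range(K) for j in range(K))

def processing_unit(windows, kernels, partial, final, bn=(1.0, 0.0)):
    """One accumulation step of the processing unit (hypothetical model).

    windows/kernels: N_CI input-channel windows and their kernels, computed
    in parallel in hardware. `partial` is the intermediate value already held
    in the output buffer. When `final` is False the BN and Leaky ReLU stages
    are disabled and the raw accumulation is returned for re-buffering."""
    acc = partial + sum(conv2d(w, k) for w, k in zip(windows, kernels))
    if not final:
        return acc                      # intermediate result: bypass BN/activation
    gamma, beta = bn                    # BN folded to a per-channel scale and bias
    y = gamma * acc + beta
    return y if y >= 0 else 0.1 * y     # Leaky ReLU with slope 0.1
```

The enable/disable flag mirrors the patent's scheme of asserting or deasserting the BN, Leaky ReLU and pooling modules depending on whether the accumulation over input-channel tiles is complete.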
The convolution layer conv is implemented directly with a sliding-window circuit generated from FIFOs (first-in-first-out queues) and shift registers, together with a matrix dot-product circuit. The pooling layer max pooling is likewise implemented with a sliding-window circuit generated from FIFOs and shift registers, plus comparators. The activation function needs only a fixed-point multiplier. The Reorg layer can be implemented with a single-port RAM according to the positional relationship between each pixel of the input feature map and each pixel of the output feature map.
Assume the two-dimensional convolution kernel size is K × K and the input image size is W × H; the sliding window can then be realized with K FIFOs of depth greater than W plus a shift register. In the invention K takes two values, 3 and 1. Taking a 3 × 3 kernel and a 5 × 5 input image as an example, as shown in Fig. 4, a 3 × 3 sliding window is realized with 3 FIFOs of depth greater than 5 and a 3 × 3 shift register.
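A behavioural model of such a sliding-window circuit may clarify the dataflow. This is a software sketch, not the hardware itself: it uses K−1 line buffers plus a K × K shift register (a common variant; the text counts K FIFOs), consuming one pixel per simulated clock and emitting each valid window in raster order:

```python
from collections import deque

def sliding_windows(pixel_stream, W, K):
    """Software model of a sliding-window circuit: K-1 line FIFOs of depth W
    plus a K x K shift register. One pixel enters per 'clock'; a valid K x K
    window is yielded once enough rows and columns have streamed in."""
    lines = [deque([0] * W, maxlen=W) for _ in range(K - 1)]  # line buffers
    window = [[0] * K for _ in range(K)]                      # shift register
    for count, px in enumerate(pixel_stream):
        # column of K vertically aligned pixels entering the window:
        # fronts of the line FIFOs (older rows) above the fresh pixel
        column = [lines[i][0] for i in range(K - 1)] + [px]
        for i in range(K - 1):          # cascade each pixel up the FIFO chain
            lines[i].append(column[i + 1])
        for row, val in zip(window, column):  # shift the K x K register left
            row.pop(0)
            row.append(val)
        r, c = divmod(count, W)         # raster position of the newest pixel
        if r >= K - 1 and c >= K - 1:
            yield r, c, [row[:] for row in window]
```

For a 3 × 3 image streamed with K = 2, the first window emitted is the top-left 2 × 2 block, exactly as a line-buffered circuit would produce it.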
Max pooling divides broadly into generating the pooling window and comparing pixel values, so the pooling layer is implemented mainly with a sliding window and comparators: the pooling window is generated in the same way as the convolution window in the preceding subsection, and the maximum within the window is obtained by a comparison circuit. The pooling kernel size is 2*2.
The activation function used by the YOLOv2 detection network is Leaky ReLU, whose mathematical expression is:
f(x) = x,     x ≥ 0
f(x) = 0.1x,  x < 0
Since the coefficient is 0.1, only the unsaturated case needs to be considered when truncating the bit width of the result; the activation function can therefore be realized with nothing more than a fixed-point multiplier.
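A minimal fixed-point sketch of this activation (the Q-format and rounding choices are assumptions, not taken from the patent) shows why a single multiplier suffices — the negative branch is one integer multiply followed by a shift:

```python
def leaky_relu_fixed(x, frac_bits=8):
    """Fixed-point Leaky ReLU sketch: x is a signed integer with `frac_bits`
    fractional bits. The slope 0.1 is quantised to the same format, so the
    negative branch is one integer multiply plus a right shift, as a single
    fixed-point multiplier on the FPGA would compute it."""
    slope = round(0.1 * (1 << frac_bits))   # 0.1 in Q-format (26 for 8 frac bits)
    if x >= 0:
        return x                            # positive branch passes through
    return (x * slope) >> frac_bits         # multiply, then truncate back
```

With 8 fractional bits, an input of −1.0 (−256) maps to −26, i.e. about −0.1016 — the quantisation error of representing the 0.1 slope.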
The Reorg layer rearranges the input feature map according to the relationship between the input and output positions. It expands one input channel into four output channels, while the output feature map is 1/4 the size of the input. To implement the Reorg layer on the FPGA, one first derives the positional correspondence between each pixel of the output feature map and each pixel of the input feature map, given by:
col = (count - 1) % W
in_index = col + W * (row + H * Ci)
where count is the index of the input pixel, Ci is the current channel, W and H are the width and height of the input feature map, (row, col) is the pixel's position within the input feature map, F_in is the pixel input stream, and F_out is the pixel output stream. With this relation, the Reorg layer can be realized with a single-port RAM. In a concrete implementation, a counter counts the input data; from the count one recovers the channel number Ci and the position (row, col) within the input feature map — the width W and height H of the input feature map being known — and the address of the current input data in the RAM is then computed from the formula above.
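A short sketch of this counter-based address recovery follows. The decomposition of the count into (Ci, row, col) is an assumption consistent with raster order, and `reorg_out` illustrates the stride-2 output mapping (one input channel expanding into four output channels) that the text describes but does not give explicitly:

```python
def reorg_address(count, W, H):
    """Recover input-feature-map coordinates from the 1-based pixel counter,
    per the relations in the text: col = (count-1) % W and
    in_index = col + W*(row + H*Ci). Note in_index collapses back to the
    linear count, which is why a single-port RAM suffices."""
    n = count - 1
    col = n % W
    row = (n // W) % H          # assumed raster-order decomposition
    ci = n // (W * H)
    in_index = col + W * (row + H * ci)
    return ci, row, col, in_index

def reorg_out(ci, row, col, stride=2):
    """Map an input pixel to its Reorg output position (stride-2 assumption:
    each input channel expands into stride**2 = 4 output channels and the
    output map is 1/stride the width and height of the input)."""
    co = ci * stride * stride + (row % stride) * stride + (col % stride)
    return co, row // stride, col // stride
```

For a 4 × 4 feature map, the sixth input pixel decodes to (row, col) = (1, 1) of channel 0 and lands in output channel 3 at position (0, 0), matching the 4× channel expansion.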
Step S2. Divide the original input image into multiple first-layer input image blocks of size Tci*Tr*Tc, and divide the weight parameters of every layer except the M blocks into multiple weight blocks of size Tco*Tci*K*K; store all image blocks and weight blocks in external memory.
Since on-chip computing resources and storage are both limited, deploying the YOLOv2 detection network on the FPGA requires dividing each of its layers. A convolution layer, BN layer, Leaky ReLU and pooling layer are treated together as one layer, so that BN, Leaky ReLU and max pooling can be executed as soon as the convolution result is obtained, realizing a pipeline. Dividing each layer therefore amounts to dividing the convolution layer; the input feature map and weight parameters of each layer are then partitioned into blocks, the block size being selected according to the FPGA's computing resources and on-chip storage. Steps S1–S2 prepare the input data for implementing the YOLOv2 network on the FPGA.
To implement a convolution layer on the FPGA, the invention divides the three-dimensional convolution whose input dimension is CH_IN × Rin × Cin, whose four-dimensional kernel dimension is CH_OUT × CH_IN × K × K and whose output dimension is CH_OUT × R × C into multiple three-dimensional convolutions whose input dimension is Tci × TRin × TCin, whose four-dimensional kernel dimension is Tco × Tci × K × K and whose output dimension is Tco × Tr × Tc, where Rin × Cin is the size of the input feature map and R × C is the size of the output feature map, the two being related by the usual convolution formula. Each three-dimensional convolution so obtained contains Tco × Tci two-dimensional convolutions.
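The tiling above can be sketched as a loop nest. This is one possible schedule under stated assumptions (unit stride unless given, input-tile size TRin = (Tr−1)·S + K, and the input-channel loop innermost so that partial sums accumulate into the same output tile, matching the intermediate results kept in the output buffer):

```python
def tiled_conv_shapes(CH_IN, CH_OUT, K, R, C, Tci, Tco, Tr, Tc, S=1):
    """Enumerate the tile loop nest: a CH_OUT x CH_IN x K x K convolution
    with R x C output is split into tiles of Tco output channels, Tci input
    channels and Tr x Tc output pixels. Tiles along CH_IN accumulate into
    the same output tile; the last one yields the final result."""
    TRin, TCin = (Tr - 1) * S + K, (Tc - 1) * S + K   # input-tile height/width
    tiles = []
    for co in range(0, CH_OUT, Tco):
        for r in range(0, R, Tr):
            for c in range(0, C, Tc):
                for ci in range(0, CH_IN, Tci):       # accumulation loop
                    final = ci + Tci >= CH_IN         # last input tile -> final
                    tiles.append((co, ci, r, c, final))
    return TRin, TCin, tiles
```

Each tuple corresponds to one on-chip load-compute pass; the `final` flag is what drives the BN/activation enables in step S4.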
Step S3. Successively load the current layer's input image blocks and corresponding weight blocks from external memory into the input buffer module, and load the image blocks and weight blocks from the input buffer module into m mutually independent processing units.
In the computation of each layer, on-chip resources and memory bandwidth are too limited to compute the whole layer at once, so parameters can only be read in batches and computed. The controller loads pixel values from external memory according to the pixel coordinates, and loads the corresponding parameters from external memory according to the current input and output channels. The embodiment of the invention preferably takes m = 16. To introduce parallelism across output channels, the invention employs multiple processing units to compute the two-dimensional convolutions of different output channels simultaneously.
Step S4. All processing units compute simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in the output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the enables of the BN, Leaky ReLU and max-pooling modules are deasserted and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, judge whether a Reorg operation is required: if so, the enables of the BN, Leaky ReLU, max-pooling and Reorg modules are asserted, the two result streams are stored separately in the output buffer module, and the data of the output buffer module are finally written back to external memory, such that the write-back position of the Reorg module's output in external memory immediately follows the write-back position of layer 21's output; otherwise, the enables of the BN, Leaky ReLU and max-pooling modules are asserted, the result is stored in the output buffer module, and the data of the output buffer module are finally written back to external memory.
Intermediate results are buffered in on-chip storage and written back to DRAM only after the final result of the layer has been computed. To reduce the latency incurred each time model parameters are read from DRAM and realize seamless buffering and processing of the input data stream, the invention uses a double-buffer-block mechanism: while one buffer block reads parameters from DRAM, the other passes data to the computing units for processing.
The judgement of whether a Reorg operation is required is what implements the Concat operation.
Both the input buffer module and the output buffer module use double buffering, i.e. ping-pong operation. In the input buffer module, while one buffer delivers data to the computing modules for calculation, the other fetches data from external memory via DMA, the two being used alternately. In the output buffer module, while one buffer stores intermediate results, the other, holding final results, writes them back to external memory via DMA, the two again being used alternately. The double-buffer modules thus achieve seamless buffering and processing of the data. Each buffer module contains two buffer blocks: the size of an input buffer block is Tci*Tco*K*K+4*Tco+Tci*Tr*Tc, and the size of an output buffer block is Tco*Tr*Tc.
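A sequential sketch of the ping-pong schedule (in hardware the two halves run concurrently; here they simply alternate, with `dma_load` and `compute` as hypothetical stand-ins), together with the input-block size formula from the text:

```python
def input_block_words(Tci, Tco, Tr, Tc, K):
    """Input buffer-block size from the text: weights (Tci*Tco*K*K) plus
    per-output-channel BN/bias parameters (4*Tco) plus one image tile
    (Tci*Tr*Tc); the interpretation of the 4*Tco term is an assumption."""
    return Tci * Tco * K * K + 4 * Tco + Tci * Tr * Tc

def ping_pong(tiles, dma_load, compute):
    """While one buffer feeds the compute units, the other is refilled by
    DMA; the roles swap every tile (software model of the hardware overlap)."""
    bufs = [None, None]
    bufs[0] = dma_load(tiles[0])                 # prefetch the first tile
    results = []
    for i, tile in enumerate(tiles):
        if i + 1 < len(tiles):                   # load next while computing current
            bufs[(i + 1) % 2] = dma_load(tiles[i + 1])
        results.append(compute(bufs[i % 2]))
    return results
```

For example, with Tci = 4, Tco = 16, Tr = Tc = 26 and K = 3, one input buffer block holds 576 + 64 + 2704 = 3344 words.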
Step S5. Use the final result written back to external memory as the input image blocks of the next layer: the output image blocks of the current layer are the input image blocks of the next.
Step S6. Repeat steps S3–S5 until all 23 layers have been computed. The FPGA platform cannot support all operations of the YOLOv2 detection network simultaneously, so the layers can only be computed in turn. Non-maximum suppression is then applied to the final output result to obtain the best prediction box for each target.
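The non-maximum suppression step can be sketched as the standard greedy procedure (the IoU threshold of 0.5 is an illustrative assumption; the patent does not specify one):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression as run on the processor side: keep the
    highest-scoring box, drop every remaining box overlapping it by more
    than `thresh` IoU, repeat. Returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```

Two heavily overlapping detections of the same object collapse to the higher-scoring one, leaving one prediction box per target.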
As shown in Fig. 5, a system for implementing the YOLOv2 detection network on an FPGA comprises a processing system PS and programmable logic PL.
The processing system PS includes a central processing unit and external memory. The central processing unit is responsible for scheduling the forward-inference flow of the YOLOv2 detection network and configuring the DMA, and for applying non-maximum suppression to the network's final result to obtain the class of each target and its position in the image. The external memory is responsible for storing the model parameters and image data of the YOLOv2 detection network.
The programmable logic PL is composed of six parts: a DMA module (Direct Memory Access), a controller, an input buffer module, an output buffer module, a decoding module and a computing module.
The DMA module transfers data and instructions between the PS side and the on-chip buffers on the PL side.
Controller from external memory acquisition instruction and dispatch input buffer module, output buffer module, decoder module
And computing module.Specifically, controller successively loads current layer input picture block and corresponding weight from external memory
Block is loaded into m mutually independent processing units from input buffer module to input buffer module, by image block and weight block;Institute
There is processing unit while carrying out operation, in each two-dimensional convolution module, obtained convolution results buffer mould with output is stored in
Intermediate result in block is added to obtain accumulation result, if accumulation result is still intermediate result, controls BN module, Leaky
ReLU module and maximum value pond module are enabled invalid, and this accumulation result is stored in output buffer module;If tired
Adding result is final result, then judges whether to need to carry out Reorg operation, if so, control BN module, Leaky ReLU module,
Maximum value pond module and Reorg module are enabled effectively, two-way operation result are stored in respectively in output buffer module, finally
The data for exporting buffer module are write back in external memory, so that the output result by Reorg module is in external storage
And then 21 layers of output result writes back position in external memory for the position that writes back of device;Otherwise, BN module, Leaky are controlled
ReLU module and maximum value pond module are enabled effectively, and operation result is stored in output buffer module, finally that output is slow
The data of die block write back in external memory;The input figure of the final result of external memory as next layer will be write back to
As block.
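The accumulate-then-activate data flow the controller implements can be modeled in software: partial convolution results over input-channel tiles accumulate in the output buffer, and the post-processing stages run only on the final accumulation. A minimal sketch, with a hypothetical stand-in for the per-tile convolution:

```python
def conv_partial(tile_idx, out_size):
    # Hypothetical stand-in for one input-channel tile's convolution result.
    return [tile_idx + 1.0] * out_size

def leaky_relu(x, slope=0.1):
    return x if x > 0 else slope * x

def process_output_tile(num_in_tiles, out_size):
    out_buf = [0.0] * out_size  # intermediate results live in the output buffer
    for t in range(num_in_tiles):
        part = conv_partial(t, out_size)
        out_buf = [a + b for a, b in zip(out_buf, part)]  # accumulate
        if t < num_in_tiles - 1:
            continue  # still intermediate: BN/Leaky ReLU/pooling stay disabled
        out_buf = [leaky_relu(v) for v in out_buf]  # final: post-processing enabled
    return out_buf

print(process_output_tile(num_in_tiles=3, out_size=4))  # prints [6.0, 6.0, 6.0, 6.0]
```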
The computing module performs the forward-inference computation of the YOLOv2 detection network, including the convolution layers, BN layers, Leaky ReLU, and max pooling.
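For reference, the per-element computations of these stages can be written out. The Leaky ReLU slope of 0.1 and the BN epsilon are conventional YOLOv2/Darknet choices assumed here, not values stated in the text:

```python
import math

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization of one element."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def leaky_relu(x, slope=0.1):
    """Leaky ReLU as commonly configured in YOLOv2 (slope 0.1)."""
    return x if x > 0 else slope * x

def max_pool_2x2(window):
    """Max pooling over one 2x2 window, as a comparator tree would compute it."""
    return max(window)

x = batch_norm(2.0, gamma=1.0, beta=0.0, mean=0.0, var=1.0)
print(leaky_relu(x), leaky_relu(-x), max_pool_2x2([1, 5, 3, 2]))
```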
The above is only a preferred embodiment of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. A method for implementing a YOLOv2 detection network based on an FPGA, characterized in that the method comprises the following steps:
S1. dividing the YOLOv2 detection network into 23 layers and 2 operations according to four kinds of network blocks, namely CBL, CBLM, CL, and M network blocks, and two operations, wherein a CBL network block is a series connection of a two-dimensional convolution layer, a BN layer, and the activation function Leaky ReLU; a CBLM network block is a series connection of one CBL network block and one max-pooling layer; a CL network block is a series connection of a two-dimensional convolution layer and a linear activation function; an M network block is a single max-pooling layer; the first operation is Reorg and the second operation is Concat; and constructing processing units in the following way: one BN module, one Leaky ReLU module, and one max-pooling module are connected in cascade, the cascaded structure is connected in parallel with one Reorg module, and N_CI two-dimensional convolution modules with convolution kernel size K*K are connected in series before the parallel structure;
S2. dividing the original input image into multiple first-layer input image blocks of size Tci*Tr*Tc, dividing the weight parameters of each layer except the M network blocks into multiple weight blocks of size Tco*Tci*K*K, and storing all image blocks and weight blocks in an external memory;
S3. successively loading the input image blocks of the current layer and the corresponding weight blocks from the external memory into an input buffer module, and loading the image blocks and weight blocks from the input buffer module into m mutually independent processing units;
S4. performing operations in all processing units simultaneously: in each two-dimensional convolution module, the obtained convolution result is added to the intermediate result stored in an output buffer module to obtain an accumulation result; if the accumulation result is still an intermediate result, the BN module, Leaky ReLU module, and max-pooling module are disabled, and the accumulation result is stored in the output buffer module; if the accumulation result is a final result, judging whether a Reorg operation is needed: if so, the BN module, Leaky ReLU module, max-pooling module, and Reorg module are enabled, the two operation results are stored separately in the output buffer module, and the data in the output buffer module are finally written back to the external memory, so that the write-back position of the Reorg output in the external memory immediately follows the write-back position of the layer-21 output; otherwise, the BN module, Leaky ReLU module, and max-pooling module are enabled, the operation result is stored in the output buffer module, and the data in the output buffer module are finally written back to the external memory;
S5. taking the final result written back to the external memory as the input image block of the next layer;
S6. repeating steps S3 to S5 until all 23 layers have been computed.
2. The method according to claim 1, characterized in that the two-dimensional convolution module is implemented by a sliding window and matrix dot products; the max-pooling module is implemented by a sliding window and a comparator; the Leaky ReLU module is implemented by a fixed-point multiplier; and the Reorg operation is implemented with a single-port RAM according to the corresponding positions of the input data and output data.
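The address remapping behind such a single-port-RAM Reorg can be sketched as a pure index computation. This shows one common space-to-depth mapping with stride 2 (YOLOv2's value); the exact address order used by Darknet's reorg layer, and by the patent, may differ:

```python
def reorg(data, c, h, w, stride=2):
    """Space-to-depth: rearrange a (c, h, w) tensor stored flat in row-major
    order into (c*stride*stride, h//stride, w//stride), via pure index math,
    mirroring an address remapping between two single-port RAM layouts."""
    oh, ow = h // stride, w // stride
    out = [0] * (c * h * w)
    for ch in range(c):
        for y in range(h):
            for x in range(w):
                in_addr = (ch * h + y) * w + x
                oc = ch * stride * stride + (y % stride) * stride + (x % stride)
                out_addr = (oc * oh + y // stride) * ow + x // stride
                out[out_addr] = data[in_addr]
    return out

# 1 channel, 4x4 input -> 4 channels, 2x2 output; channel 0 collects the
# top-left sample of each 2x2 block.
print(reorg(list(range(16)), c=1, h=4, w=4))
```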
3. The method according to claim 1, characterized in that the block sizes are selected according to the computing resources of the FPGA and the size of the on-chip storage.
4. The method according to claim 1, characterized in that the input buffer module and the output buffer module both adopt double buffers: while one buffer block reads parameters from the DRAM, the other buffer block passes data to the computing units for processing.
5. The method according to claim 1, characterized in that, in the input buffer module, while one buffer block transfers data to the computing module for computation, the other buffer block fetches data from the external memory through the DMA, and the two are used alternately; in the output buffer module, while one buffer block stores intermediate operation results, the other buffer block, which stores final operation results, writes the results back to the external memory through the DMA, and the two are used alternately; each buffer module contains two buffer blocks, wherein each input buffer block has size Tci*Tco*K*K+4*Tco+Tci*Tr*Tc and each output buffer block has size Tco*Tr*Tc.
6. The method according to claim 1, characterized in that the method further comprises: after step S6, performing non-maximum suppression on the final output result to obtain the best prediction box for each target.
7. A system for implementing a YOLOv2 detection network based on an FPGA, characterized in that the system uses the method for implementing a YOLOv2 detection network based on an FPGA according to any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for implementing a YOLOv2 detection network based on an FPGA according to any one of claims 1 to 6 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910280748.4A CN110175670B (en) | 2019-04-09 | 2019-04-09 | Method and system for realizing YOLOv2 detection network based on FPGA |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910280748.4A CN110175670B (en) | 2019-04-09 | 2019-04-09 | Method and system for realizing YOLOv2 detection network based on FPGA |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110175670A true CN110175670A (en) | 2019-08-27 |
CN110175670B CN110175670B (en) | 2020-12-08 |
Family
ID=67689598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910280748.4A Active CN110175670B (en) | 2019-04-09 | 2019-04-09 | Method and system for realizing YOLOv2 detection network based on FPGA |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110175670B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717588A (en) * | 2019-10-15 | 2020-01-21 | 百度在线网络技术(北京)有限公司 | Apparatus and method for convolution operation |
CN111459877A (en) * | 2020-04-02 | 2020-07-28 | 北京工商大学 | FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method |
CN111860781A (en) * | 2020-07-10 | 2020-10-30 | 逢亿科技(上海)有限公司 | Convolutional neural network feature decoding system realized based on FPGA |
CN111967572A (en) * | 2020-07-10 | 2020-11-20 | 逢亿科技(上海)有限公司 | FPGA-based YOLO V3 and YOLO V3 Tiny network switching method |
CN113139519A (en) * | 2021-05-14 | 2021-07-20 | 陕西科技大学 | Target detection system based on fully programmable system on chip |
CN113495786A (en) * | 2020-03-19 | 2021-10-12 | 杭州海康威视数字技术股份有限公司 | Image convolution processing method and electronic equipment |
CN115049907A (en) * | 2022-08-17 | 2022-09-13 | 四川迪晟新达类脑智能技术有限公司 | FPGA-based YOLOV4 target detection network implementation method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108154229A (en) * | 2018-01-10 | 2018-06-12 | 西安电子科技大学 | Accelerate the image processing method of convolutional neural networks frame based on FPGA |
EP3352113A1 (en) * | 2017-01-18 | 2018-07-25 | Hitachi, Ltd. | Calculation system and calculation method of neural network |
WO2018184192A1 (en) * | 2017-04-07 | 2018-10-11 | Intel Corporation | Methods and systems using camera devices for deep channel and convolutional neural network images and formats |
CN108805274A (en) * | 2018-05-28 | 2018-11-13 | 重庆大学 | The hardware-accelerated method and system of Tiny-yolo convolutional neural networks based on FPGA |
CN109214504A (en) * | 2018-08-24 | 2019-01-15 | 北京邮电大学深圳研究院 | A kind of YOLO network forward inference accelerator design method based on FPGA |
CN109447893A (en) * | 2019-01-28 | 2019-03-08 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of convolutional neural networks FPGA accelerate in image preprocessing method and device |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3352113A1 (en) * | 2017-01-18 | 2018-07-25 | Hitachi, Ltd. | Calculation system and calculation method of neural network |
WO2018184192A1 (en) * | 2017-04-07 | 2018-10-11 | Intel Corporation | Methods and systems using camera devices for deep channel and convolutional neural network images and formats |
CN108154229A (en) * | 2018-01-10 | 2018-06-12 | 西安电子科技大学 | Accelerate the image processing method of convolutional neural networks frame based on FPGA |
CN108805274A (en) * | 2018-05-28 | 2018-11-13 | 重庆大学 | The hardware-accelerated method and system of Tiny-yolo convolutional neural networks based on FPGA |
CN109214504A (en) * | 2018-08-24 | 2019-01-15 | 北京邮电大学深圳研究院 | A kind of YOLO network forward inference accelerator design method based on FPGA |
CN109447893A (en) * | 2019-01-28 | 2019-03-08 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of convolutional neural networks FPGA accelerate in image preprocessing method and device |
Non-Patent Citations (5)
Title |
---|
DING, CW et al.: "REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs", Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays * |
DUY THANH NGUYEN et al.: "A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection", IEEE Transactions on Very Large Scale Integration (VLSI) Systems * |
WAI, Y.J. et al.: "Fixed Point Implementation of Tiny-Yolo-v2 using OpenCL on FPGA", International Journal of Advanced Computer Science and Applications * |
张霎轲 et al.: "Implementation of an improved TINY YOLO real-time vehicle detection algorithm based on small Zynq SoC hardware acceleration", Journal of Computer Applications * |
段秉环 et al.: "Research on deep neural network compression methods for embedded applications", Aeronautical Computing Technique * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11556614B2 (en) | 2019-10-15 | 2023-01-17 | Apollo Intelligent Driving Technology (Beijing) Co., Ltd. | Apparatus and method for convolution operation |
CN110717588B (en) * | 2019-10-15 | 2022-05-03 | 阿波罗智能技术(北京)有限公司 | Apparatus and method for convolution operation |
CN110717588A (en) * | 2019-10-15 | 2020-01-21 | 百度在线网络技术(北京)有限公司 | Apparatus and method for convolution operation |
CN113495786A (en) * | 2020-03-19 | 2021-10-12 | 杭州海康威视数字技术股份有限公司 | Image convolution processing method and electronic equipment |
CN113495786B (en) * | 2020-03-19 | 2023-10-13 | 杭州海康威视数字技术股份有限公司 | Image convolution processing method and electronic equipment |
CN111459877A (en) * | 2020-04-02 | 2020-07-28 | 北京工商大学 | FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method |
CN111459877B (en) * | 2020-04-02 | 2023-03-24 | 北京工商大学 | Winograd YOLOv2 target detection model method based on FPGA acceleration |
CN111860781A (en) * | 2020-07-10 | 2020-10-30 | 逢亿科技(上海)有限公司 | Convolutional neural network feature decoding system realized based on FPGA |
CN111967572A (en) * | 2020-07-10 | 2020-11-20 | 逢亿科技(上海)有限公司 | FPGA-based YOLO V3 and YOLO V3 Tiny network switching method |
CN113139519A (en) * | 2021-05-14 | 2021-07-20 | 陕西科技大学 | Target detection system based on fully programmable system on chip |
CN113139519B (en) * | 2021-05-14 | 2023-12-22 | 陕西科技大学 | Target detection system based on fully programmable system-on-chip |
CN115049907A (en) * | 2022-08-17 | 2022-09-13 | 四川迪晟新达类脑智能技术有限公司 | FPGA-based YOLOV4 target detection network implementation method |
CN115049907B (en) * | 2022-08-17 | 2022-10-28 | 四川迪晟新达类脑智能技术有限公司 | FPGA-based YOLOV4 target detection network implementation method |
Also Published As
Publication number | Publication date |
---|---|
CN110175670B (en) | 2020-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110175670A (en) | A kind of method and system for realizing YOLOv2 detection network based on FPGA | |
CN109815886B (en) | Pedestrian and vehicle detection method and system based on improved YOLOv3 | |
CN108108809B (en) | Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof | |
CN108985450B (en) | Vector processor-oriented convolution neural network operation vectorization method | |
CN109214504B (en) | FPGA-based YOLO network forward reasoning accelerator design method | |
CN108171317A (en) | A kind of data-reusing convolutional neural networks accelerator based on SOC | |
KR20180034557A (en) | Improving the performance of a two-dimensional array processor | |
CN106897143A (en) | Tile distribution to processing engines in a graphics processing system | |
US20210019594A1 (en) | Convolutional neural network accelerating device and method | |
KR20180123846A (en) | Logical-3d array reconfigurable accelerator for convolutional neural networks | |
CN110533022B (en) | Target detection method, system, device and storage medium | |
CN105825468A (en) | Graphics processing unit and graphics processing method thereof | |
CN113408423A (en) | Aquatic product target real-time detection method suitable for TX2 embedded platform | |
Li et al. | High throughput hardware architecture for accurate semi-global matching | |
CN113743505A (en) | Improved SSD target detection method based on self-attention and feature fusion | |
CN110598844A (en) | Parallel convolution neural network accelerator based on FPGA and acceleration method | |
CN109472734B (en) | Target detection network based on FPGA and implementation method thereof | |
JP2021515339A (en) | Machine perception and high density algorithm integrated circuits | |
CN112101113B (en) | Lightweight unmanned aerial vehicle image small target detection method | |
CN117217274A (en) | Vector processor, neural network accelerator, chip and electronic equipment | |
CN112149518A (en) | Pine cone detection method based on BEGAN and YOLOV3 models | |
CN108197613B (en) | Face detection optimization method based on deep convolution cascade network | |
CN115640833A (en) | Accelerator and acceleration method for sparse convolutional neural network | |
CN113902904A (en) | Lightweight network architecture system | |
CN111832336B (en) | Improved C3D video behavior detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||