CN110175670A - Method and system for implementing a YOLOv2 detection network on an FPGA - Google Patents

Method and system for implementing a YOLOv2 detection network on an FPGA

Info

Publication number
CN110175670A
CN110175670A (application CN201910280748.4A)
Authority
CN
China
Prior art keywords
module
block
result
buffer
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910280748.4A
Other languages
Chinese (zh)
Other versions
CN110175670B (en)
Inventor
何兆华
高常鑫
桑农
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910280748.4A
Publication of CN110175670A
Application granted
Publication of CN110175670B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Image Input (AREA)

Abstract

The invention discloses a method and system for implementing the YOLOv2 detection network on an FPGA, belonging to the field of intelligent hardware. The input feature maps and weight parameters of each layer of the detection network are partitioned into blocks, with the block size chosen according to the computing resources and on-chip storage of the FPGA. Parameters are read and computed in batches, intermediate results are buffered in on-chip storage, and the results are written back to DRAM only after the final result of the layer has been computed, overcoming the limitation that on-chip resources and memory bandwidth do not allow an entire layer to be computed at once. Because the on-chip memory of the FPGA is insufficient to store the model parameters, the invention uses double buffers and introduces pipelining between layers, which reduces the latency of reading model parameters from DRAM, greatly reduces the required cache space while improving the forward-inference speed of the algorithm, and achieves seamless buffering and processing of the input data stream, making full use of both memory space and logic resources.

Description

Method and system for implementing a YOLOv2 detection network on an FPGA
Technical field
The invention belongs to the field of intelligent hardware, and more particularly relates to a method and system for implementing the YOLOv2 detection network on an FPGA.
Background technique
Object detection is one of the foundational problems in computer vision and is widely used in scene understanding, autonomous driving, wearable devices, and similar scenarios. Object detection refers to locating and classifying every target in an image. With the arrival of the deep-learning era, object-detection algorithms based on convolutional neural networks have made significant progress; however, deploying these algorithms commercially remains an open problem. YOLOv2 is a detection network that performs well and meets real-time requirements; its simple structure and relatively small number of layers make it a good choice for bringing object detection into industrial deployment.
At present, both the training stage and the forward-inference stage of deep-learning object-detection algorithms are carried out on GPUs or CPUs. Although some algorithms achieve real-time processing on a GPU, the power consumption of a GPU is high, so deploying these algorithms in a low-power embedded system using a GPU is unsatisfactory; a CPU-based deployment, on the other hand, has difficulty meeting real-time requirements. A platform that strikes a compromise between power consumption and real-time performance is therefore needed.
Current deep-learning object-detection algorithms are built on convolutional neural networks, so when porting them to an FPGA platform, researchers at home and abroad mainly focus on how to run convolutional neural networks at high speed on the FPGA. Each layer of the YOLOv2 detection network has a large number of parameters and involves a large amount of computation; the on-chip memory of an FPGA is insufficient to cache all the parameters of an entire layer, and the limited logic resources are likewise insufficient to support all the operations of an entire layer.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to solve the problem that, when porting YOLOv2 onto an FPGA, the on-chip memory of the FPGA is insufficient to cache all the parameters of an entire layer, and the limited logic resources are likewise insufficient to support all the computation of an entire layer.
To achieve the above object, in a first aspect, an embodiment of the invention provides a method for implementing the YOLOv2 detection network on an FPGA, the method comprising the following steps:
S1. According to four kinds of network blocks, namely CBL blocks, CBLM blocks, CL blocks, and M blocks, and two kinds of operations, the YOLOv2 detection network is divided into 23 layers and 2 operations, where a CBL block is a two-dimensional convolution layer, a BN layer, and the Leaky ReLU activation function connected in series; a CBLM block is one CBL block followed by one max-pooling layer; a CL block is a two-dimensional convolution layer followed by a linear activation function; an M block is a single max-pooling layer; the first operation is Reorg and the second operation is Concat. A processing unit is constructed as follows: one BN module, one Leaky ReLU module, and one max-pooling module are connected in series, the series structure is connected in parallel with one Reorg module, and N_CI two-dimensional convolution modules with kernel size K*K are connected in series before the parallel structure;
S2. The original input image is divided into first-layer input image blocks of size Tci*Tr*Tc, the weight parameters of every layer except the M blocks are divided into weight blocks of size Tco*Tci*K*K, and all image blocks and weight blocks are stored in an external memory;
S3. The input image blocks of the current layer and the corresponding weight blocks are loaded in turn from the external memory into an input buffer module, and from the input buffer module the image blocks and weight blocks are loaded into m mutually independent processing units;
S4. All processing units operate simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in an output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the BN module, Leaky ReLU module, and max-pooling module are disabled, and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, it is judged whether a Reorg operation is needed: if so, the BN module, Leaky ReLU module, max-pooling module, and Reorg module are enabled, the two result streams are stored separately in the output buffer module, and the data in the output buffer module are finally written back to the external memory, with the write-back position of the Reorg output in the external memory immediately following the write-back position of the output of layer 21; otherwise, the BN module, Leaky ReLU module, and max-pooling module are enabled, the result is stored in the output buffer module, and the data in the output buffer module are finally written back to the external memory;
S5. The final result written back to the external memory is used as the input image blocks of the next layer;
S6. Steps S3 to S5 are repeated until all 23 layers have been computed.
Specifically, the two-dimensional convolution module is implemented with a sliding window and matrix dot products; the max-pooling module is implemented with a sliding window and comparators; the Leaky ReLU module is implemented with a fixed-point multiplier; and the Reorg operation is implemented with a single-port RAM according to the positional relationship between the input data and the output data.
Specifically, the block size is chosen according to the computing resources and on-chip storage of the FPGA.
Specifically, the input buffer module and the output buffer module both use double buffers: while one buffer block reads parameters from the DRAM, the other buffer block passes data to the computing unit for processing.
Specifically, in the input buffer module, while one buffer delivers data to the computing module for computation, the other buffer fetches data from the external memory through DMA, the two being used alternately. In the output buffer module, while one buffer stores intermediate results, the other buffer, which holds final results, writes them back to the external memory through DMA, the two being used alternately. Each buffer module contains two buffer blocks; an input buffer block has size Tci*Tco*K*K + 4*Tco + Tci*Tr*Tc, and an output buffer block has size Tco*Tr*Tc.
Specifically, the method further comprises: after step S6, performing non-maximum suppression on the final output to obtain the best prediction box for each target.
In a second aspect, an embodiment of the invention provides a system for implementing the YOLOv2 detection network on an FPGA, the system using the method for implementing the YOLOv2 detection network on an FPGA described in the first aspect above.
In a third aspect, an embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for implementing the YOLOv2 detection network on an FPGA described in the first aspect above.
In general, compared with the prior art, the above technical solutions conceived by the invention have the following beneficial effects:
1. The invention partitions the input feature maps and weight parameters of each layer of the YOLOv2 detection network into blocks, and chooses the block size according to the computing resources and on-chip storage of the FPGA. Parameters are read and computed in batches, intermediate results are buffered in on-chip storage, and the results are written back to DRAM only after the final result of the layer has been computed, overcoming the limitation that on-chip resources and memory bandwidth do not allow an entire layer to be computed at once.
2. Because the on-chip memory of the FPGA is insufficient to store the model parameters of the algorithm, the invention uses double buffers and introduces pipelining between layers: while one buffer block reads parameters from the DRAM, the other buffer block passes data to the computing unit for processing. This reduces the latency caused by reading model parameters from DRAM each time, greatly reduces the required cache space while improving the forward-inference speed of the algorithm, and achieves seamless buffering and processing of the input data stream, making full use of memory space and logic resources to the greatest extent.
Detailed description of the invention
Fig. 1 is a flowchart of a method for implementing the YOLOv2 detection network on an FPGA according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the layered structure of the YOLOv2 network according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the processing unit structure according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the implementation of the two-dimensional convolution module according to an embodiment of the invention;
Fig. 5 is a schematic diagram of a system for implementing the YOLOv2 detection network on an FPGA according to an embodiment of the invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to illustrate the invention and are not intended to limit it.
As shown in Fig. 1, the invention proposes a method for implementing the YOLOv2 detection network on an FPGA, the method comprising the following steps:
S1. According to four kinds of network blocks, namely CBL blocks, CBLM blocks, CL blocks, and M blocks, and two kinds of operations, the YOLOv2 detection network is divided into 23 layers and 2 operations, where a CBL block is a two-dimensional convolution layer, a BN layer, and the Leaky ReLU activation function connected in series; a CBLM block is one CBL block followed by one max-pooling layer; a CL block is a two-dimensional convolution layer followed by a linear activation function; an M block is a single max-pooling layer; the first operation is Reorg and the second operation is Concat. A processing unit is constructed as follows: one BN module, one Leaky ReLU module, and one max-pooling module are connected in series, the series structure is connected in parallel with one Reorg module, and N_CI two-dimensional convolution modules with kernel size K*K are connected in series before the parallel structure;
S2. The original input image is divided into first-layer input image blocks of size Tci*Tr*Tc, the weight parameters of every layer except the M blocks are divided into weight blocks of size Tco*Tci*K*K, and all image blocks and weight blocks are stored in an external memory;
S3. The input image blocks of the current layer and the corresponding weight blocks are loaded in turn from the external memory into an input buffer module, and from the input buffer module the image blocks and weight blocks are loaded into m mutually independent processing units;
S4. All processing units operate simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in an output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the BN module, Leaky ReLU module, and max-pooling module are disabled, and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, it is judged whether a Reorg operation is needed: if so, the BN module, Leaky ReLU module, max-pooling module, and Reorg module are enabled, the two result streams are stored separately in the output buffer module, and the data in the output buffer module are finally written back to the external memory, with the write-back position of the Reorg output in the external memory immediately following the write-back position of the output of layer 21; otherwise, the BN module, Leaky ReLU module, and max-pooling module are enabled, the result is stored in the output buffer module, and the data in the output buffer module are finally written back to the external memory;
S5. The final result written back to the external memory is used as the input image blocks of the next layer;
S6. Steps S3 to S5 are repeated until all 23 layers have been computed.
Step S1. According to four kinds of network blocks (CBL, CBLM, CL, and M) and two kinds of operations, the YOLOv2 detection network is divided into 23 layers and 2 operations. A CBL block is a two-dimensional convolution layer, a BN layer, and the Leaky ReLU activation function connected in series; a CBLM block is one CBL block followed by one max-pooling layer; a CL block is a two-dimensional convolution layer followed by a linear activation function; an M block is a single max-pooling layer. The first operation is Reorg and the second operation is Concat. A processing unit is constructed as follows: one BN module, one Leaky ReLU module, and one max-pooling module are connected in series, the series structure is connected in parallel with one Reorg module, and N_CI two-dimensional convolution modules with kernel size K*K are connected in series before the parallel structure.
A convolutional neural network consists mainly of convolution layers, BN layers, pooling layers, and activation functions. Convolution layers are computation-intensive and use weight sharing: computation-intensive means that essentially all of the computation of a convolutional neural network lies in the convolution layers, and weight sharing means that different locations of the input feature map share the weights of the convolution kernel. The input data of a convolution layer is three-dimensional, its convolution parameters are four-dimensional, and its output data is three-dimensional; its core module is the two-dimensional convolution.
As shown in Fig. 2, in order to introduce a pipeline between the convolution layer, BN layer, activation function, and pooling layer, the invention divides the YOLOv2 network layers into four types. The first is the combination of a convolution layer, a BN layer, and the Leaky ReLU activation function (abbreviated CBL); the second is a convolution layer, a BN layer, Leaky ReLU, and max pooling (abbreviated CBLM); the third is the combination of a convolution layer and a linear activation function (abbreviated CL); the computation of these three layer types is concentrated in the convolution layer. The fourth type is a standalone max-pooling layer (abbreviated M). After YOLOv2 is divided according to these layer types, the number of network layers is 23. In addition, two operation types are distinguished: the rearrangement of feature maps (Reorg) and the stacking of feature maps (Concat).
The computation of the YOLOv2 detection network is concentrated in the convolution layers. The computation of a convolution layer exhibits four kinds of independence: the individual weights within one two-dimensional convolution kernel are independent of each other, the two-dimensional convolutions of different input channels are independent of each other, different convolution windows within the same feature map are independent of each other, and the two-dimensional convolutions of different output channels are independent of each other. These four kinds of independence yield four kinds of parallelism in the convolution process.
The parallelism of the pooling layer is very similar to that of the convolution layer: parallelism can be introduced across input channels, within a pooling window, and across different pooling windows of the same feature map. In the pooling layer, the parallelism within a pooling window is 4, the parallelism across different pooling windows of the same feature map is M, and the parallelism introduced across input channels is N, so the total parallelism is 4 × M × N.
In order to introduce parallelism across input channels, each processing unit consists of multiple two-dimensional convolvers. To increase the processing speed of the design, the activation-function module and the max-pooling module can be pipelined once the final convolution result is obtained. As shown in Fig. 3, a processing unit contains N_CI two-dimensional convolution modules, one BN module, one Leaky ReLU module, one max-pooling module, and one Reorg module. The BN module, Leaky ReLU module, and max-pooling module are connected in series, the series structure is connected in parallel with the Reorg module, and the N_CI two-dimensional convolution modules are connected in series before the parallel structure. N_CI is preferably 4.
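For illustration only (this sketch is not part of the patent text), the per-pixel behaviour of such a processing unit can be written out in C++; the floating-point arithmetic, the folded BN scale/bias form, and the function and parameter names are assumptions made for the sketch. Max pooling and Reorg follow this stage and are sketched separately below.

```cpp
// Illustrative sketch: the per-pixel datapath of one processing unit,
// assuming N_CI = 4 input channels per pass and K = 3.
#include <array>

constexpr int N_CI = 4;  // input channels handled per processing unit (preferred value in the text)
constexpr int K    = 3;  // convolution kernel size (assumed)

// One output pixel: sum the N_CI K*K window dot-products, accumulate onto the
// partial sum held in the output buffer, and, only when the sum is final,
// apply the folded BN transform and Leaky ReLU.
float pe_pixel(const std::array<std::array<float, K * K>, N_CI>& windows,
               const std::array<std::array<float, K * K>, N_CI>& kernels,
               float partial_sum,
               bool enable_post,                 // false while the result is still intermediate
               float bn_scale, float bn_bias)    // BN parameters folded per output channel (assumed form)
{
    float acc = partial_sum;
    for (int ci = 0; ci < N_CI; ++ci)            // the N_CI two-dimensional convolvers
        for (int i = 0; i < K * K; ++i)
            acc += windows[ci][i] * kernels[ci][i];

    if (!enable_post)
        return acc;                              // intermediate result goes back to the output buffer

    float y = bn_scale * acc + bn_bias;          // BN module
    return (y > 0.0f) ? y : 0.1f * y;            // Leaky ReLU module, negative slope 0.1
}
```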
The convolution layer conv generates its sliding-window circuit directly from FIFOs (first in, first out) and shift registers, and is realized by the sliding-window circuit together with a matrix dot-product circuit; the pooling layer max pooling likewise generates its sliding-window circuit from FIFOs and shift registers, and is realized by the sliding-window circuit and comparators; the activation function needs only a fixed-point multiplier; and the Reorg layer can be realized with a single-port RAM according to the positional relationship between each pixel of the input feature map and each pixel of the output feature map.
Assume the two-dimensional convolution kernel has size K × K and the input image has size W × H; a sliding window can then be realized with K FIFOs of depth greater than W together with shift registers. In the invention K takes two values, 3 and 1. Taking a 3 × 3 kernel and a 5 × 5 input image as an example, as shown in Fig. 4, three FIFOs of depth greater than 5 and a 3 × 3 shift register realize a 3 × 3 sliding window.
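A minimal C++ sketch of such a line-buffer-based sliding window, assuming K = 3, a row-major pixel stream, and the class and member names invented here:

```cpp
// Illustrative sketch: a 3x3 sliding-window generator over a W-wide row stream,
// built from two line buffers (the FIFOs of depth > W mentioned above) and a 3x3 shift register.
#include <array>
#include <vector>

constexpr int KW = 3;  // window size (assumed)

struct SlidingWindow3x3 {
    int W;                                           // image width; FIFO depth must exceed it
    std::vector<float> line0, line1;                 // two line buffers acting as FIFOs
    std::array<std::array<float, KW>, KW> win{};     // 3x3 shift register
    int col = 0;

    explicit SlidingWindow3x3(int width) : W(width), line0(width, 0.f), line1(width, 0.f) {}

    // Push one pixel of the row-major input stream; the window then holds the 3x3
    // neighbourhood ending at the current pixel (valid once two full rows plus two
    // extra pixels have been pushed).
    void push(float px) {
        float from_line1 = line1[col];               // pixel two rows above
        float from_line0 = line0[col];               // pixel one row above
        line1[col] = from_line0;                     // shift the line buffers down
        line0[col] = px;
        for (int r = 0; r < KW; ++r)                 // shift the 3x3 register left
            for (int c = 0; c < KW - 1; ++c)
                win[r][c] = win[r][c + 1];
        win[0][KW - 1] = from_line1;                 // oldest row
        win[1][KW - 1] = from_line0;
        win[2][KW - 1] = px;                         // newest row
        col = (col + 1) % W;
    }
};
```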
Max pooling consists mainly of generating the pooling window and comparing pixel values. The pooling layer is realized mainly with a sliding window and comparators. The pooling window is generated in the same way as the convolution window described above, and the maximum value inside the window is obtained with a comparator circuit. The pooling kernel size is 2*2.
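A corresponding sketch of the 2*2 pooling window reduced by a small comparator tree (the array interface is an assumption):

```cpp
// Illustrative sketch: 2x2 max pooling as a two-stage comparator tree.
#include <algorithm>
#include <array>

float maxpool_2x2(const std::array<float, 4>& window) {
    float a = std::max(window[0], window[1]);  // first-stage comparators
    float b = std::max(window[2], window[3]);
    return std::max(a, b);                     // second-stage comparator
}
```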
The activation function used by the YOLOv2 detection network is Leaky ReLU, whose mathematical expression is f(x) = x for x > 0 and f(x) = 0.1x for x ≤ 0.
Since the coefficient is 0.1, only the non-saturating case needs to be considered when the bit width of the result is truncated, and the activation function can be realized with nothing more than a fixed-point multiplier.
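A possible fixed-point realization, assuming a Q8.8 format and 16-bit data, which the patent does not specify:

```cpp
// Illustrative sketch: Leaky ReLU with slope 0.1 on 16-bit fixed-point data.
// The Q8.8 format and the truncation choice are assumptions; the text only
// states that a fixed-point multiplier is sufficient.
#include <cstdint>

constexpr int     FRAC_BITS = 8;                                                  // Q8.8 (assumed)
constexpr int32_t SLOPE_Q   = static_cast<int32_t>(0.1 * (1 << FRAC_BITS) + 0.5); // 0.1 in Q8.8

int16_t leaky_relu_fixed(int16_t x) {
    if (x >= 0)
        return x;                                       // positive branch passes through unchanged
    int32_t prod = static_cast<int32_t>(x) * SLOPE_Q;   // fixed-point multiply
    return static_cast<int16_t>(prod >> FRAC_BITS);     // rescale back to Q8.8 (arithmetic shift assumed)
}
```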
The Reorg layer rearranges the input feature map according to the relationship between input and output positions. A Reorg layer expands one input channel into four output channels, and the output feature map is 1/4 the size of the input. To realize the Reorg layer on the FPGA, the positional correspondence between each pixel of the output feature map and each pixel of the input feature map is first derived:
col = (count - 1) % W
row = ((count - 1) / W) % H
in_index = col + W * (row + H * Ci)
where count is the index of the input pixel, Ci is the current channel, W and H are the width and height of the input feature map, (row, col) is the position of the pixel in the input feature map, Fin is the input pixel stream, and Fout is the output pixel stream. With this relationship, the Reorg layer can be realized with a single-port RAM. In a concrete implementation, a counter counts the input data; from the count the channel Ci and the position (row, col) in the input feature map are recovered, and since the width W and height H of the input feature map are known, the address of the current input data in the RAM can be computed from the above formulas.
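The address computation can be sketched as follows; the recovery of Ci and row from count follows the row-major relation above, while the stride-2 mapping to the four output channels is the usual YOLOv2 Reorg convention and is an assumption beyond the formulas given in the text:

```cpp
// Illustrative sketch: recovering (Ci, row, col) from the running input-pixel counter
// and mapping it to the single-port RAM addresses used by the Reorg layer.
#include <cstdint>

struct ReorgAddr {
    int W, H;   // input feature-map width and height

    // count is 1-based and runs over all pixels of all channels of the input stream.
    uint32_t input_index(uint32_t count) const {
        uint32_t idx = count - 1;
        uint32_t ci  = idx / (W * H);            // current input channel Ci
        uint32_t row = (idx / W) % H;            // row inside the feature map
        uint32_t col = idx % W;                  // col = (count-1) % W, as in the text
        return col + W * (row + H * ci);         // in_index = col + W*(row + H*Ci)
    }

    // One input channel expands to four output channels of size (W/2) x (H/2);
    // this particular mapping is an assumption for the sketch.
    uint32_t output_index(uint32_t count) const {
        uint32_t idx = count - 1;
        uint32_t ci  = idx / (W * H);
        uint32_t row = (idx / W) % H;
        uint32_t col = idx % W;
        uint32_t co   = ci * 4 + (row % 2) * 2 + (col % 2);  // which of the 4 output channels
        uint32_t orow = row / 2, ocol = col / 2;
        uint32_t oW   = W / 2,   oH   = H / 2;
        return ocol + oW * (orow + oH * co);
    }
};
```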
Step S2. The original input image is divided into first-layer input image blocks of size Tci*Tr*Tc, the weight parameters of every layer except the M blocks are divided into weight blocks of size Tco*Tci*K*K, and all image blocks and weight blocks are stored in the external memory.
Since both on-chip computing resources and on-chip storage are limited, deploying the YOLOv2 detection network on the FPGA requires dividing each of its layers. The convolution layer, BN layer, Leaky ReLU, and pooling layer are treated as one layer, so that BN, Leaky ReLU, and max pooling can be executed as soon as the convolution result is obtained, realizing a pipeline. Dividing each layer therefore amounts to dividing the convolution layer, after which the input feature map and weight parameters of each layer are partitioned into blocks. The block size is chosen according to the computing resources and on-chip storage of the FPGA. Steps S1 and S2 prepare the input data for realizing the YOLOv2 network on the FPGA.
To realize a convolution layer on the FPGA, the invention divides a three-dimensional convolution with input dimensions CH_IN × Rin × Cin, four-dimensional kernel dimensions CH_OUT × CH_IN × K × K, and output dimensions CH_OUT × R × C into multiple smaller convolutions with input dimensions Tci × TRin × TCin, kernel dimensions Tco × Tci × K × K, and output dimensions Tco × Tr × Tc, where Rin × Cin is the size of the input feature map, R × C is the size of the output feature map, and the relationship between the two is given by the standard convolution output-size formula. Each resulting convolution tile contains Tco × Tci two-dimensional convolutions.
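The tiling can be pictured as the following plain C++ loop nest (an illustration, not the HLS code of the patent); stride-1 valid convolution, a zero-initialized output, and the flat row-major layout are assumptions made for the sketch:

```cpp
// Illustrative sketch: the tiled convolution implied by the block sizes above.
// The output tile stays resident in the on-chip output buffer across the ci0
// loop and is only written back once it is final.  out must be zero-initialized
// by the caller, since it accumulates across input-channel blocks.
#include <algorithm>
#include <vector>

void conv_layer_tiled(const std::vector<float>& in,   // CH_IN * Rin * Cin
                      const std::vector<float>& w,    // CH_OUT * CH_IN * K * K
                      std::vector<float>& out,        // CH_OUT * R * C, zero-initialized
                      int CH_IN, int CH_OUT, int R, int C, int K,
                      int Tci, int Tco, int Tr, int Tc)
{
    const int Rin = R + K - 1, Cin = C + K - 1;       // valid convolution, stride 1 (assumed)
    for (int co0 = 0; co0 < CH_OUT; co0 += Tco)
      for (int r0 = 0; r0 < R; r0 += Tr)
        for (int c0 = 0; c0 < C; c0 += Tc)
          for (int ci0 = 0; ci0 < CH_IN; ci0 += Tci)  // one image block + weight block per iteration
            for (int co = co0; co < std::min(co0 + Tco, CH_OUT); ++co)
              for (int ci = ci0; ci < std::min(ci0 + Tci, CH_IN); ++ci)
                for (int r = r0; r < std::min(r0 + Tr, R); ++r)
                  for (int c = c0; c < std::min(c0 + Tc, C); ++c)
                    for (int kr = 0; kr < K; ++kr)
                      for (int kc = 0; kc < K; ++kc)
                        out[(co * R + r) * C + c] +=
                            in[(ci * Rin + (r + kr)) * Cin + (c + kc)] *
                            w[((co * CH_IN + ci) * K + kr) * K + kc];
}
```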
Step S3. The input image blocks of the current layer and the corresponding weight blocks are loaded in turn from the external memory into the input buffer module, and from the input buffer module the image blocks and weight blocks are loaded into m mutually independent processing units.
In the computation of each layer, the whole layer cannot be computed at once because of the limits of on-chip resources and memory bandwidth; the parameters can only be read and computed in batches. The controller loads pixel values from the external memory according to the pixel coordinates, and loads the corresponding parameters from the external memory according to the current input channel and output channel. In this embodiment of the invention, m = 16 is preferred. To introduce parallelism across output channels, the invention uses multiple processing units to compute the two-dimensional convolutions of different output channels simultaneously.
Step S4. All processing units operate simultaneously. In each two-dimensional convolution module, the convolution result is added to the intermediate result stored in the output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the BN module, Leaky ReLU module, and max-pooling module are disabled, and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, it is judged whether a Reorg operation is needed: if so, the BN module, Leaky ReLU module, max-pooling module, and Reorg module are enabled, the two result streams are stored separately in the output buffer module, and the data in the output buffer module are finally written back to the external memory, with the write-back position of the Reorg output immediately following the write-back position of the output of layer 21; otherwise, the BN module, Leaky ReLU module, and max-pooling module are enabled, the result is stored in the output buffer module, and the data in the output buffer module are finally written back to the external memory.
Intermediate results are buffered in on-chip storage and are written back to DRAM only after the final result of the layer has been computed. To reduce the latency of reading model parameters from DRAM each time, and to achieve seamless buffering and processing of the input data stream, the invention uses a double-buffer mechanism: while one buffer block reads parameters from the DRAM, the other buffer block passes data to the computing unit for processing.
The judgement of whether a Reorg operation is needed is what realizes the Concat operation.
The input buffer module and the output buffer module both use double buffers, i.e., ping-pong operation. In the input buffer module, while one buffer delivers data to the computing module for computation, the other fetches data from the external memory through DMA; the two are used alternately. In the output buffer module, while one buffer stores intermediate results, the other, which holds final results, writes them back to the external memory through DMA; the two are used alternately. The double buffers achieve seamless buffering and processing of the data. Each buffer module contains two buffer blocks; an input buffer block has size Tci*Tco*K*K + 4*Tco + Tci*Tr*Tc, and an output buffer block has size Tco*Tr*Tc.
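A behavioural sketch of the ping-pong mechanism, with the DMA transfer modelled as a memory copy and all names assumed:

```cpp
// Illustrative sketch: a double buffer whose two blocks alternate between
// "being filled by DMA" and "being consumed by the compute units".
#include <algorithm>
#include <cstddef>
#include <vector>

struct DoubleBuffer {
    std::vector<float> buf[2];      // the two on-chip buffer blocks
    int fill = 0;                   // index of the block currently being filled
    explicit DoubleBuffer(std::size_t n) { buf[0].resize(n); buf[1].resize(n); }

    // "DMA" the next block from external memory into the fill buffer.
    void load(const float* ext, std::size_t n) { std::copy(ext, ext + n, buf[fill].begin()); }

    // The other block is handed to the compute units while the fill block loads.
    const std::vector<float>& compute_block() const { return buf[1 - fill]; }

    void swap() { fill = 1 - fill; }  // alternate the roles each iteration
};

// Typical usage per block iteration (in hardware the two calls overlap in time):
//   db.load(next_block_ptr, block_size);   // DMA into one block ...
//   process(db.compute_block());           // ... while the other is computed on
//   db.swap();
```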
Step S5. The final result written back to the external memory is used as the input image blocks of the next layer.
The output image block of the current layer is the input image block of the next layer.
Step S6. Steps S3 to S5 are repeated until all 23 layers have been computed.
The FPGA platform cannot support all operations of the YOLOv2 detection network at the same time, so the layers can only be computed one after another. The final output is subjected to non-maximum suppression to obtain the best prediction box for each target.
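A conventional greedy NMS, as would run on the CPU side; the Box layout and the 0.45 IoU threshold are assumptions, since the text only states that NMS selects the best prediction box per target:

```cpp
// Illustrative sketch: greedy non-maximum suppression over the decoded boxes.
#include <algorithm>
#include <vector>

struct Box { float x, y, w, h, score; int cls; };

static float iou(const Box& a, const Box& b) {
    float x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
    float x2 = std::min(a.x + a.w, b.x + b.w), y2 = std::min(a.y + a.h, b.y + b.h);
    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    return inter / (a.w * a.h + b.w * b.h - inter);
}

std::vector<Box> nms(std::vector<Box> boxes, float thr = 0.45f) {   // threshold assumed
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& b : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (k.cls == b.cls && iou(k, b) > thr) { suppressed = true; break; }
        if (!suppressed) kept.push_back(b);   // highest-scoring box per target survives
    }
    return kept;
}
```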
As shown in Fig. 5, a system for implementing the YOLOv2 detection network on an FPGA comprises a processing system PS and programmable logic PL.
The processing system PS comprises a central processing unit and an external memory. The central processing unit is responsible for scheduling the forward-inference process of the YOLOv2 detection network and configuring the DMA, and for performing non-maximum suppression on the final operation result of the YOLOv2 detection network to obtain the class of each target and its position in the image. The external memory stores the model parameters and image data of the YOLOv2 detection network.
The programmable logic PL consists of six parts: a DMA module (Direct Memory Access), a controller, an input buffer module, an output buffer module, a decoding module, and a computing module.
The DMA module transfers data and instructions between the PS side and the on-chip buffers on the PL side.
The controller fetches instructions from the external memory and schedules the input buffer module, output buffer module, decoding module, and computing module. Specifically, the controller loads the input image blocks of the current layer and the corresponding weight blocks in turn from the external memory into the input buffer module, and loads the image blocks and weight blocks from the input buffer module into m mutually independent processing units. All processing units operate simultaneously: in each two-dimensional convolution module, the convolution result is added to the intermediate result stored in the output buffer module to obtain an accumulation result. If the accumulation result is still an intermediate result, the BN module, Leaky ReLU module, and max-pooling module are disabled, and the accumulation result is stored in the output buffer module. If the accumulation result is a final result, it is judged whether a Reorg operation is needed: if so, the BN module, Leaky ReLU module, max-pooling module, and Reorg module are enabled, the two result streams are stored separately in the output buffer module, and the data in the output buffer module are finally written back to the external memory, with the write-back position of the Reorg output immediately following the write-back position of the output of layer 21; otherwise, the BN module, Leaky ReLU module, and max-pooling module are enabled, the result is stored in the output buffer module, and the data in the output buffer module are written back to the external memory. The final result written back to the external memory serves as the input image blocks of the next layer.
The computing module is responsible for the forward-inference computation of the YOLOv2 detection network, including the computation of the convolution layers, BN layers, Leaky ReLU, and max pooling.
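The division of labour between PS and PL can be summarized by the following host-side scheduling sketch; configure_dma, start_accelerator, and wait_done are hypothetical placeholders for a real DMA/accelerator driver, not an API defined in the patent:

```cpp
// Illustrative sketch: the PS side walks the 23 layers one after another; within a
// layer it points the DMA at each image/weight block in turn and lets the PL compute.
#include <cstdint>
#include <vector>

// Hypothetical driver hooks; on a real platform these would wrap the DMA register
// writes and the accelerator start/done handshake.
void configure_dma(uint64_t in_addr, uint64_t w_addr, uint64_t out_addr, uint32_t block) { /* stub */ }
void start_accelerator() { /* stub */ }
void wait_done()         { /* stub */ }

struct LayerDesc {
    uint64_t in_addr, w_addr, out_addr;  // DRAM addresses of this layer's image, weight, and output blocks
    uint32_t n_blocks;                   // number of block iterations needed for the layer
};

void run_network(const std::vector<LayerDesc>& layers) {
    for (const LayerDesc& layer : layers) {          // 23 layers, computed sequentially
        for (uint32_t b = 0; b < layer.n_blocks; ++b) {
            configure_dma(layer.in_addr, layer.w_addr, layer.out_addr, b);
            start_accelerator();
            wait_done();
        }
    }
    // the final feature map is then post-processed with non-maximum suppression on the CPU
}
```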
The above are only preferred embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that can readily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (8)

1. A method for implementing a YOLOv2 detection network on an FPGA, characterized in that the method comprises the following steps:
S1. according to four kinds of network blocks, namely CBL blocks, CBLM blocks, CL blocks, and M blocks, and two kinds of operations, dividing the YOLOv2 detection network into 23 layers and 2 operations, wherein a CBL block is a two-dimensional convolution layer, a BN layer, and the Leaky ReLU activation function connected in series; a CBLM block is one CBL block followed by one max-pooling layer; a CL block is a two-dimensional convolution layer followed by a linear activation function; an M block is a single max-pooling layer; the first operation is Reorg and the second operation is Concat; and constructing a processing unit as follows: one BN module, one Leaky ReLU module, and one max-pooling module are connected in series, the series structure is connected in parallel with one Reorg module, and N_CI two-dimensional convolution modules with kernel size K*K are connected in series before the parallel structure;
S2. dividing the original input image into first-layer input image blocks of size Tci*Tr*Tc, dividing the weight parameters of every layer except the M blocks into weight blocks of size Tco*Tci*K*K, and storing all image blocks and weight blocks in an external memory;
S3. loading the input image blocks of the current layer and the corresponding weight blocks in turn from the external memory into an input buffer module, and loading the image blocks and weight blocks from the input buffer module into m mutually independent processing units;
S4. operating all processing units simultaneously, wherein in each two-dimensional convolution module the convolution result is added to the intermediate result stored in an output buffer module to obtain an accumulation result; if the accumulation result is still an intermediate result, disabling the BN module, the Leaky ReLU module, and the max-pooling module, and storing the accumulation result in the output buffer module; if the accumulation result is a final result, judging whether a Reorg operation is needed: if so, enabling the BN module, the Leaky ReLU module, the max-pooling module, and the Reorg module, storing the two result streams separately in the output buffer module, and finally writing the data in the output buffer module back to the external memory, with the write-back position of the Reorg output in the external memory immediately following the write-back position of the output of layer 21; otherwise, enabling the BN module, the Leaky ReLU module, and the max-pooling module, storing the result in the output buffer module, and finally writing the data in the output buffer module back to the external memory;
S5. using the final result written back to the external memory as the input image blocks of the next layer;
S6. repeating steps S3 to S5 until all 23 layers have been computed.
2. The method of claim 1, characterized in that the two-dimensional convolution module is implemented with a sliding window and matrix dot products; the max-pooling module is implemented with a sliding window and comparators; the Leaky ReLU module is implemented with a fixed-point multiplier; and the Reorg operation is implemented with a single-port RAM according to the positional relationship between the input data and the output data.
3. The method of claim 1, characterized in that the block size is chosen according to the computing resources and on-chip storage of the FPGA.
4. The method of claim 1, characterized in that the input buffer module and the output buffer module both use double buffers: while one buffer block reads parameters from the DRAM, the other buffer block passes data to the computing unit for processing.
5. The method of claim 1, characterized in that, in the input buffer module, while one buffer delivers data to the computing module for computation, the other buffer fetches data from the external memory through DMA, the two being used alternately; in the output buffer module, while one buffer stores intermediate results, the other buffer, which holds final results, writes them back to the external memory through DMA, the two being used alternately; and each buffer module contains two buffer blocks, an input buffer block having size Tci*Tco*K*K + 4*Tco + Tci*Tr*Tc and an output buffer block having size Tco*Tr*Tc.
6. The method of claim 1, characterized in that the method further comprises: after step S6, performing non-maximum suppression on the final output to obtain the best prediction box for each target.
7. A system for implementing a YOLOv2 detection network on an FPGA, characterized in that the system uses the method for implementing a YOLOv2 detection network on an FPGA according to any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for implementing a YOLOv2 detection network on an FPGA according to any one of claims 1 to 6 is implemented.
CN201910280748.4A 2019-04-09 2019-04-09 Method and system for realizing YOLOv2 detection network based on FPGA Active CN110175670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910280748.4A CN110175670B (en) 2019-04-09 2019-04-09 Method and system for realizing YOLOv2 detection network based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910280748.4A CN110175670B (en) 2019-04-09 2019-04-09 Method and system for realizing YOLOv2 detection network based on FPGA

Publications (2)

Publication Number Publication Date
CN110175670A true CN110175670A (en) 2019-08-27
CN110175670B CN110175670B (en) 2020-12-08

Family

ID=67689598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910280748.4A Active CN110175670B (en) 2019-04-09 2019-04-09 Method and system for realizing YOLOv2 detection network based on FPGA

Country Status (1)

Country Link
CN (1) CN110175670B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network
WO2018184192A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Methods and systems using camera devices for deep channel and convolutional neural network images and formats
CN108154229A (en) * 2018-01-10 2018-06-12 西安电子科技大学 Accelerate the image processing method of convolutional neural networks frame based on FPGA
CN108805274A (en) * 2018-05-28 2018-11-13 重庆大学 The hardware-accelerated method and system of Tiny-yolo convolutional neural networks based on FPGA
CN109214504A (en) * 2018-08-24 2019-01-15 北京邮电大学深圳研究院 A kind of YOLO network forward inference accelerator design method based on FPGA
CN109447893A (en) * 2019-01-28 2019-03-08 深兰人工智能芯片研究院(江苏)有限公司 A kind of convolutional neural networks FPGA accelerate in image preprocessing method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DING, CW 等: "REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs", 《PROCEEDINGS OF THE 2019 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS》 *
DUY THANH NGUYEN 等: "A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection", 《IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS》 *
WAI, Y.J.等: "Fixed Point Implementation of Tiny-Yolo-v2 using OpenCL on FPGA", 《INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS》 *
张霎轲 et al.: "Implementation of an improved TINY YOLO real-time vehicle detection algorithm based on small Zynq SoC hardware acceleration", Journal of Computer Applications (《计算机应用》) *
段秉环 et al.: "Research on deep neural network compression methods for embedded applications", Aeronautical Computing Technique (《航空计算技术》) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11556614B2 (en) 2019-10-15 2023-01-17 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Apparatus and method for convolution operation
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
CN110717588A (en) * 2019-10-15 2020-01-21 百度在线网络技术(北京)有限公司 Apparatus and method for convolution operation
CN113495786A (en) * 2020-03-19 2021-10-12 杭州海康威视数字技术股份有限公司 Image convolution processing method and electronic equipment
CN113495786B (en) * 2020-03-19 2023-10-13 杭州海康威视数字技术股份有限公司 Image convolution processing method and electronic equipment
CN111459877A (en) * 2020-04-02 2020-07-28 北京工商大学 FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method
CN111459877B (en) * 2020-04-02 2023-03-24 北京工商大学 Winograd YOLOv2 target detection model method based on FPGA acceleration
CN111860781A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system realized based on FPGA
CN111967572A (en) * 2020-07-10 2020-11-20 逢亿科技(上海)有限公司 FPGA-based YOLO V3 and YOLO V3 Tiny network switching method
CN113139519A (en) * 2021-05-14 2021-07-20 陕西科技大学 Target detection system based on fully programmable system on chip
CN113139519B (en) * 2021-05-14 2023-12-22 陕西科技大学 Target detection system based on fully programmable system-on-chip
CN115049907A (en) * 2022-08-17 2022-09-13 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method

Also Published As

Publication number Publication date
CN110175670B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 Method and system for implementing a YOLOv2 detection network on an FPGA
CN109815886B (en) Pedestrian and vehicle detection method and system based on improved YOLOv3
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN108985450B (en) Vector processor-oriented convolution neural network operation vectorization method
CN109214504B (en) FPGA-based YOLO network forward reasoning accelerator design method
CN108171317A (en) * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 A data-reuse convolutional neural network accelerator based on SoC
KR20180034557A (en) Improving the performance of a two-dimensional array processor
CN106897143A (en) * 2015-12-21 2017-06-27 想象技术有限公司 Tile distribution to processing engines in a graphics processing system
US20210019594A1 (en) Convolutional neural network accelerating device and method
KR20180123846A (en) Logical-3d array reconfigurable accelerator for convolutional neural networks
CN110533022B (en) Target detection method, system, device and storage medium
CN105825468A (en) Graphics processing unit and graphics processing method thereof
CN113408423A (en) Aquatic product target real-time detection method suitable for TX2 embedded platform
Li et al. High throughput hardware architecture for accurate semi-global matching
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN109472734B (en) Target detection network based on FPGA and implementation method thereof
JP2021515339A (en) Machine perception and high density algorithm integrated circuits
CN112101113B (en) Lightweight unmanned aerial vehicle image small target detection method
CN117217274A (en) Vector processor, neural network accelerator, chip and electronic equipment
CN112149518A (en) Pine cone detection method based on BEGAN and YOLOV3 models
CN108197613B (en) Face detection optimization method based on deep convolution cascade network
CN115640833A (en) Accelerator and acceleration method for sparse convolutional neural network
CN113902904A (en) Lightweight network architecture system
CN111832336B (en) Improved C3D video behavior detection method

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant