CN109214504A - Design method for an FPGA-based YOLO network forward-inference accelerator - Google Patents
Design method for an FPGA-based YOLO network forward-inference accelerator
- Publication number: CN109214504A
- Application number: CN201810970836.2A
- Authority
- CN
- China
- Prior art keywords
- bram
- layer
- network
- value
- design method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention proposes a design method for an FPGA-based YOLO network forward-inference accelerator. The accelerator comprises an FPGA chip and DRAM: the BRAM memory on the FPGA chip serves as the data buffer, and the DRAM serves as main storage. The design method comprises the following steps: (1) apply 8-bit fixed-point quantization to the original network data, finding the binary-point position with the least impact on detection accuracy to form a quantization scheme; quantization proceeds layer by layer; (2) the FPGA chip computes the nine convolutional layers of YOLO in parallel; (3) position mapping. This solves the prior-art technical problem that on-chip storage resources on FPGA chips have not grown as rapidly as neural networks, so that ordinary object-detection networks are difficult to port to an FPGA chip under traditional design thinking, and achieves the goal of higher speed with fewer on-chip resources.
Description
Technical field
The present invention relates to the technical fields of deep learning and hardware-architecture design, and more particularly to a design method for accelerating the forward inference of an object-detection network on an FPGA.
Background art
In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied to computer-vision tasks. For large-scale CNNs, however, the combination of intensive computation, intensive storage, and heavy resource consumption poses a serious challenge to such tasks. Faced with this computational pressure and data throughput, traditional general-purpose processors can hardly meet practical requirements, so hardware accelerators based on GPUs, FPGAs, and ASICs have been proposed and widely put into use.
An FPGA (Field Programmable Gate Array) is a further development of programmable devices such as the PAL, GAL, and EPLD. It emerged as a semi-custom circuit in the ASIC field, remedying the deficiencies of fully custom circuits while overcoming the limited gate counts of the earlier programmable devices. The FPGA adopts the concept of the Logic Cell Array (LCA), whose interior comprises three parts: configurable logic blocks (CLBs), input/output blocks (IOBs), and interconnect; one PROM can program multiple FPGAs. Its flexible reconfigurability and outstanding performance-per-watt have made the FPGA an important deep-learning processor today.
The mainstream object-detection network currently suited to hardware implementation is YOLO (You Only Look Once). This network is fast and structurally simple: it casts object detection as a regression problem, and with a single convolutional-network structure it predicts bounding-box positions and class probabilities directly from the input image, achieving end-to-end detection; such a structure is well suited to hardware realization on an FPGA. Prior invention CN107392309A discloses a general fixed-point neural-network convolution accelerator hardware architecture based on FPGA, comprising: a general AXI4 high-speed bus interface, highly parallel convolution cores, a feature-map data buffer, a segmented convolution-result buffer, convolution units, a buffer controller, a state controller, and a direct memory access controller. That invention uses on-chip storage as a buffer and off-chip memory as the main data storage, and relies on an off-chip general-purpose processor for memory management to complete the computation of the whole convolutional network; with such a structure, a single FPGA alone cannot complete the forward inference of an object-detection network. Prior invention CN107463990A proposes an FPGA parallel acceleration method for convolutional neural networks, comprising the steps: (1) establish the CNN model; (2) configure the hardware architecture; (3) configure the convolution units. That invention loads all intermediate results of the network into on-chip storage, so the size of the network that can be deployed is limited.
Existing FPGA-based neural-network accelerators usually store the intermediate results of the network layers in on-chip static memory and the network weights in off-chip dynamic memory. With such a design, on-chip storage capacity limits the size of the network that can be accelerated. At the present stage, as task complexity and accuracy demands rise, convolutional networks keep growing in scale and total parameter count, but FPGA process technology and the storage resources a chip can hold have not grown nearly as fast; under the earlier design methods, an FPGA simply cannot accommodate a network of this scale.
If instead the on-chip static memory (BRAM) serves as the data buffer while the off-chip dynamic memory (DRAM) serves as the network's main data storage, the huge capacity of dynamic memory can accommodate networks with very large parameter counts, and parallel computation of the convolution modules can be achieved by allocating memory bandwidth sensibly. The performance of this design method depends on memory bandwidth, but raising communication bandwidth is easier to achieve than stacking more on-chip storage resources. The network referenced by the present invention is the YOLO-tiny version: its input size is 416*416*3, the network has nine convolutional layers in total, and its final output is a set of candidate boxes carrying class, position, and confidence information; a region-mapping ("region" operation) algorithm maps the computed results back into the original image.
Summary of the invention
To solve the prior-art technical problem that the growth of on-chip storage resources on FPGA chips has not kept pace with the rapid growth of neural networks, so that ordinary object-detection networks are difficult to port to an FPGA chip under traditional design thinking, the present invention proposes an FPGA-based YOLO network forward-inference accelerator targeting the YOLO-tiny network and the KC705 development platform. The specific technical solution is as follows:
A design method for an FPGA-based YOLO network forward-inference accelerator, the accelerator comprising an FPGA chip and DRAM, the BRAM memory on the FPGA chip serving as the data buffer, the DRAM serving as main storage, and a ping-pong structure being used in DRAM; characterized in that the design method comprises the following steps:
(1) apply 8-bit fixed-point quantization to the original network data, obtaining the binary-point position with the least impact on detection accuracy and forming the quantization scheme; quantization proceeds layer by layer;
(2) the FPGA chip computes the nine convolutional layers of YOLO in parallel;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the layer's weights: first build the 256 decimal values representable by an 8-bit fixed-point number at a given binary-point position (including both positive zero and negative zero), then quantize the original data by rounding to the nearest representable value. The quantized values are still expressed as 32-bit floats so that the detection accuracy of such a scheme can be computed; the 8 candidate binary-point positions are then traversed, and the one with the least impact on detection accuracy is kept, finally forming the layer's weight-quantization scheme.
b) Normalize the input feature map to a 0-1 distribution, then quantize the layer's input feature map using the method described in step a).
c) Using the quantized feature map of step b) as input, run the forward pass of all pictures through this convolution layer only, loading parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer.
d) Alternately quantize each layer's weights and input feature map according to steps a)-c), finally obtaining the quantization scheme for all layers.
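The per-layer search of steps a)-d) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `score_fn` stands in for the detection-accuracy evaluation over the test pictures, and the two's-complement code range is a simplification of the patent's 256-value table, which uses sign-magnitude representation (it contains both positive and negative zero).

```python
def quantize_8bit(values, frac_bits):
    """Round each value to the nearest 8-bit fixed-point code with `frac_bits`
    fractional bits, then return floats again (the patent keeps quantized data
    in 32-bit float form so the network can still be scored on a GPU)."""
    scale = 2 ** frac_bits
    def nearest(v):
        code = max(-128, min(127, round(v * scale)))  # 256 representable codes
        return code / scale
    return [nearest(v) for v in values]

def best_frac_bits(values, score_fn, candidates=range(8)):
    """Sweep the 8 candidate binary-point positions and keep the one whose
    quantized data scores highest under score_fn (a stand-in for the
    detection-accuracy test the patent runs per layer)."""
    return max(candidates, key=lambda f: score_fn(quantize_8bit(values, f)))
```

For weights this search runs once per layer; the same routine is then reused on the normalized input feature maps, alternating layer by layer as step d) describes.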
Specifically, each convolutional layer in step (2) is computed as follows:
a) Read the weight data needed for this round of computation from DRAM and place it into BRAM;
b) Read the layer's feature-map (FM) data to be convolved, completing all input-data preparation;
c) Perform the convolution; when a round of convolution completes, upload the data in BRAM back to DRAM, clear the temporary result data, and start the next round of computation.
Specifically, when performing the first convolution layer in step (2), one of the three input-feature-map channels is first loaded from DRAM and convolved; the resulting convolution outputs are accumulated into the convolutions performed after switching to the next input channel. Each loaded input-feature region must be processed once by all convolution kernels before the next input-feature region is switched in.
Specifically, step (2) further includes pooling and activation operations, performed while the final result of a given output channel is being computed. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned into pairs and the maximum of each pair is recorded, held in logic resources on the FPGA chip. When the next row is computed, its outputs are likewise partitioned into pairs, the larger value of each pair is taken and compared against the maximum kept from the previous row, and the larger of these two maxima becomes the maximum of a 2*2 region. That value is then compared against the ReLU activation threshold, and the result is saved into BRAM. In this way, by the time the convolution producing the final result of an output channel completes, the channel's pooling and activation operations are complete as well.
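The row-streaming fusion of pooling and activation described above can be sketched in software as follows; this is a simplified model (plain ReLU threshold at zero, non-overlapping 2*2 windows, whole rows available at once), not the hardware itself.

```python
def fused_pool_relu(rows):
    """Streaming 2x2 max-pool fused with ReLU: the first row of each pair is
    split into adjacent pairs and their maxima buffered (the patent holds them
    in on-chip logic); when the second row arrives, its pairwise maxima are
    compared with the buffer to give each 2x2 window maximum, which is then
    clamped against the ReLU threshold (0) before being written out."""
    buffered = None
    out = []
    for row in rows:
        pair_max = [max(row[i], row[i + 1]) for i in range(0, len(row), 2)]
        if buffered is None:
            buffered = pair_max               # first row: just remember maxima
        else:
            out.append([max(0, max(a, b))     # second row: 2x2 max, then ReLU
                        for a, b in zip(buffered, pair_max)])
            buffered = None
    return out
```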
The BRAMs in steps (2) a) and b) are configured with a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1 blocks, and the minimum output width is set to 16 bits. The BRAM in c) is configured in true dual-port mode with a 16-bit port width. The data-storage overhead of the whole convolutional network comprises the feature-map and weight parts, totalling 425 RAMB36E1 blocks.
Specifically, the weight-data storage scheme in step (2) is as follows: convolution layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolution layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map storage scheme in step (2) is as follows: in a) and b), convolution layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; in c), each layer uses one BRAM. Each BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13*13*5 candidate boxes. Each box's position information comprises x, y, w, h values, denoting respectively the relative horizontal coordinate of the box centre, the relative vertical coordinate, the relative width, and the relative height. The horizontal and vertical relative coordinates are mapped into absolute coordinates through the sigmoid function; the width and height relative values are mapped into absolute values through the exponential function.
The output candidate boxes and confidence information of the convolutional network are used for an NMS operation, computed as follows:
a) Extract the centre-point coordinates of each candidate box in order, and assign each candidate box a flag bit indicating whether the whole box is retained.
b) Select the first candidate box as the reference, and compute the centre-point distance from the reference to each candidate box following it. When the distance exceeds a threshold, the compared box's flag bit is set valid, indicating the box should be retained; otherwise the box's flag bit is set invalid and the box no longer participates in subsequent distance comparisons. When the reference's comparisons reach the end of the queue, the reference is replaced by the next flag-valid candidate box after it.
c) Extract all flag-valid candidate boxes from the result memory, and draw a marking box for each into the original image as the final detection result.
The invention has the following advantages:
1. The present invention uses the memory on the FPGA chip as the data buffer for convolution and the memory outside the chip as the main storage device, with the convolutional layers coupled through off-chip memory. This design method applies not only to the YOLO network but equally to other neural networks.
2. The per-layer resource allocation for convolution exploits the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution structure, this design uses fewer on-chip resources and achieves faster forward inference.
3. On the FPGA chip there is no direct data interaction between layers; the connection between layers is loosely coupled, which guarantees the stability of the system.
4. A simplified computation accelerates the whole network: instead of computing overlap areas, the centre-point distance of two boxes is used, which greatly improves the speed of the NMS step.
Brief description of the drawings
Fig. 1 is a schematic diagram of the computation structure and storage structure of each layer of the invention.
Fig. 2 is the single-layer network computation flow chart of the invention.
Specific embodiment
Embodiment 1
A design method for an FPGA-based YOLO network forward-inference accelerator, the accelerator comprising an FPGA chip and DRAM, the BRAM memory on the FPGA chip serving as the data buffer, the DRAM serving as main storage, and a ping-pong structure being used in DRAM; characterized in that the design method comprises the following steps:
(1) apply 8-bit fixed-point quantization to the original network data, obtaining the binary-point position with the least impact on detection accuracy and forming the quantization scheme; quantization proceeds layer by layer;
(2) the FPGA chip computes the nine convolutional layers of YOLO in parallel;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the layer's weights: to quantize at a given binary-point position of an 8-bit fixed-point number, first build the decimal lookup table for that position, i.e. find the 256 representable decimal values (including positive zero and negative zero), then quantize the original data by rounding to the nearest value. Although the values change after quantization, the data remain 32-bit floats, which is convenient for subsequent evaluation on a GPU and yields the detection accuracy of such a scheme. After traversing the 8 candidate binary-point positions, the one with the least impact on detection accuracy is kept, finally forming the layer's weight-quantization scheme.
b) Normalize all integration-test input feature maps to a 0-1 distribution, then quantize the layer's input feature map using the method described in step a).
c) Using the quantized feature map of step b) as input, run the forward pass of all pictures through this convolution layer only, loading parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer.
d) Alternately quantize each layer's weights and input feature map according to steps a)-c), finally obtaining the quantization scheme for all layers.
Specifically, each convolutional layer in step (2) is computed as follows: first the weight data needed for this round of computation is read from DRAM and placed into the weight-buffer BRAM; then the layer's feature-map (FM) data to be convolved is read, and once all input data is prepared the convolution begins. When a round of convolution completes, the data in the result-buffer BRAM is uploaded back to DRAM, the temporary result data is cleared, and the next round of computation starts. Because each layer's computation depends on the previous layer's results, a ping-pong structure is used in DRAM so that every layer can compute simultaneously instead of waiting on the others, exploiting the parallel computing power of the FPGA. On the FPGA chip there is no direct data interaction between layers; the connection between layers is loosely coupled, which guarantees the stability of the system.
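The per-layer schedule and the ping-pong banks can be modelled in a few lines. This is an illustrative software model only (DRAM as a dict, BRAMs as plain lists), and the bank-keying scheme is an assumption about how the double buffering might be organized, not the patent's memory map.

```python
def run_layer(dram, layer_id, bank, conv):
    """One round of a layer's schedule: a) stage this round's weights from
    off-chip DRAM into the weight-buffer BRAM; b) stage the input feature
    map; c) convolve, then flush the result buffer back to DRAM. Ping-pong
    is modelled with two banks per feature map: the layer reads bank `bank`
    while writing its output to the opposite bank, so the next layer can
    consume one bank while this layer fills the other."""
    w_bram = list(dram[("w", layer_id)])           # a) weights DRAM -> BRAM
    fm_bram = list(dram[("fm", layer_id, bank)])   # b) feature map DRAM -> BRAM
    result = conv(fm_bram, w_bram)                 # c) convolution in fabric
    dram[("fm", layer_id + 1, 1 - bank)] = result  #    flush to the other bank
    return result
```

With a toy `conv` (elementwise scaling), layer 0 reading bank 0 leaves its output in layer 1's bank 1, ready for the next layer to consume.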
Specifically, when performing the first convolution layer in step (2), one of the three input-feature-map channels is first loaded from DRAM for convolution. Because BRAM resources on the FPGA chip are limited and this layer's picture is large, only several consecutive rows of the picture are loaded at a time. By the principle of convolution, the results of these rows are the temporary results of the corresponding region (the same rows) of some final output channel; when the convolution at the same position is reached after switching input channels, it must be accumulated with the earlier temporary results. Therefore, before this layer's module executes a convolution, it first fetches from DDR the earlier temporary convolution results at the same position of the corresponding output channel; each time the convolution module produces a result, it is added to the value in the result-memory BRAM and stored back into it. Each loaded input-feature region must be processed once by all convolution kernels before the next input-feature region is switched in.
Specifically, step (2) further includes pooling and activation operations, performed while the final result of a given output channel is being computed. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned into pairs and the maximum of each pair is recorded, held in logic resources on the FPGA chip. When the next row is computed, its outputs are likewise partitioned into pairs, the larger value of each pair is taken and compared against the maximum kept from the previous row, and the larger of these two maxima becomes the maximum of a 2*2 region. That value is then compared against the ReLU activation threshold, and the result is saved into BRAM. In this way, by the time the convolution producing the final result of an output channel completes, the channel's pooling and activation operations are complete as well.
In step (2), the BRAMs serving as data buffers must receive the data read out of DRAM. To exploit the maximum bandwidth of the DRAM, the write ports of these BRAMs are configured with a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1 blocks, and the minimum output width is set to 16 bits, which serves as the input width of the convolution operation. The result-buffer BRAM must simultaneously read data from DRAM and write data back into it, so it is configured in true dual-port mode with a 16-bit port width. The data-storage overhead of the whole convolutional network comprises the feature-map and weight parts, totalling 425 RAMB36E1 blocks.
Specifically, the weight-data storage scheme in step (2) is as follows: convolution layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolution layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map storage scheme in step (2) is as follows: for the input data buffer, convolution layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for the output data buffer, each layer uses one BRAM. Each BRAM consumes 7.5 RAMB36E1, and the feature-map data buffers need 337.5 RAMB36E1 in total. Because BRAM resources are limited, ping-pong operation is performed only at the output buffers, and no layer begins convolution before the data in its input buffer is ready. The number of parallel computation channels per layer is allocated in proportion to each layer's multiply-add workload; each layer's workload ratio and parallel channel count are shown in Table 1. Each parallel input channel requires one individual BRAM for storage, whereas the result buffer needs only a single equally large BRAM.
Table 1. Per-layer workload ratios and parallel channel counts

Layer    | 1 | 2   | 3   | 4   | 5   | 6   | 7  | 8  | 9 |
Ratio    | 1 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 10 | 20 | 1 |
PE count | 1 | 2   | 2   | 2   | 2   | 2   | 8  | 16 | 1 |
Specifically, the convolution part is followed by the region-layer operation of position mapping. The output of the convolutional network contains the position information of 13*13*5 candidate boxes; each box's position information comprises x, y, w, h values, denoting respectively the relative horizontal coordinate of the box centre, the relative vertical coordinate, the relative width, and the relative height. These four values require some processing before they map into actual picture coordinates. The horizontal and vertical relative coordinates pass through the sigmoid function into absolute coordinates; since the output is an 8-bit fixed-point representation, the corresponding outputs can be quantized into a lookup table to accelerate this mapping process. The width and height relative values are mapped into absolute values through the exponential function, where the lookup-table form applies equally.
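Because the mapping's inputs are 8-bit codes, the whole sigmoid (and likewise the exponential) can be tabulated once over all 256 codes, as the passage describes. A sketch follows, with the binary-point position of the output layer left as a parameter since the patent leaves that choice to the quantizer.

```python
import math

def sigmoid_lut(frac_bits):
    """Precompute sigmoid over every signed 8-bit code (-128..127) at the
    given binary-point position, so the runtime mapping is a single table
    lookup instead of an exponential evaluation."""
    scale = 2 ** frac_bits
    return [1.0 / (1.0 + math.exp(-(code / scale))) for code in range(-128, 128)]

def sigmoid_q(code, lut):
    """Look up the sigmoid of a signed 8-bit fixed-point code."""
    return lut[code + 128]
```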
The output candidate boxes and confidence information of the convolutional network are used for an NMS operation, computed as follows. First, the centre-point coordinates of each candidate box are extracted in order, and each candidate box is assigned a flag bit indicating whether the whole box is retained. Since centre-point distance is used as the suppression metric, prior information tells us that within the network's output queue only nearby candidate boxes need comparing, and distant ones can be ignored. The first candidate box is then selected as the reference, and the centre-point distance from the reference to each candidate box following it is computed: when the distance exceeds a threshold, the compared box's flag bit is set valid, indicating the box should be retained; otherwise the box's flag bit is set invalid and the box no longer participates in subsequent distance comparisons. When the reference's comparisons reach the end of the queue, the reference is replaced by the next flag-valid candidate box after it. Finally, all flag-valid candidate boxes are extracted from the result memory, and a marking box is drawn for each into the original image as the final detection result.
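The centre-distance suppression above can be sketched as follows; a minimal model that assumes the queue ordering already groups likely duplicates, as the prior-information argument in the text suggests.

```python
def center_distance_nms(centers, threshold):
    """The patent's simplified NMS: a reference box suppresses any later box
    whose centre lies within `threshold` of its own (no overlap areas are
    computed); once the reference finishes the queue, the next still-valid
    box becomes the new reference. Returns indices of surviving boxes."""
    keep = [True] * len(centers)
    for ref in range(len(centers)):
        if not keep[ref]:
            continue                      # suppressed boxes never compare
        rx, ry = centers[ref]
        for j in range(ref + 1, len(centers)):
            if keep[j]:
                dx, dy = centers[j][0] - rx, centers[j][1] - ry
                if (dx * dx + dy * dy) ** 0.5 <= threshold:
                    keep[j] = False       # too close: same object, drop it
    return [i for i, k in enumerate(keep) if k]
```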
The above embodiments and description merely illustrate the principle of the present invention. Various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of the claimed invention. The scope of the present invention is defined by the appended claims and their equivalents.
Claims (10)
1. A design method for an FPGA-based YOLO network forward-inference accelerator, the accelerator comprising an FPGA chip and DRAM, the BRAM memory on the FPGA chip serving as the data buffer, the DRAM serving as main storage, and a ping-pong structure being used in DRAM; characterized in that the design method comprises the following steps:
(1) apply 8-bit fixed-point quantization to the original network data, obtaining the binary-point position with the least impact on detection accuracy and forming the quantization scheme; quantization proceeds layer by layer;
(2) the FPGA chip computes the nine convolutional layers of YOLO in parallel;
(3) position mapping.
2. The design method for an FPGA-based YOLO network forward-inference accelerator according to claim 1, characterized in that the quantization process for a given layer in step (1) is as follows:
a) quantize the layer's weights: first build the 256 decimal values representable by an 8-bit fixed-point number at a given binary-point position (including positive zero and negative zero), then quantize the original data by rounding to the nearest representable value; the quantized values are still expressed as 32-bit floats for computation, giving the detection accuracy of such a scheme; the 8 candidate binary-point positions are traversed and the one with the least impact on detection accuracy is kept, finally forming the layer's weight-quantization scheme;
b) normalize the input feature map to a 0-1 distribution, then quantize the layer's input feature map using the method described in step a);
c) using the quantized feature map of step b) as input, run the forward pass of all pictures through this convolution layer only, loading parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer;
d) alternately quantize each layer's weights and input feature map according to steps a)-c), finally obtaining the quantization scheme for all layers.
3. The design method for an FPGA-based YOLO network forward-inference accelerator according to claim 1, characterized in that each convolutional layer in step (2) is computed as follows:
a) read the weight data needed for this round of computation from DRAM and place it into BRAM;
b) read the layer's feature-map (FM) data to be convolved, completing all input-data preparation;
c) perform the convolution; when a round of convolution completes, upload the data in BRAM back to DRAM, clear the temporary result data, and start the next round of computation.
4. The design method for an FPGA-based YOLO network forward-inference accelerator according to claim 3, characterized in that when performing the first convolution layer in step (2), one of the three input-feature-map channels is first loaded from DRAM and convolved, and the resulting convolution outputs are accumulated into the convolutions performed after switching input channels; each loaded input-feature region must be processed once by all convolution kernels before the next input-feature region is switched in.
5. The design method for an FPGA-based YOLO network forward-inference accelerator according to claim 3, characterized in that step (2) further includes pooling and activation operations, performed while the final result of a given output channel is being computed; the detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned into pairs and the maximum of each pair is recorded, held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned into pairs, the larger value of each pair is taken and compared against the maximum kept from the previous row, and the larger of these two maxima becomes the maximum of a 2*2 region; that value is then compared against the ReLU activation threshold and the result is saved into BRAM, so that once the convolution producing the final result of an output channel completes, the channel's pooling and activation operations are complete as well.
6. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the BRAMs in steps (2) a) and b) are configured with a data width of 512 bits and a depth of 512 entries, each such BRAM consuming 7.5 RAMB36E1 blocks, with the minimum output width set to 16 bits; the BRAM in c) is configured in true dual-port mode with a port width of 16 bits; the data-storage overhead of the entire convolutional network consists of the feature maps and the weights, totaling 425 RAMB36E1 blocks.
7. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the storage scheme for the weight data in step (2) is as follows: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1 blocks; layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1 blocks; layer 9 uses one BRAM, consuming 7.5 RAMB36E1 blocks.
8. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the storage scheme for the feature map data in step (2) is as follows: in a) and b), convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses one BRAM; in c), each layer uses one BRAM; every such BRAM consumes 7.5 RAMB36E1 blocks.
9. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output of the convolutional network includes the location information of 13*13*5 candidate boxes, the location information of each candidate box consisting of the values x, y, w, h, which respectively denote the relative horizontal coordinate of the box center, the relative vertical coordinate, the relative width, and the relative height; the relative horizontal and vertical coordinates are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
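This coordinate mapping matches the standard YOLOv2 box decoding, which can be sketched as follows. The grid-cell indices `cx, cy` and anchor priors `pw, ph` are assumptions drawn from the published YOLO formulation; the claim itself does not name them.

```python
import math

def decode_box(x, y, w, h, cx, cy, pw, ph):
    """Map relative box predictions to absolute grid coordinates.

    x, y pass through a sigmoid and are offset by the cell indices;
    w, h pass through the exponential and scale the anchor priors.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(x)   # absolute center x, in grid units
    by = cy + sigmoid(y)   # absolute center y
    bw = pw * math.exp(w)  # absolute width
    bh = ph * math.exp(h)  # absolute height
    return bx, by, bw, bh
```

The sigmoid keeps the center inside its grid cell, while the exponential keeps the width and height strictly positive.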
10. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output candidate boxes of the convolutional network carry confidence information for performing an NMS operation, the specific calculation steps being as follows:
a) extracting the center-point coordinates of each candidate box in order, and assigning a flag bit to each candidate box to indicate whether the box is retained;
b) selecting the first candidate box as the comparison object and computing the distance between its center point and that of each candidate box behind it; when the distance exceeds a threshold, the flag bit of the compared box is set to valid, indicating that the box is to be retained; otherwise the flag bit of the box is set to invalid and the box no longer participates in subsequent distance comparisons; when the comparison object has traversed to the last box in the queue, the comparison object is replaced by the next candidate box after the current one whose flag bit is valid;
c) extracting all candidate boxes whose flag bits are valid from the result memory, generating a bounding box for each, and drawing it on the original image as the final detection result.
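Steps a) and b) can be modeled in software as follows. This is a hedged sketch: boxes are reduced to (x, y) centers, `threshold` is the distance threshold of the claim, and Euclidean distance is an assumption, since the claim does not name the metric.

```python
import math

def center_distance_nms(centers, threshold):
    """Flag-bit NMS over candidate-box centers.

    Boxes whose centers lie within `threshold` of the current
    comparison object are flagged invalid (suppressed); the rest
    are retained. Returns the list of flag bits (True = retained).
    """
    n = len(centers)
    flags = [True] * n
    i = 0                          # index of the current comparison object
    while i < n:
        for j in range(i + 1, n):
            if not flags[j]:
                continue           # suppressed boxes skip further comparisons
            d = math.dist(centers[i], centers[j])
            if d <= threshold:
                flags[j] = False   # too close: suppress
        # advance to the next still-valid box after the comparison object
        i += 1
        while i < n and not flags[i]:
            i += 1
    return flags
```

The flag-bit formulation avoids ever moving box data in memory: suppression is a single-bit write, which is what makes it cheap in hardware.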
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810970836.2A CN109214504B (en) | 2018-08-24 | 2018-08-24 | FPGA-based YOLO network forward reasoning accelerator design method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109214504A true CN109214504A (en) | 2019-01-15 |
CN109214504B CN109214504B (en) | 2020-09-04 |
Family
ID=64989693
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810970836.2A Active CN109214504B (en) | 2018-08-24 | 2018-08-24 | FPGA-based YOLO network forward reasoning accelerator design method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214504B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7454546B1 (en) * | 2006-01-27 | 2008-11-18 | Xilinx, Inc. | Architecture for dynamically reprogrammable arbitration using memory |
US20160379115A1 (en) * | 2015-06-29 | 2016-12-29 | Microsoft Technology Licensing, Llc | Deep neural network processing on hardware accelerators with stacked memory |
CN106529517A (en) * | 2016-12-30 | 2017-03-22 | 北京旷视科技有限公司 | Image processing method and image processing device |
CN106650592A (en) * | 2016-10-05 | 2017-05-10 | 北京深鉴智能科技有限公司 | Target tracking system |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107451659A (en) * | 2017-07-27 | 2017-12-08 | 清华大学 | Neural network accelerator for bit-width partitioning and implementation method thereof |
CN108108809A (en) * | 2018-03-05 | 2018-06-01 | 山东领能电子科技有限公司 | Hardware architecture for inference acceleration of convolutional neural networks and working method thereof |
CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
EP3352113A1 (en) * | 2017-01-18 | 2018-07-25 | Hitachi, Ltd. | Calculation system and calculation method of neural network |
Worldwide Applications (1)
- 2018-08-24 CN CN201810970836.2A patent/CN109214504B/en active Active
Non-Patent Citations (3)
Title |
---|
JING MA ET AL: "Hardware Implementation and Optimization of Tiny-YOLO Network", International Forum on Digital TV and Wireless Multimedia Communications *
VINCENT VANHOUCKE ET AL: "Improving the speed of neural networks on CPUs", Deep Learning and Unsupervised Feature Learning Workshop *
LU ZHIJIAN: "Research on the Parallel Structure of FPGA-based Convolutional Neural Networks", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175670B (en) * | 2019-04-09 | 2020-12-08 | 华中科技大学 | Method and system for realizing YOLOv2 detection network based on FPGA |
CN110175670A (en) * | 2019-04-09 | 2019-08-27 | 华中科技大学 | A kind of method and system for realizing YOLOv2 detection network based on FPGA |
CN110033086A (en) * | 2019-04-15 | 2019-07-19 | 北京异构智能科技有限公司 | Hardware accelerator for neural network convolution algorithm |
CN110222835A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of convolutional neural networks hardware system and operation method based on zero value detection |
CN110263925A (en) * | 2019-06-04 | 2019-09-20 | 电子科技大学 | A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA |
CN110263925B (en) * | 2019-06-04 | 2022-03-15 | 电子科技大学 | Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA |
CN112052935A (en) * | 2019-06-06 | 2020-12-08 | 奇景光电股份有限公司 | Convolutional neural network system |
CN112085190A (en) * | 2019-06-12 | 2020-12-15 | 上海寒武纪信息科技有限公司 | Neural network quantitative parameter determination method and related product |
CN112085190B (en) * | 2019-06-12 | 2024-04-02 | 上海寒武纪信息科技有限公司 | Method for determining quantization parameter of neural network and related product |
CN110555516B (en) * | 2019-08-27 | 2023-10-27 | 合肥辉羲智能科技有限公司 | Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA |
CN110555516A (en) * | 2019-08-27 | 2019-12-10 | 上海交通大学 | FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method |
CN112470138A (en) * | 2019-11-29 | 2021-03-09 | 深圳市大疆创新科技有限公司 | Computing device, method, processor and mobile equipment |
WO2021102946A1 (en) * | 2019-11-29 | 2021-06-03 | 深圳市大疆创新科技有限公司 | Computing apparatus and method, processor, and movable device |
CN113297128B (en) * | 2020-02-24 | 2023-10-31 | 中科寒武纪科技股份有限公司 | Data processing method, device, computer equipment and storage medium |
CN113297128A (en) * | 2020-02-24 | 2021-08-24 | 中科寒武纪科技股份有限公司 | Data processing method, data processing device, computer equipment and storage medium |
CN111752713A (en) * | 2020-06-28 | 2020-10-09 | 浪潮电子信息产业股份有限公司 | Method, device and equipment for balancing load of model parallel training task and storage medium |
CN111752713B (en) * | 2020-06-28 | 2022-08-05 | 浪潮电子信息产业股份有限公司 | Method, device and equipment for balancing load of model parallel training task and storage medium |
US11868817B2 (en) | 2020-06-28 | 2024-01-09 | Inspur Electronic Information Industry Co., Ltd. | Load balancing method, apparatus and device for parallel model training task, and storage medium |
CN111814675A (en) * | 2020-07-08 | 2020-10-23 | 上海雪湖科技有限公司 | Convolutional neural network characteristic diagram assembling system based on FPGA supporting dynamic resolution |
CN111814675B (en) * | 2020-07-08 | 2023-09-29 | 上海雪湖科技有限公司 | Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA |
CN113065303A (en) * | 2021-03-06 | 2021-07-02 | 杭州电子科技大学 | FPGA-based DSCNN accelerator layered verification method |
CN113065303B (en) * | 2021-03-06 | 2024-02-02 | 杭州电子科技大学 | DSCNN accelerator layering verification method based on FPGA |
CN115049907B (en) * | 2022-08-17 | 2022-10-28 | 四川迪晟新达类脑智能技术有限公司 | FPGA-based YOLOV4 target detection network implementation method |
CN115049907A (en) * | 2022-08-17 | 2022-09-13 | 四川迪晟新达类脑智能技术有限公司 | FPGA-based YOLOV4 target detection network implementation method |
CN116737382A (en) * | 2023-06-20 | 2023-09-12 | 中国人民解放军国防科技大学 | Neural network reasoning acceleration method based on area folding |
CN116737382B (en) * | 2023-06-20 | 2024-01-02 | 中国人民解放军国防科技大学 | Neural network reasoning acceleration method based on area folding |
Also Published As
Publication number | Publication date |
---|---|
CN109214504B (en) | 2020-09-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109214504A (en) | FPGA-based YOLO network forward reasoning accelerator design method | |
CN111242282B (en) | Deep learning model training acceleration method based on end edge cloud cooperation | |
CN111667051A (en) | Neural network accelerator suitable for edge equipment and neural network acceleration calculation method | |
CN106447034A (en) | Neural network processor based on data compression, design method and chip | |
CN110674742B (en) | Remote sensing image road extraction method based on DLinkNet | |
CN107169563A (en) | Processing system and method applied to two-value weight convolutional network | |
CN107340993A (en) | A kind of apparatus and method for the neural network computing for supporting less digit floating number | |
CN110321997A (en) | High degree of parallelism computing platform, system and calculating implementation method | |
CN112465110A (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN108596331A (en) | A kind of optimization method of cell neural network hardware structure | |
CN103177414A (en) | Structure-based dependency graph node similarity concurrent computation method | |
WO2023123919A1 (en) | Data processing circuit, data processing method, and related product | |
CN107092961A (en) | A kind of neural network processor and design method based on mode frequency statistical coding | |
CN113361695A (en) | Convolutional neural network accelerator | |
CN113591509B (en) | Training method of lane line detection model, image processing method and device | |
CN109460398A (en) | Complementing method, device and the electronic equipment of time series data | |
CN115394336A (en) | Storage and computation FPGA (field programmable Gate array) framework | |
CN103209328A (en) | Multi-source satellite image real-time online processing technical method and device | |
CN113592885A (en) | SegNet-RS network-based large obstacle contour segmentation method | |
CN112131444A (en) | Low-space-overhead large-scale triangle counting method and system in graph | |
CN116795324A (en) | Mixed precision floating-point multiplication device and mixed precision floating-point number processing method | |
CN116244612A (en) | HTTP traffic clustering method and device based on self-learning parameter measurement | |
CN109741421A (en) | A kind of Dynamic Graph color method based on GPU | |
CN115935888A (en) | Neural network accelerating system | |
CN115293978A (en) | Convolution operation circuit and method, image processing apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||