CN109214504A - FPGA-based YOLO network forward inference accelerator design method - Google Patents

FPGA-based YOLO network forward inference accelerator design method

Info

Publication number
CN109214504A
Authority
CN
China
Prior art keywords
bram
layer
network
value
design method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810970836.2A
Other languages
Chinese (zh)
Other versions
CN109214504B (en)
Inventor
张轶凡
陈昊
应山川
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Original Assignee
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority to CN201810970836.2A
Publication of CN109214504A
Application granted
Publication of CN109214504B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention proposes an FPGA-based YOLO network forward inference accelerator design method. The accelerator comprises an FPGA chip and a DRAM: the memory BRAM in the FPGA chip serves as the data buffer, and the DRAM serves as the main storage. The accelerator design method comprises the following steps: (1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer; (2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip; (3) position mapping. The invention solves the prior-art technical problem that on-chip storage resources of FPGA chips grow more slowly than the rapidly increasing scale of neural networks, so that a general object detection network is difficult to port onto an FPGA chip under the traditional design approach, and it achieves higher speed while using fewer on-chip resources.

Description

FPGA-based YOLO network forward inference accelerator design method
Technical field
The present invention relates to the technical fields of deep learning and hardware architecture design, and in particular to a design method for accelerating the forward inference of an object detection network on an FPGA.
Background technique
In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied to computer vision tasks. For large-scale CNNs, however, the computation-intensive, storage-intensive, and resource-hungry nature of the workload poses a huge challenge. Facing such high computational pressure and large data throughput, traditional general-purpose processors can hardly reach practical performance, so hardware accelerators based on GPUs, FPGAs, and ASICs have been proposed and widely put into use.
The FPGA (Field Programmable Gate Array) is a product developed further on the basis of programmable devices such as the PAL, GAL, and EPLD. It appeared as a semi-custom circuit in the ASIC field, which both solves the inflexibility of fully custom circuits and overcomes the limited gate counts of the earlier programmable devices. The FPGA adopts the new concept of the Logic Cell Array (LCA), internally comprising three parts: configurable logic blocks (CLB), input/output blocks (IOB), and interconnect; one PROM can program multiple FPGAs. Owing to its flexible reconfigurability and outstanding performance-per-watt, the FPGA has become an important deep learning processor today.
The mainstream object detection network currently suited to hardware implementation is YOLO (You Only Look Once). This network is fast and structurally simple: it treats the object detection problem as a regression problem, and with a single convolutional neural network it directly predicts the positions and class probabilities of target boxes from the input image, realizing end-to-end object detection; such a structure is comparatively suitable for hardware implementation on an FPGA. Prior invention CN107392309A discloses a general FPGA-based fixed-point neural network convolution accelerator hardware architecture comprising a general AXI4 high-speed bus interface, highly parallel convolution-kernel and feature-map data buffers, segmented convolution-result buffers, a convolution calculator, a buffer controller, a state controller, and a direct memory access controller. That invention uses on-chip storage as the buffer and off-chip memory as the main data storage, and relies on an off-chip general-purpose processor for memory management to complete the computation of the entire convolutional network; designed this way, a single FPGA cannot complete the forward inference of an object detection network. Prior invention CN107463990A proposes an FPGA parallel acceleration method for convolutional neural networks comprising the steps of: (1) establishing a CNN model; (2) configuring the hardware architecture; (3) configuring the convolution arithmetic units. That invention loads the intermediate results of the whole network into on-chip storage, so the network scale that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store the intermediate results of the network layers in on-chip static memory and the weights required by the network in off-chip dynamic memory. With such a design, the on-chip storage space limits the network scale that can be accelerated. At this stage, as the complexity and accuracy demands of tasks rise, convolutional neural networks keep growing in scale and in total parameter count, but FPGA process technology and the on-chip storage resources it can accommodate have not grown nearly as fast; following the previous design methods, an FPGA cannot fully accommodate a network of this scale.
If the on-chip static memory BRAM is used as the data buffer and the off-chip dynamic memory DRAM serves as the main storage for the network's key data, then, since the storage space of dynamic memory is huge, networks with very large parameter counts can be accommodated, and the parallel computation of the convolution modules can be realized by reasonably allocating the memory bandwidth. The performance of this design method depends on the memory bandwidth, but compared with stacking on-chip storage resources, raising the communication bandwidth is easier to realize. The network referenced by the present invention is the YOLO-tiny version: its input size is 416*416*3 and it has nine convolutional layers in total; the final output is candidate boxes carrying class, position, and confidence information, and the computed results are mapped back into the original image by the region-mapping (region operation) algorithm.
Summary of the invention
To solve the prior-art technical problem that the growth rate of on-chip storage resources in FPGA chips lags behind the rapid growth of neural network scale, so that a general object detection network is difficult to port onto an FPGA chip under the traditional design approach, the present invention proposes an FPGA-based YOLO network forward inference accelerator targeting the YOLO-tiny network and the KC705 development platform. The specific technical solution is as follows:
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM; the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM. The accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: first establish the 256 decimal values (including positive zero and negative zero) corresponding to a given decimal point position of an 8-bit fixed-point number, then quantize the original data by the nearest-value principle. The quantized values are still represented as 32-bit floating point for ease of computation, and the detection accuracy of this quantization scheme is measured; the 8 possible decimal point positions are then traversed to obtain the one with the least impact on detection accuracy, which finally forms the weight quantization scheme of this layer;
b) First normalize the input feature map to a 0-1 distribution, then quantize this layer's input feature map using the method described in step a);
c) Using the quantized feature map from step b) as input, perform the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
Specifically, the computation process of each convolutional layer in step (2) is as follows:
a) Read the weight data required for the current round from the DRAM and place it into the BRAM;
b) Read this layer's feature map (FM) data to be convolved, completing all input data preparation;
c) Perform the convolution; after a round of convolution is completed, upload the data in the BRAM back to the DRAM, clear the temporary result data, and start the next round.
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution, and the resulting convolution results are accumulated into the convolution performed after switching to the next input channel; each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
Specifically, step (2) further includes pooling and activation operations when computing the final result of an output channel. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region. This maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
In steps (2) a) and b), the BRAM is configured with a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. In c), the BRAM is configured in true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: in a) and b), convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; in c), every layer uses one BRAM; each BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13*13*5 candidate boxes. The position of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height. The relative center coordinates are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
The output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation; the specific computation steps are as follows:
a) Extract the center coordinates of each candidate box in order, and set a flag bit for each candidate box to indicate whether the box is retained.
b) Select the first candidate box as the comparison object and compute the center-point distance between it and each candidate box behind it in the queue. When the distance exceeds a threshold, the flag bit of the compared box is set valid, indicating that the box should be retained; otherwise the flag bit of that box is set invalid and it no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next flag-valid candidate box after the current one.
c) Extract all flag-valid candidate boxes from the result memory, and print a marking box for each into the original image as the final detection result.
The invention has the following advantages:
1. The present invention uses the memory on the FPGA chip as the data buffer for convolution and the off-chip memory as the main storage device, with the convolutional layers coupled through the off-chip memory. This design method applies not only to the YOLO network but equally to other neural networks.
2. The resource allocation method used for each layer's convolution exploits the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution structure, this scheme uses fewer on-chip resources and achieves faster forward inference.
3. On the FPGA chip there is no direct data interaction between layers; the connections between layers are loosely coupled, which guarantees the stability of the system.
4. To accelerate the computation of the whole network, the present invention uses a simplified NMS: instead of computing overlapping areas, it uses the center-point distance between two boxes, which greatly increases the speed of the NMS step.
Detailed description of the invention
Fig. 1 is a schematic diagram of the computation structure and storage structure of each layer of the present invention.
Fig. 2 is the single-layer network computation flowchart of the present invention.
Specific embodiment
Embodiment 1
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM; the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM. The accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: when quantizing according to a given decimal point position of the 8-bit fixed-point number, first build the decimal look-up table for that position, i.e. find the 256 decimal values (including positive zero and negative zero), then quantize the original data by the nearest-value principle. Although the values change after quantization, the data remain 32-bit floating point for convenient computation on a GPU afterwards, and the detection accuracy of this quantization scheme is obtained. After traversing the 8 decimal point positions, the one with the least impact on detection accuracy is obtained, finally forming the weight quantization scheme of this layer;
b) First normalize all of the test input feature maps to a 0-1 distribution, then quantize this layer's input feature maps using the method described in step a);
c) Using the quantized feature map from step b) as input, perform the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
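To make steps a)-d) concrete, the following minimal Python sketch (an illustration, not the patented implementation) searches the decimal point position for one weight tensor. Here evaluate_accuracy is a hypothetical callback that runs the detection benchmark, and a plain symmetric signed 8-bit range stands in for the patent's 256-value table with positive and negative zero.

    import numpy as np

    def quantize_to_fixed(data, frac_bits):
        """8-bit fixed-point quantization with frac_bits fractional bits,
        by the nearest-value principle; the result stays 32-bit float so
        it can still be evaluated on a GPU."""
        scale = 2.0 ** frac_bits
        q = np.clip(np.round(data * scale), -128, 127)  # nearest of 256 codes
        return (q / scale).astype(np.float32)

    def best_frac_bits(weights, evaluate_accuracy):
        """Traverse the 8 candidate decimal point positions and keep the
        one with the least impact on detection accuracy."""
        best_pos, best_acc = 0, -1.0
        for frac_bits in range(8):
            acc = evaluate_accuracy(quantize_to_fixed(weights, frac_bits))
            if acc > best_acc:
                best_pos, best_acc = frac_bits, acc
        return best_pos

Per steps b)-d), the same search would be run alternately on each layer's weights and normalized input feature maps, with each layer's quantized output fed forward as the next layer's input.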
Specifically, the computation process of each convolutional layer in step (2) is as follows: first, the weight data required for the current round is read from the DRAM into the weight buffer BRAM; then this layer's feature map (FM) data to be convolved is read, and the convolution starts once all input data preparation is complete. After a round of convolution finishes, the data in the result buffer BRAM is uploaded back to the DRAM, the temporary result data is cleared, and the next round of computation begins. Since the computation of the next layer depends on the results of the previous layer, a ping-pong structure is used in the DRAM so that every layer can compute simultaneously without waiting on the others, thereby exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between layers; the connections between layers are loosely coupled, which guarantees the stability of the system.
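As a rough software analogue of this per-round dataflow, the sketch below models one layer's round loop; conv stands in for the convolution engine (assumed to return a NumPy array), and the buffer names are illustrative, not actual RTL interfaces.

    import numpy as np

    def run_layer(weight_rounds, fm_rounds, conv):
        """One layer's round loop: stage this round's weights and
        feature-map tile in 'BRAM' arrays, convolve, flush the results
        back to a DRAM-side list, then clear the temporary results."""
        dram_results = []
        for w_tile, fm_tile in zip(weight_rounds, fm_rounds):
            weight_bram = np.asarray(w_tile)          # weights: DRAM -> weight BRAM
            fm_bram = np.asarray(fm_tile)             # FM tile: DRAM -> input BRAM
            result_bram = conv(weight_bram, fm_bram)  # convolution for this round
            dram_results.append(result_bram.copy())   # upload results to DRAM
            result_bram[:] = 0                        # clear temporary results
        return dram_results

    # Ping-pong in DRAM: each inter-layer region has two banks; while layer N
    # writes one bank, layer N+1 reads the other bank filled in the previous
    # pass, then the roles swap, which lets all nine layers compute at once.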
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM. Since the BRAM resources on the FPGA chip are limited and this layer's picture is relatively large, only several consecutive rows of the picture are loaded at a time. By the principle of convolution, the convolution results of these rows are the partial results of the corresponding region (the rows at the same position) of some final output channel; after switching the input channel, when the convolution at the same position is computed, it must be accumulated with the previous partial results. Therefore, before this layer's module executes a convolution, the previous partial convolution results at the corresponding position of the output channel are first fetched from the DDR, so that after each convolution the module's result can be added to the value in the result memory BRAM and stored back into it. Each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
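The row-strip tiling and cross-channel accumulation can be pictured with the sketch below; scipy's correlate2d stands in for the convolution array, and partial_sums plays the role of the temporary results fetched back from the DDR.

    import numpy as np
    from scipy.signal import correlate2d

    def accumulate_strip(strip, kernels, partial_sums):
        """Convolve one input channel's row strip with every kernel and
        add the results onto the partial sums of the matching output rows.

        strip:        (rows, width) slice of a single input channel
        kernels:      (out_ch, kh, kw) kernel slices for this input channel
        partial_sums: (out_ch, rows_out, cols_out) values fetched from DDR
        """
        for oc, k in enumerate(kernels):
            partial_sums[oc] += correlate2d(strip, k, mode="valid")
        return partial_sums  # stored back to the result BRAM, then the DDR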
Specifically, step (2) further includes pooling and activation operations when computing the final result of an output channel. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region. This maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
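A behavioural sketch of this fused 2*2 max pooling and activation, assuming rows arrive one at a time from the convolution engine, even row lengths, and a plain ReLU with threshold zero:

    import numpy as np

    def fused_pool_relu(conv_rows):
        """Consume convolution output rows one by one: pair adjacent
        values within a row, hold the pairwise maxima of the odd row in
        'registers', merge them with the even row's maxima to obtain each
        2*2 region's maximum, then apply the ReLU threshold before
        storing to 'BRAM'."""
        pooled_rows, held = [], None
        for row in conv_rows:
            pair_max = np.maximum(row[0::2], row[1::2])        # split row in pairs
            if held is None:
                held = pair_max                                # kept in on-chip logic
            else:
                region_max = np.maximum(held, pair_max)        # max of the 2*2 region
                pooled_rows.append(np.maximum(region_max, 0))  # ReLU threshold
                held = None
        return np.stack(pooled_rows)                           # saved into BRAM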
In step (2), the BRAM serving as the data buffer must receive the data read out of the DRAM. To exploit the maximum bandwidth of the DRAM, the write port of the BRAM is set to a 512-bit data width with a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, which is the input width of the convolution operation. The result buffer BRAM must both read data from the DRAM and write data into the DRAM at the same time, so it is configured in true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
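For reference, the implied buffer capacity is 512 bit * 512 = 262,144 bit, i.e. 256 Kbit; one RAMB36E1 provides 36 Kbit, so the stated 7.5 blocks (seven full blocks plus one 18-Kbit half block) supply 270 Kbit, which is consistent with this configuration.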
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: for the input data buffers, convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for the output data buffers, every layer uses one BRAM. Each BRAM consumes 7.5 RAMB36E1, and the feature-map data buffers need 337.5 RAMB36E1 in total. Since BRAM resources are limited, ping-pong operation is done only at the output buffers, and a layer performs no convolution before the data in its input buffer is ready. The number of parallel computation channels of each layer is allocated in proportion to each layer's multiply-accumulate workload; the per-layer workload ratios and parallel channel counts are shown in Table 1. Each parallel input channel requires an individual BRAM for storage, whereas the result buffer needs only a single BRAM of equal size.
Table 1. Per-layer workload ratio for parallel computation and per-layer parallel channel (PE) count

Layer     One  Two  Three  Four  Five  Six  Seven  Eight  Nine
Ratio     1    2.5  2.5    2.5   2.5   2.5  10     20     1
PE count  1    2    2      2     2     2    8      16     1
Specifically, the convolution part is followed by the region-layer operation that performs position mapping. The output of the convolutional network contains the position information of 13*13*5 candidate boxes; the position of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height. These four values need some processing before they can be mapped onto actual picture locations: the relative center coordinates are mapped into absolute coordinates through the sigmoid function, and since the output is an 8-bit fixed-point representation, the corresponding outputs can be quantized into a look-up table to accelerate this mapping; the relative width and height are mapped into absolute values through the exponential function, where the look-up-table form likewise applies.
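Because the network output is an 8-bit fixed-point code, each mapping function only ever sees 256 distinct inputs and can be fully precomputed. Below is a small sketch of such look-up tables; frac_bits, the fractional bit count of the output code, is an assumed parameter of the chosen quantization scheme.

    import numpy as np

    def build_mapping_luts(frac_bits):
        """Precompute sigmoid and exponential look-up tables over all
        256 codes of a signed 8-bit fixed-point output."""
        codes = np.arange(-128, 128)            # every possible 8-bit code
        x = codes / (2.0 ** frac_bits)          # decode to real values
        sigmoid_lut = 1.0 / (1.0 + np.exp(-x))  # for the x, y center offsets
        exp_lut = np.exp(x)                     # for the w, h scales
        return sigmoid_lut, exp_lut

    def lookup(lut, code):
        """Map a signed 8-bit code through a table (index shifted by +128)."""
        return lut[int(code) + 128]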
The output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation; the specific computation steps are as follows. First, the center coordinates of each candidate box are extracted in order, and a flag bit is set for each candidate box to indicate whether the box is retained. Because the center-point distance serves as the criterion, prior knowledge tells us that among the candidate boxes output by the network, only boxes close together in sequence need to be compared, while boxes far apart in sequence can be ignored. The first candidate box is then selected as the comparison object, and the center-point distance between it and each candidate box behind it in the queue is computed; when the distance exceeds a threshold, the flag bit of the compared box is set valid, indicating that the box should be retained; otherwise the flag bit of that box is set invalid and it no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next flag-valid candidate box after the current one. Finally, all flag-valid candidate boxes are extracted from the result memory, and a marking box is printed into the original image for each as the final detection result.
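The simplified NMS can be sketched as follows. It replaces the usual overlap-area test with a center-point distance test; for clarity the sketch lets the comparison object scan every later box, whereas the hardware exploits the prior that only boxes close together in the output order need comparing.

    def center_distance_nms(centers, threshold):
        """Suppress candidate boxes by center-point distance. centers is
        a list of (x, y) in output order; returns one flag per box, where
        True means retained and False means suppressed."""
        n = len(centers)
        flags = [True] * n
        anchor = 0                                   # current comparison object
        while anchor < n:
            ax, ay = centers[anchor]
            for j in range(anchor + 1, n):
                if not flags[j]:
                    continue                         # already marked invalid
                dx, dy = centers[j][0] - ax, centers[j][1] - ay
                # squared distance avoids a hardware square root
                if dx * dx + dy * dy <= threshold * threshold:
                    flags[j] = False                 # too close: suppress
            anchor += 1                              # advance to the next
            while anchor < n and not flags[anchor]:  # flag-valid candidate
                anchor += 1
        return flags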
The above embodiments and description merely illustrate the principle of the present invention. Various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of the claimed invention. The protection scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

1. An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM, the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; characterized in that the accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
2. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the quantization process for a given layer in step (1) is as follows:
a) quantizing the weight data of the original network: first establishing the 256 decimal values (including positive zero and negative zero) corresponding to a given decimal point position of an 8-bit fixed-point number, then quantizing the original data by the nearest-value principle, the quantized values still being represented as 32-bit floating point for computation; obtaining the detection accuracy of this quantization scheme, then traversing the 8 decimal point positions to obtain the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) first normalizing the input feature map to a 0-1 distribution, then quantizing this layer's input feature map using the method described in step a);
c) using the quantized feature map from step b) as input, performing the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values, the resulting output serving as the input of the next layer;
d) following steps a)-c), alternately quantizing the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
3. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the computation process of each convolutional layer in step (2) is as follows:
a) reading the weight data required for the current round from the DRAM and placing it into the BRAM;
b) reading this layer's feature map (FM) data to be convolved, completing all input data preparation;
c) performing the convolution; after a round of convolution is completed, uploading the data in the BRAM back to the DRAM, clearing the temporary result data, and starting the next round.
4. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution, and the convolution results obtained are accumulated into the convolution performed after switching to the next input channel; each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
5. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that step (2) further includes pooling and activation operations when computing the final result of an output channel, the detailed process being as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region; this maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM, so that once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
6. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that, in steps (2) a) and b), the BRAM is configured with a 512-bit data width and a depth of 512 entries, one such BRAM consuming 7.5 RAMB36E1, with the minimum output width set to 16 bits; in c), the BRAM is configured in true dual-port mode with a port width of 16 bits; the data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
7. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the weight data storage scheme in step (2) is: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
8. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the feature-map data storage scheme in step (2) is: in a) and b), convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; in c), every layer uses one BRAM; each BRAM consumes 7.5 RAMB36E1.
9. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output of the convolutional network contains the position information of 13*13*5 candidate boxes, the position of each candidate box consisting of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height; the relative center coordinates are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
10. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation, the specific computation steps being as follows:
a) extracting the center coordinates of each candidate box in order, and setting a flag bit for each candidate box to indicate whether the box is retained;
b) selecting the first candidate box as the comparison object and computing the center-point distance between it and each candidate box behind it in the queue; when the distance exceeds a threshold, setting the flag bit of the compared box valid, indicating that the box should be retained; otherwise setting the flag bit of that box invalid so that it no longer participates in subsequent distance comparisons; when the comparison object has traversed to the end of the queue, replacing the comparison object with the next flag-valid candidate box after the current one;
c) extracting all flag-valid candidate boxes from the result memory, and printing a marking box for each into the original image as the final detection result.
CN201810970836.2A 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method Active CN109214504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method


Publications (2)

Publication Number Publication Date
CN109214504A 2019-01-15
CN109214504B CN109214504B (en) 2020-09-04

Family

ID=64989693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810970836.2A Active CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Country Status (1)

Country Link
CN (1) CN109214504B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
US20160379115A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A hardware architecture for realizing forward computation of convolutional neural networks
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 Convolutional neural network inference accelerator and method
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for inference acceleration of convolutional neural networks and its working method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING MA ET AL: "Hardware Implementation and Optimization of Tiny-YOLO Network", International Forum on Digital TV and Wireless Multimedia Communications *
VINCENT VANHOUCKE ET AL: "Improving the speed of neural networks on CPUs", Deep Learning and Unsupervised Feature Learning Workshop *
陆志坚: "基于FPGA的卷积神经网络并行结构研究" (Research on Parallel Structures of Convolutional Neural Networks Based on FPGA), China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110033086A (en) * 2019-04-15 2019-07-19 北京异构智能科技有限公司 Hardware accelerator for neural network convolution algorithm
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN112052935A (en) * 2019-06-06 2020-12-08 奇景光电股份有限公司 Convolutional neural network system
CN112085190A (en) * 2019-06-12 2020-12-15 上海寒武纪信息科技有限公司 Neural network quantitative parameter determination method and related product
CN112085190B (en) * 2019-06-12 2024-04-02 上海寒武纪信息科技有限公司 Method for determining quantization parameter of neural network and related product
CN110555516B (en) * 2019-08-27 2023-10-27 合肥辉羲智能科技有限公司 Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN112470138A (en) * 2019-11-29 2021-03-09 深圳市大疆创新科技有限公司 Computing device, method, processor and mobile equipment
WO2021102946A1 (en) * 2019-11-29 2021-06-03 深圳市大疆创新科技有限公司 Computing apparatus and method, processor, and movable device
CN113297128B (en) * 2020-02-24 2023-10-31 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN113297128A (en) * 2020-02-24 2021-08-24 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111752713A (en) * 2020-06-28 2020-10-09 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111752713B (en) * 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
US11868817B2 (en) 2020-06-28 2024-01-09 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium
CN111814675A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Convolutional neural network characteristic diagram assembling system based on FPGA supporting dynamic resolution
CN111814675B (en) * 2020-07-08 2023-09-29 上海雪湖科技有限公司 Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN113065303A (en) * 2021-03-06 2021-07-02 杭州电子科技大学 FPGA-based DSCNN accelerator layered verification method
CN113065303B (en) * 2021-03-06 2024-02-02 杭州电子科技大学 DSCNN accelerator layering verification method based on FPGA
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN115049907A (en) * 2022-08-17 2022-09-13 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN116737382A (en) * 2023-06-20 2023-09-12 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding
CN116737382B (en) * 2023-06-20 2024-01-02 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding

Also Published As

Publication number Publication date
CN109214504B (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN109214504A (en) A kind of YOLO network forward inference accelerator design method based on FPGA
CN111242282B (en) Deep learning model training acceleration method based on end edge cloud cooperation
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN106447034A Neural network processor based on data compression, design method and chip
CN110674742B (en) Remote sensing image road extraction method based on DLinkNet
CN107169563A Processing system and method applied to binary-weight convolutional networks
CN107340993A Apparatus and method for neural network operations supporting floating-point numbers with fewer bits
CN110321997A (en) High degree of parallelism computing platform, system and calculating implementation method
CN112465110A (en) Hardware accelerator for convolution neural network calculation optimization
CN108596331A (en) A kind of optimization method of cell neural network hardware structure
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
WO2023123919A1 (en) Data processing circuit, data processing method, and related product
CN107092961A (en) A kind of neural network processor and design method based on mode frequency statistical coding
CN113361695A (en) Convolutional neural network accelerator
CN113591509B (en) Training method of lane line detection model, image processing method and device
CN109460398A (en) Complementing method, device and the electronic equipment of time series data
CN115394336A (en) Storage and computation FPGA (field programmable Gate array) framework
CN103209328A (en) Multi-source satellite image real-time online processing technical method and device
CN113592885A (en) SegNet-RS network-based large obstacle contour segmentation method
CN112131444A (en) Low-space-overhead large-scale triangle counting method and system in graph
CN116795324A (en) Mixed precision floating-point multiplication device and mixed precision floating-point number processing method
CN116244612A (en) HTTP traffic clustering method and device based on self-learning parameter measurement
CN109741421A (en) A kind of Dynamic Graph color method based on GPU
CN115935888A (en) Neural network accelerating system
CN115293978A (en) Convolution operation circuit and method, image processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant