CN109214504A - FPGA-based YOLO network forward inference accelerator design method - Google Patents

FPGA-based YOLO network forward inference accelerator design method

Info

Publication number
CN109214504A
Authority
CN
China
Prior art keywords
bram
layer
network
value
design method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810970836.2A
Other languages
Chinese (zh)
Other versions
CN109214504B (en)
Inventor
张轶凡
陈昊
应山川
李玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Original Assignee
Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute Of Beijing University Of Posts And Telecommunications
Priority to CN201810970836.2A
Publication of CN109214504A
Application granted
Publication of CN109214504B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

The invention proposes an FPGA-based YOLO network forward inference accelerator design method. The accelerator comprises an FPGA chip and a DRAM: the memory BRAM in the FPGA chip serves as the data buffer, and the DRAM serves as the main storage. The accelerator design method comprises the following steps: (1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer; (2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip; (3) position mapping. The invention solves the prior-art technical problem that on-chip storage resources of FPGA chips grow more slowly than the rapidly increasing scale of neural networks, so that a general object detection network is difficult to port onto an FPGA chip under the traditional design approach, and it achieves higher speed while using fewer on-chip resources.

Description

FPGA-based YOLO network forward inference accelerator design method
Technical field
The present invention relates to the technical fields of deep learning and hardware architecture design, and in particular to a design method for accelerating the forward inference of an object detection network on an FPGA.
Background technique
In recent years, machine learning algorithms based on convolutional neural networks (CNNs) have been widely applied to computer vision tasks. For large-scale CNNs, however, the computation-intensive, storage-intensive, and resource-hungry nature of the workload poses a huge challenge. Facing such high computational pressure and large data throughput, traditional general-purpose processors can hardly reach practical performance, so hardware accelerators based on GPUs, FPGAs, and ASICs have been proposed and widely put into use.
The FPGA (Field Programmable Gate Array) is a product developed further on the basis of programmable devices such as the PAL, GAL, and EPLD. It appeared as a semi-custom circuit in the ASIC field, which both solves the inflexibility of fully custom circuits and overcomes the limited gate counts of the earlier programmable devices. The FPGA adopts the new concept of the Logic Cell Array (LCA), internally comprising three parts: configurable logic blocks (CLB), input/output blocks (IOB), and interconnect; one PROM can program multiple FPGAs. Owing to its flexible reconfigurability and outstanding performance-per-watt, the FPGA has become an important deep learning processor today.
The mainstream object detection network currently suited to hardware implementation is YOLO (You Only Look Once). This network is fast and structurally simple: it treats the object detection problem as a regression problem, and with a single convolutional neural network it directly predicts the positions and class probabilities of target boxes from the input image, realizing end-to-end object detection; such a structure is comparatively suitable for hardware implementation on an FPGA. Prior invention CN107392309A discloses a general FPGA-based fixed-point neural network convolution accelerator hardware architecture comprising a general AXI4 high-speed bus interface, highly parallel convolution-kernel and feature-map data buffers, segmented convolution-result buffers, a convolution calculator, a buffer controller, a state controller, and a direct memory access controller. That invention uses on-chip storage as the buffer and off-chip memory as the main data storage, and relies on an off-chip general-purpose processor for memory management to complete the computation of the entire convolutional network; designed this way, a single FPGA cannot complete the forward inference of an object detection network. Prior invention CN107463990A proposes an FPGA parallel acceleration method for convolutional neural networks comprising the steps of: (1) establishing a CNN model; (2) configuring the hardware architecture; (3) configuring the convolution arithmetic units. That invention loads the intermediate results of the whole network into on-chip storage, so the network scale that can be deployed is limited.
Existing FPGA-based neural network accelerators usually store the intermediate results of the network layers in on-chip static memory and the weights required by the network in off-chip dynamic memory. With such a design, the on-chip storage space limits the network scale that can be accelerated. At this stage, as the complexity and accuracy demands of tasks rise, convolutional neural networks keep growing in scale and in total parameter count, but FPGA process technology and the on-chip storage resources it can accommodate have not grown nearly as fast; following the previous design methods, an FPGA cannot fully accommodate a network of this scale.
If the on-chip static memory BRAM is used as the data buffer and the off-chip dynamic memory DRAM serves as the main storage for the network's key data, then, since the storage space of dynamic memory is huge, networks with very large parameter counts can be accommodated, and the parallel computation of the convolution modules can be realized by reasonably allocating the memory bandwidth. The performance of this design method depends on the memory bandwidth, but compared with stacking on-chip storage resources, raising the communication bandwidth is easier to realize. The network referenced by the present invention is the YOLO-tiny version: its input size is 416*416*3 and it has nine convolutional layers in total; the final output is candidate boxes carrying class, position, and confidence information, and the computed results are mapped back into the original image by the region-mapping (region operation) algorithm.
Summary of the invention
To solve the prior-art technical problem that the growth rate of on-chip storage resources in FPGA chips lags behind the rapid growth of neural network scale, so that a general object detection network is difficult to port onto an FPGA chip under the traditional design approach, the present invention proposes an FPGA-based YOLO network forward inference accelerator targeting the YOLO-tiny network and the KC705 development platform. The specific technical solution is as follows:
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM; the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM. The accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: first establish the 256 decimal values (including positive zero and negative zero) corresponding to a given decimal point position of an 8-bit fixed-point number, then quantize the original data by the nearest-value principle. The quantized values are still represented as 32-bit floating point for ease of computation, and the detection accuracy of this quantization scheme is measured; the 8 possible decimal point positions are then traversed to obtain the one with the least impact on detection accuracy, which finally forms the weight quantization scheme of this layer;
b) First normalize the input feature map to a 0-1 distribution, then quantize this layer's input feature map using the method described in step a);
c) Using the quantized feature map from step b) as input, perform the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
Specifically, the computation process of each convolutional layer in step (2) is as follows:
a) Read the weight data required for the current round from the DRAM and place it into the BRAM;
b) Read this layer's feature map (FM) data to be convolved, completing all input data preparation;
c) Perform the convolution; after a round of convolution is completed, upload the data in the BRAM back to the DRAM, clear the temporary result data, and start the next round.
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution, and the resulting convolution results are accumulated into the convolution performed after switching to the next input channel; each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
Specifically, step (2) further includes pooling and activation operations when computing the final result of an output channel. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region. This maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
In steps (2) a) and b), the BRAM is configured with a 512-bit data width and a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits. In c), the BRAM is configured in true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: in a) and b), convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; in c), every layer uses one BRAM; each BRAM consumes 7.5 RAMB36E1.
Specifically, the output of the convolutional network contains the position information of 13*13*5 candidate boxes. The position of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height. The relative center coordinates are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
The output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation; the specific computation steps are as follows:
a) Extract the center coordinates of each candidate box in order, and set a flag bit for each candidate box to indicate whether the box is retained.
b) Select the first candidate box as the comparison object and compute the center-point distance between it and each candidate box behind it in the queue. When the distance exceeds a threshold, the flag bit of the compared box is set valid, indicating that the box should be retained; otherwise the flag bit of that box is set invalid and it no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next flag-valid candidate box after the current one.
c) Extract all flag-valid candidate boxes from the result memory, and print a marking box for each into the original image as the final detection result.
The invention has the following advantages:
1. The present invention uses the memory on the FPGA chip as the data buffer for convolution and the off-chip memory as the main storage device, with the convolutional layers coupled through the off-chip memory. This design method applies not only to the YOLO network but equally to other neural networks.
2. The resource allocation method used for each layer's convolution exploits the parallel computing capability of the whole network to the greatest extent; compared with a serial convolution structure, this scheme uses fewer on-chip resources and achieves faster forward inference.
3. On the FPGA chip there is no direct data interaction between layers; the connections between layers are loosely coupled, which guarantees the stability of the system.
4. To accelerate the computation of the whole network, the present invention uses a simplified NMS: instead of computing overlapping areas, it uses the center-point distance between two boxes, which greatly increases the speed of the NMS step.
Detailed description of the invention
Fig. 1 is a schematic diagram of the computation structure and storage structure of each layer of the present invention.
Fig. 2 is the single-layer network computation flowchart of the present invention.
Specific embodiment
Embodiment 1
An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM; the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM. The accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
Specifically, the quantization process for a given layer in step (1) is as follows:
a) Quantize the weight data of the original network: when quantizing according to a given decimal point position of the 8-bit fixed-point number, first build the decimal look-up table for that position, i.e. find the 256 decimal values (including positive zero and negative zero), then quantize the original data by the nearest-value principle. Although the values change after quantization, the data remain 32-bit floating point for convenient computation on a GPU afterwards, and the detection accuracy of this quantization scheme is obtained. After traversing the 8 decimal point positions, the one with the least impact on detection accuracy is obtained, finally forming the weight quantization scheme of this layer;
b) First normalize all of the test input feature maps to a 0-1 distribution, then quantize this layer's input feature maps using the method described in step a);
c) Using the quantized feature map from step b) as input, perform the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values; the resulting output serves as the input of the next layer;
d) Following steps a)-c), alternately quantize the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
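To make steps a)-d) concrete, the following minimal Python sketch (an illustration, not the patented implementation) searches the decimal point position for one weight tensor. Here evaluate_accuracy is a hypothetical callback that runs the detection benchmark, and a plain symmetric signed 8-bit range stands in for the patent's 256-value table with positive and negative zero.

    import numpy as np

    def quantize_to_fixed(data, frac_bits):
        """8-bit fixed-point quantization with frac_bits fractional bits,
        by the nearest-value principle; the result stays 32-bit float so
        it can still be evaluated on a GPU."""
        scale = 2.0 ** frac_bits
        q = np.clip(np.round(data * scale), -128, 127)  # nearest of 256 codes
        return (q / scale).astype(np.float32)

    def best_frac_bits(weights, evaluate_accuracy):
        """Traverse the 8 candidate decimal point positions and keep the
        one with the least impact on detection accuracy."""
        best_pos, best_acc = 0, -1.0
        for frac_bits in range(8):
            acc = evaluate_accuracy(quantize_to_fixed(weights, frac_bits))
            if acc > best_acc:
                best_pos, best_acc = frac_bits, acc
        return best_pos

Per steps b)-d), the same search would be run alternately on each layer's weights and normalized input feature maps, with each layer's quantized output fed forward as the next layer's input.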
Specifically, the computation process of each convolutional layer in step (2) is as follows: first, the weight data required for the current round is read from the DRAM into the weight buffer BRAM; then this layer's feature map (FM) data to be convolved is read, and the convolution starts once all input data preparation is complete. After a round of convolution finishes, the data in the result buffer BRAM is uploaded back to the DRAM, the temporary result data is cleared, and the next round of computation begins. Since the computation of the next layer depends on the results of the previous layer, a ping-pong structure is used in the DRAM so that every layer can compute simultaneously without waiting on the others, thereby exploiting the parallel computing capability of the FPGA. On the FPGA chip there is no direct data interaction between layers; the connections between layers are loosely coupled, which guarantees the stability of the system.
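As a rough software analogue of this per-round dataflow, the sketch below models one layer's round loop; conv stands in for the convolution engine (assumed to return a NumPy array), and the buffer names are illustrative, not actual RTL interfaces.

    import numpy as np

    def run_layer(weight_rounds, fm_rounds, conv):
        """One layer's round loop: stage this round's weights and
        feature-map tile in 'BRAM' arrays, convolve, flush the results
        back to a DRAM-side list, then clear the temporary results."""
        dram_results = []
        for w_tile, fm_tile in zip(weight_rounds, fm_rounds):
            weight_bram = np.asarray(w_tile)          # weights: DRAM -> weight BRAM
            fm_bram = np.asarray(fm_tile)             # FM tile: DRAM -> input BRAM
            result_bram = conv(weight_bram, fm_bram)  # convolution for this round
            dram_results.append(result_bram.copy())   # upload results to DRAM
            result_bram[:] = 0                        # clear temporary results
        return dram_results

    # Ping-pong in DRAM: each inter-layer region has two banks; while layer N
    # writes one bank, layer N+1 reads the other bank filled in the previous
    # pass, then the roles swap, which lets all nine layers compute at once.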
Specifically, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM. Since the BRAM resources on the FPGA chip are limited and this layer's picture is relatively large, only several consecutive rows of the picture are loaded at a time. By the principle of convolution, the convolution results of these rows are the partial results of the corresponding region (the rows at the same position) of some final output channel; after switching the input channel, when the convolution at the same position is computed, it must be accumulated with the previous partial results. Therefore, before this layer's module executes a convolution, the previous partial convolution results at the corresponding position of the output channel are first fetched from the DDR, so that after each convolution the module's result can be added to the value in the result memory BRAM and stored back into it. Each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
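The row-strip tiling and cross-channel accumulation can be pictured with the sketch below; scipy's correlate2d stands in for the convolution array, and partial_sums plays the role of the temporary results fetched back from the DDR.

    import numpy as np
    from scipy.signal import correlate2d

    def accumulate_strip(strip, kernels, partial_sums):
        """Convolve one input channel's row strip with every kernel and
        add the results onto the partial sums of the matching output rows.

        strip:        (rows, width) slice of a single input channel
        kernels:      (out_ch, kh, kw) kernel slices for this input channel
        partial_sums: (out_ch, rows_out, cols_out) values fetched from DDR
        """
        for oc, k in enumerate(kernels):
            partial_sums[oc] += correlate2d(strip, k, mode="valid")
        return partial_sums  # stored back to the result BRAM, then the DDR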
Specifically, step (2) further includes pooling and activation operations when computing the final result of an output channel. The detailed process is as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region. This maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM. In this way, once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
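A behavioural sketch of this fused 2*2 max pooling and activation, assuming rows arrive one at a time from the convolution engine, even row lengths, and a plain ReLU with threshold zero:

    import numpy as np

    def fused_pool_relu(conv_rows):
        """Consume convolution output rows one by one: pair adjacent
        values within a row, hold the pairwise maxima of the odd row in
        'registers', merge them with the even row's maxima to obtain each
        2*2 region's maximum, then apply the ReLU threshold before
        storing to 'BRAM'."""
        pooled_rows, held = [], None
        for row in conv_rows:
            pair_max = np.maximum(row[0::2], row[1::2])        # split row in pairs
            if held is None:
                held = pair_max                                # kept in on-chip logic
            else:
                region_max = np.maximum(held, pair_max)        # max of the 2*2 region
                pooled_rows.append(np.maximum(region_max, 0))  # ReLU threshold
                held = None
        return np.stack(pooled_rows)                           # saved into BRAM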
In step (2), the BRAM serving as the data buffer must receive the data read out of the DRAM. To exploit the maximum bandwidth of the DRAM, the write port of the BRAM is set to a 512-bit data width with a depth of 512 entries; one such BRAM consumes 7.5 RAMB36E1, and the minimum output width is set to 16 bits, which is the input width of the convolution operation. The result buffer BRAM must both read data from the DRAM and write data into the DRAM at the same time, so it is configured in true dual-port mode with a port width of 16 bits. The data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
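For reference, the implied buffer capacity is 512 bit * 512 = 262,144 bit, i.e. 256 Kbit; one RAMB36E1 provides 36 Kbit, so the stated 7.5 blocks (seven full blocks plus one 18-Kbit half block) supply 270 Kbit, which is consistent with this configuration.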
Specifically, the weight data storage scheme in step (2) is as follows: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
Specifically, the feature-map data storage scheme in step (2) is as follows: for the input data buffers, convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; for the output data buffers, every layer uses one BRAM. Each BRAM consumes 7.5 RAMB36E1, and the feature-map data buffers need 337.5 RAMB36E1 in total. Since BRAM resources are limited, ping-pong operation is done only at the output buffers, and a layer performs no convolution before the data in its input buffer is ready. The number of parallel computation channels of each layer is allocated in proportion to each layer's multiply-accumulate workload; the per-layer workload ratios and parallel channel counts are shown in Table 1. Each parallel input channel requires an individual BRAM for storage, whereas the result buffer needs only a single BRAM of equal size.
Table 1. Per-layer workload ratio for parallel computation and per-layer parallel channel (PE) count

Layer     One  Two  Three  Four  Five  Six  Seven  Eight  Nine
Ratio     1    2.5  2.5    2.5   2.5   2.5  10     20     1
PE count  1    2    2      2     2     2    8      16     1
Specifically, the convolution part is followed by the region-layer operation that performs position mapping. The output of the convolutional network contains the position information of 13*13*5 candidate boxes; the position of each candidate box consists of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height. These four values need some processing before they can be mapped onto actual picture locations: the relative center coordinates are mapped into absolute coordinates through the sigmoid function, and since the output is an 8-bit fixed-point representation, the corresponding outputs can be quantized into a look-up table to accelerate this mapping; the relative width and height are mapped into absolute values through the exponential function, where the look-up-table form likewise applies.
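Because the network output is an 8-bit fixed-point code, each mapping function only ever sees 256 distinct inputs and can be fully precomputed. Below is a small sketch of such look-up tables; frac_bits, the fractional bit count of the output code, is an assumed parameter of the chosen quantization scheme.

    import numpy as np

    def build_mapping_luts(frac_bits):
        """Precompute sigmoid and exponential look-up tables over all
        256 codes of a signed 8-bit fixed-point output."""
        codes = np.arange(-128, 128)            # every possible 8-bit code
        x = codes / (2.0 ** frac_bits)          # decode to real values
        sigmoid_lut = 1.0 / (1.0 + np.exp(-x))  # for the x, y center offsets
        exp_lut = np.exp(x)                     # for the w, h scales
        return sigmoid_lut, exp_lut

    def lookup(lut, code):
        """Map a signed 8-bit code through a table (index shifted by +128)."""
        return lut[int(code) + 128]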
The output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation; the specific computation steps are as follows. First, the center coordinates of each candidate box are extracted in order, and a flag bit is set for each candidate box to indicate whether the box is retained. Because the center-point distance serves as the criterion, prior knowledge tells us that among the candidate boxes output by the network, only boxes close together in sequence need to be compared, while boxes far apart in sequence can be ignored. The first candidate box is then selected as the comparison object, and the center-point distance between it and each candidate box behind it in the queue is computed; when the distance exceeds a threshold, the flag bit of the compared box is set valid, indicating that the box should be retained; otherwise the flag bit of that box is set invalid and it no longer participates in subsequent distance comparisons. When the comparison object has traversed to the end of the queue, the comparison object is replaced by the next flag-valid candidate box after the current one. Finally, all flag-valid candidate boxes are extracted from the result memory, and a marking box is printed into the original image for each as the final detection result.
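The simplified NMS can be sketched as follows. It replaces the usual overlap-area test with a center-point distance test; for clarity the sketch lets the comparison object scan every later box, whereas the hardware exploits the prior that only boxes close together in the output order need comparing.

    def center_distance_nms(centers, threshold):
        """Suppress candidate boxes by center-point distance. centers is
        a list of (x, y) in output order; returns one flag per box, where
        True means retained and False means suppressed."""
        n = len(centers)
        flags = [True] * n
        anchor = 0                                   # current comparison object
        while anchor < n:
            ax, ay = centers[anchor]
            for j in range(anchor + 1, n):
                if not flags[j]:
                    continue                         # already marked invalid
                dx, dy = centers[j][0] - ax, centers[j][1] - ay
                # squared distance avoids a hardware square root
                if dx * dx + dy * dy <= threshold * threshold:
                    flags[j] = False                 # too close: suppress
            anchor += 1                              # advance to the next
            while anchor < n and not flags[anchor]:  # flag-valid candidate
                anchor += 1
        return flags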
The above embodiments and description merely illustrate the principle of the present invention. Various changes and improvements may be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of the claimed invention. The protection scope of the present invention is defined by the appended claims and their equivalents.

Claims (10)

1. An FPGA-based YOLO network forward inference accelerator design method, wherein the accelerator comprises an FPGA chip and a DRAM, the memory BRAM in the FPGA chip serves as the data buffer, the DRAM serves as the main storage, and a ping-pong structure is used in the DRAM; characterized in that the accelerator design method comprises the following steps:
(1) performing 8-bit fixed-point quantization on the original network data, obtaining the decimal point position with the least impact on detection accuracy, and forming a quantization scheme, where the quantization proceeds layer by layer;
(2) computing the nine convolutional layers of YOLO in parallel on the FPGA chip;
(3) position mapping.
2. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the quantization process for a given layer in step (1) is as follows:
a) quantizing the weight data of the original network: first establishing the 256 decimal values (including positive zero and negative zero) corresponding to a given decimal point position of an 8-bit fixed-point number, then quantizing the original data by the nearest-value principle, the quantized values still being represented as 32-bit floating point for computation; obtaining the detection accuracy of this quantization scheme, then traversing the 8 decimal point positions to obtain the one with the least impact on detection accuracy, finally forming the weight quantization scheme of this layer;
b) first normalizing the input feature map to a 0-1 distribution, then quantizing this layer's input feature map using the method described in step a);
c) using the quantized feature map from step b) as input, performing the forward propagation of all pictures through this layer of convolution only, loading the parameters as the quantized 32-bit values, the resulting output serving as the input of the next layer;
d) following steps a)-c), alternately quantizing the weights and input feature maps of every layer to finally obtain the quantization scheme of all layers.
3. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the computation process of each convolutional layer in step (2) is as follows:
a) reading the weight data required for the current round from the DRAM and placing it into the BRAM;
b) reading this layer's feature map (FM) data to be convolved, completing all input data preparation;
c) performing the convolution; after a round of convolution is completed, uploading the data in the BRAM back to the DRAM, clearing the temporary result data, and starting the next round.
4. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that, when performing the first-layer convolution in step (2), one of the three channels of the input feature map is first loaded from the DRAM for convolution, and the convolution results obtained are accumulated into the convolution performed after switching to the next input channel; each loaded input feature map must be computed once with all convolution kernels before switching to the next input feature region.
5. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that step (2) further includes pooling and activation operations when computing the final result of an output channel, the detailed process being as follows: as the convolution results of a row are computed one by one, the row is partitioned in pairs and the maximum of each pair is recorded and held in logic resources on the FPGA chip; when the next row is computed, its outputs are likewise partitioned in pairs, the larger value of each pair is taken and compared with the maximum elected from the previous row, and the larger of these two maxima serves as the maximum of the corresponding 2*2 region; this maximum is then compared with the threshold of the ReLU activation function and the result is saved into the BRAM, so that once the convolution producing the final result of an output channel is done, the pooling and activation operations of that channel are completed at the same time.
6. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that, in steps (2) a) and b), the BRAM is configured with a 512-bit data width and a depth of 512 entries, one such BRAM consuming 7.5 RAMB36E1, with the minimum output width set to 16 bits; in c), the BRAM is configured in true dual-port mode with a port width of 16 bits; the data storage overhead of the entire convolutional network consists of the feature-map and weight parts, totaling 425 RAMB36E1.
7. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the weight data storage scheme in step (2) is: convolutional layers 1-3 share one BRAM, consuming 7.5 RAMB36E1; convolutional layers 4-8 each use one BRAM, each consuming 14.5 RAMB36E1; layer 9 uses one BRAM, consuming 7.5 RAMB36E1.
8. The FPGA-based YOLO network forward inference accelerator design method according to claim 3, characterized in that the feature-map data storage scheme in step (2) is: in a) and b), convolutional layer 1 uses one BRAM, layers 2-6 each use two BRAMs, layer 7 uses eight BRAMs, layer 8 uses ten BRAMs, and layer 9 uses nine BRAMs; in c), every layer uses one BRAM; each BRAM consumes 7.5 RAMB36E1.
9. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output of the convolutional network contains the position information of 13*13*5 candidate boxes, the position of each candidate box consisting of the values x, y, w, h, which respectively denote the relative abscissa and ordinate of the box center and the relative width and height; the relative center coordinates are mapped into absolute coordinates through the sigmoid function, and the relative width and height are mapped into absolute values through the exponential function.
10. The FPGA-based YOLO network forward inference accelerator design method according to claim 1, characterized in that the output candidate boxes of the convolutional network carry confidence information and are used for the NMS operation, the specific computation steps being as follows:
a) extracting the center coordinates of each candidate box in order, and setting a flag bit for each candidate box to indicate whether the box is retained;
b) selecting the first candidate box as the comparison object and computing the center-point distance between it and each candidate box behind it in the queue; when the distance exceeds a threshold, setting the flag bit of the compared box valid, indicating that the box should be retained; otherwise setting the flag bit of that box invalid so that it no longer participates in subsequent distance comparisons; when the comparison object has traversed to the end of the queue, replacing the comparison object with the next flag-valid candidate box after the current one;
c) extracting all flag-valid candidate boxes from the result memory, and printing a marking box for each into the original image as the final detection result.
CN201810970836.2A 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method Active CN109214504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810970836.2A CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method


Publications (2)

Publication Number Publication Date
CN109214504A 2019-01-15
CN109214504B CN109214504B (en) 2020-09-04

Family

ID=64989693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810970836.2A Active CN109214504B (en) 2018-08-24 2018-08-24 FPGA-based YOLO network forward reasoning accelerator design method

Country Status (1)

Country Link
CN (1) CN109214504B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7454546B1 (en) * 2006-01-27 2008-11-18 Xilinx, Inc. Architecture for dynamically reprogrammable arbitration using memory
US20160379115A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Deep neural network processing on hardware accelerators with stacked memory
CN106650592A (en) * 2016-10-05 2017-05-10 北京深鉴智能科技有限公司 Target tracking system
CN106529517A (en) * 2016-12-30 2017-03-22 北京旷视科技有限公司 Image processing method and image processing device
EP3352113A1 (en) * 2017-01-18 2018-07-25 Hitachi, Ltd. Calculation system and calculation method of neural network
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A hardware architecture for realizing forward computation of convolutional neural networks
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof
CN108182471A (en) * 2018-01-24 2018-06-19 上海岳芯电子科技有限公司 Convolutional neural network inference accelerator and method
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A hardware architecture for inference acceleration of convolutional neural networks and its working method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JING MA ET AL: "Hardware Implementation and Optimization of Tiny-YOLO Network", International Forum on Digital TV and Wireless Multimedia Communications *
VINCENT VANHOUCKE ET AL: "Improving the speed of neural networks on CPUs", Deep Learning and Unsupervised Feature Learning Workshop *
陆志坚: "基于FPGA的卷积神经网络并行结构研究" (Research on Parallel Structures of Convolutional Neural Networks Based on FPGA), China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175670B (en) * 2019-04-09 2020-12-08 华中科技大学 Method and system for realizing YOLOv2 detection network based on FPGA
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110033086A (en) * 2019-04-15 2019-07-19 北京异构智能科技有限公司 Hardware accelerator for neural network convolution algorithm
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN112052935A (en) * 2019-06-06 2020-12-08 奇景光电股份有限公司 Convolutional neural network system
CN112085190A (en) * 2019-06-12 2020-12-15 上海寒武纪信息科技有限公司 Neural network quantitative parameter determination method and related product
CN112085190B (en) * 2019-06-12 2024-04-02 上海寒武纪信息科技有限公司 Method for determining quantization parameter of neural network and related product
CN110555516B (en) * 2019-08-27 2023-10-27 合肥辉羲智能科技有限公司 Method for realizing low-delay hardware accelerator of YOLOv2-tiny neural network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN112470138A (en) * 2019-11-29 2021-03-09 深圳市大疆创新科技有限公司 Computing device, method, processor and mobile equipment
WO2021102946A1 (en) * 2019-11-29 2021-06-03 深圳市大疆创新科技有限公司 Computing apparatus and method, processor, and movable device
CN113297128B (en) * 2020-02-24 2023-10-31 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN113297128A (en) * 2020-02-24 2021-08-24 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111752713A (en) * 2020-06-28 2020-10-09 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
CN111752713B (en) * 2020-06-28 2022-08-05 浪潮电子信息产业股份有限公司 Method, device and equipment for balancing load of model parallel training task and storage medium
US11868817B2 (en) 2020-06-28 2024-01-09 Inspur Electronic Information Industry Co., Ltd. Load balancing method, apparatus and device for parallel model training task, and storage medium
CN111814675A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Convolutional neural network characteristic diagram assembling system based on FPGA supporting dynamic resolution
CN111814675B (en) * 2020-07-08 2023-09-29 上海雪湖科技有限公司 Convolutional neural network feature map assembly system supporting dynamic resolution based on FPGA
CN113065303A (en) * 2021-03-06 2021-07-02 杭州电子科技大学 FPGA-based DSCNN accelerator layered verification method
CN113065303B (en) * 2021-03-06 2024-02-02 杭州电子科技大学 DSCNN accelerator layering verification method based on FPGA
CN115049907B (en) * 2022-08-17 2022-10-28 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN115049907A (en) * 2022-08-17 2022-09-13 四川迪晟新达类脑智能技术有限公司 FPGA-based YOLOV4 target detection network implementation method
CN116737382A (en) * 2023-06-20 2023-09-12 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding
CN116737382B (en) * 2023-06-20 2024-01-02 中国人民解放军国防科技大学 Neural network reasoning acceleration method based on area folding

Also Published As

Publication number Publication date
CN109214504B (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN109214504A (en) A kind of YOLO network forward inference accelerator design method based on FPGA
CN111242282B (en) Deep learning model training acceleration method based on end edge cloud cooperation
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN106447034A Neural network processor based on data compression, design method and chip
CN110674742B (en) Remote sensing image road extraction method based on DLinkNet
CN107169563A Processing system and method applied to binary-weight convolutional networks
CN107340993A Apparatus and method for neural network operations supporting floating-point numbers with fewer bits
CN110321997A (en) High degree of parallelism computing platform, system and calculating implementation method
CN112465110A (en) Hardware accelerator for convolution neural network calculation optimization
CN108596331A (en) A kind of optimization method of cell neural network hardware structure
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
WO2023123919A1 (en) Data processing circuit, data processing method, and related product
CN107092961A (en) A kind of neural network processor and design method based on mode frequency statistical coding
CN113361695A (en) Convolutional neural network accelerator
CN113591509B (en) Training method of lane line detection model, image processing method and device
CN109460398A (en) Complementing method, device and the electronic equipment of time series data
CN115394336A (en) Storage and computation FPGA (field programmable Gate array) framework
CN103209328A (en) Multi-source satellite image real-time online processing technical method and device
CN113592885A (en) SegNet-RS network-based large obstacle contour segmentation method
CN112131444A (en) Low-space-overhead large-scale triangle counting method and system in graph
CN116795324A (en) Mixed precision floating-point multiplication device and mixed precision floating-point number processing method
CN116244612A (en) HTTP traffic clustering method and device based on self-learning parameter measurement
CN109741421A (en) A kind of Dynamic Graph color method based on GPU
CN115935888A (en) Neural network accelerating system
CN115293978A (en) Convolution operation circuit and method, image processing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant