CN111459877B - Winograd YOLOv2 target detection model method based on FPGA acceleration - Google Patents


Info

Publication number
CN111459877B
CN111459877B (application CN202010254820.9A)
Authority
CN
China
Prior art keywords
winograd
yolov2
target detection
data
model
Prior art date
Legal status: Active
Application number
CN202010254820.9A
Other languages
Chinese (zh)
Other versions
CN111459877A
Inventor
于重重
鲍春
谢涛
常乐
冯文彬
Current Assignee
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Original Assignee
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Priority date
Filing date
Publication date
Application filed by Beijing Technology and Business University, CCTEG China Coal Technology and Engineering Group Corp filed Critical Beijing Technology and Business University
Priority to CN202010254820.9A
Publication of CN111459877A
Application granted
Publication of CN111459877B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a Winograd YOLOv2 target detection model method based on FPGA acceleration, which adopts a PYNQ board whose main control chip comprises a processing system (PS) end and a programmable logic (PL) end. The PS end caches the YOLO model and the feature-map data of the image to be detected; the PL end caches the parameters of the YOLO model and the image to be detected in on-chip RAM (random access memory), deploys a YOLO accelerator with the Winograd algorithm, completes the accelerated model operations, forms the data path of the hardware accelerator, and realizes target detection on the image to be detected; the operation result of the acceleration circuit can then be read out, and the image is preprocessed and displayed. The technical scheme of the invention reduces the computational complexity of the YOLO algorithm, shortens the FPGA's computation time when accelerating the YOLO algorithm through a storage-optimization algorithm for the FPGA accelerator, speeds up target detection, and effectively improves detection performance.

Description

Winograd YOLOv2 target detection model method based on FPGA acceleration
Technical Field
The invention belongs to the technical field of computer vision and edge computing, and relates to a design method of an FPGA accelerator for a target detection model.
Background
In recent years, with the development of machine vision and edge computing, target detection and recognition network models based on deep learning have developed rapidly and found wide application in fields such as video scene monitoring, robot control, and unmanned vehicles. Representative models are single-shot multibox detection (SSD), Faster R-CNN, and the you-only-look-once (YOLO) series, among which the YOLO algorithm offers performance advantages in both speed and accuracy.
Most deep-learning-based target detection and recognition models run on graphics processing units (GPUs), whose large number of parallel computing units gives them a pronounced performance advantage on convolutional neural networks dominated by repeated multiply-add operations. However, edge computing must be performed on small, fast, low-power devices, requirements that GPUs struggle to meet. Application-specific integrated circuits (ASICs) and FPGAs stand out for edge computing, and FPGAs have three advantages: 1) high flexibility: an FPGA can execute any logic function that an ASIC can, with the particular advantage that the chip's function can be changed at any time; 2) short development time: an FPGA can be programmed directly without a tape-out; 3) low cost: since no tape-out is needed, an FPGA avoids the fabrication cost of an ASIC and is better suited to small-scale deployment.
Suda et al. propose a fixed-point convolutional neural network acceleration design using the OpenCL framework, together with a systematic method to minimize execution time under given FPGA resource constraints. (Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks [C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016: 16-25.)
The OpenCL acceleration system designed by Aydonat et al. reduces the multiply-accumulate operations of convolution by caching all intermediate feature maps on chip and applying the Winograd algorithm, greatly improving performance. (Ling A C, Aydonat U, O'Connell S, et al. ...(TM) SDK [C]. The 5th International Workshop. ACM, 2017.)
There are also many research works and achievements on FPGA acceleration of the YOLO model. Nguyen et al. use an RTL circuit to accelerate the YOLOv2 model, binarize the weights in the network to reduce DSP consumption in the FPGA, and reduce DRAM accesses and power consumption through data reuse and dynamic random access. (Nguyen D T, Nguyen T N, Kim H, et al. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.) Nakahara et al. design a complete acceleration flow combining a binarized network and a support vector machine (SVM) in a lightweight YOLOv2 model, achieving good results. (H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, "A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, Feb. 2018, pp. 31-40.)
YOLO-based FPGA acceleration methods address the high power consumption and low speed of target detection on edge computing devices, but on-chip resources, bandwidth, and power consumption remain the biggest challenges for the FPGA; when the Winograd algorithm is introduced into FPGA acceleration, on-chip resources and bandwidth are well utilized while lower power consumption is maintained.
Designing FPGA accelerators for deep-learning target detection models is a hot topic in edge computing. However, existing accelerator design methods suffer from problems such as unreasonable on-chip resource allocation and high power consumption, so achieving efficient, low-power inference of a target detection model on an FPGA remains a very challenging technical task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a Winograd YOLOv2 target detection model method based on FPGA acceleration. An FPGA accelerator is designed for a Winograd-based YOLOv2 model: building on existing YOLOv2 model acceleration and Winograd algorithm acceleration (the Winograd algorithm optimizes the convolution over each convolution kernel to reduce the amount of computation), the method provides a Winograd-YOLO-based FPGA acceleration scheme that reduces the computational complexity of the YOLO algorithm, together with a storage-optimization algorithm for the FPGA accelerator that shortens the FPGA's computation time when accelerating the YOLO algorithm, speeding up target detection and effectively improving detection performance.
The invention adopts the Winograd algorithm to reduce the amount of computation in the convolution kernels, and proposes a new cache scheduling method, the cache pipeline (Buffer Pipeline), to reduce model inference time.
The technical scheme of the invention is as follows:
A Winograd YOLOv2 target detection method based on FPGA acceleration adopts a PYNQ (Python productivity for Zynq) board produced by Xilinx to cache the YOLO network model and the feature-map data of the image; a data path of the hardware accelerator is formed, target detection on the image to be detected is realized, the operation result of the acceleration circuit can be read out, and image preprocessing and display are carried out;
the main control chip ZYNQ7020 of the PYNQ board includes two parts, which are a PS (Processing System) terminal and a PL (Programmable Logic) terminal. The method comprises the steps that a PS end controls to cache a YOLO model and an image to be detected, then parameters of the YOLO model and the image to be detected are cached in an on-chip RAM (Random Access Memory) of a PYNQ board card at a PL end, a YOLO accelerator with a Winograd algorithm is designed and deployed, a cache pipeline is adopted in a scheduling strategy, accelerated operation of the model is completed, and a data path of the whole hardware accelerator is formed. Finally, reading out the operation result of the PL end model by using an AXI (Advanced eXtensible Interface) at the PS end, and carrying out image preprocessing and display at the PS end;
the Winograd YOLOv2 target detection model method based on FPGA acceleration specifically comprises the following steps:
A. training a target detection network model:
A YOLOv2 target detection network model (Molchanov V, Vishnyakov B V, Vizilter Y V, et al. Pedestrian detection in video surveillance using fully convolutional YOLO neural network [C]//SPIE Optical Metrology, 2017: 103340Q. DOI: 10.1117/12.2270326) is selected and trained to obtain the weights of the YOLOv2 target detection network model.
B. Performing low-bit fixed-point quantization (Low-Bit Fixed Point) on the trained YOLOv2 target detection network model from step A;
As shown in fig. 2, data in a computer are mostly stored as 32-bit floating point numbers, which comprise a sign bit (S), exponent bits (E), and mantissa bits (M); the exponent encodes the scale of the floating point number and the mantissa its significant digits. A fixed-point number differs from a floating-point number in that its decimal point is fixed, which greatly reduces the storage space in the FPGA and reduces the amount of computation. The specific process is as follows:
B1. Obtain the optimal fixed-point quantization for the YOLOv2 target detection network model:
The optimal fixed-point parameter (integer-bit width M_min) is determined by comparing the difference between the sums of squares of the network parameters before and after quantization, as shown in equation (1):
M_min = argmin_M | Σ (W_float)² − Σ (W'_float(bw, M))² |  (1)
where W_float is any weight parameter of a given layer of the YOLOv2 target detection network model, represented as its original floating-point value, and W'_float(bw, M) denotes W_float quantized to a fixed-point number under a given bit width bw and integer-bit width M, then converted back to a floating point number. The quantization of the bias parameter bias is similar and is not described in more detail here.
B2. Acquire the number of layers R of the YOLOv2 network, then execute step B3, repeating it R times (once per layer).
B3. Read the weights of the current layer of the YOLOv2 network, and perform fixed-point quantization on the weight and bias parameter values respectively to obtain the fixed-point model parameters; specifically, each 32-bit floating point number is converted into a 16-bit fixed-point number (1 sign bit, M integer bits, and 16 − M − 1 fractional bits).
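As an illustration of the quantization in steps B1-B3, the following Python sketch (an illustrative model only, not the patent's implementation; the helper names `to_fixed` and `best_M` are invented) quantizes a weight array to a 16-bit fixed-point format with 1 sign bit, M integer bits and 16 − M − 1 fractional bits, and selects M by the sum-of-squares criterion of equation (1):

```python
import numpy as np

def to_fixed(w, bw=16, M=4):
    """Quantize to signed fixed point with bw total bits: 1 sign bit,
    M integer bits and (bw - M - 1) fractional bits, then convert back
    to float (the W'_float of equation (1))."""
    frac_bits = bw - M - 1
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(np.asarray(w) * scale), lo, hi)
    return q / scale

def best_M(weights, bw=16):
    """Pick the integer-bit width M that minimizes the difference of the
    sums of squares before and after quantization, as in equation (1)."""
    err = lambda M: abs(np.sum(weights ** 2) - np.sum(to_fixed(weights, bw, M) ** 2))
    return min(range(bw - 1), key=err)

rng = np.random.default_rng(0)
w = rng.uniform(-0.9, 0.9, size=1000)  # stand-in for one layer's weights
M = best_M(w)
wq = to_fixed(w, 16, M)
```

The dequantized weights `wq` are what the PL end would effectively store; a larger M widens the representable range but coarsens the resolution.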
B4. Test the model with the fixed-point parameters obtained in step B3 to verify its accuracy.
B4.1 16492 images are randomly selected as a test set from the PASCAL VOC0712 (PASCAL: Pattern Analysis, Statistical Modelling and Computational Learning; VOC: Visual Object Classes) dataset.
B4.2 The fixed-point model parameters are loaded into the YOLOv2 target detection model, and forward inference is carried out.
B4.3 The mAP (mean average precision) of the model is calculated from the inference results.
C. An FPGA accelerator for YOLOv2 was designed.
The convolution layers involve complex computation on large amounts of data, so computation time is long and resource consumption is huge. A YOLOv2 convolution kernel with the Winograd algorithm is therefore designed at the PL end: during convolution, a large number of multiplications are replaced by additions implemented through the Winograd algorithm, reducing the multiplier resources consumed by convolution and lowering the FPGA's multiplier usage while maintaining high precision.
The Winograd algorithm is particularly effective at reducing the amount of computation for convolutions with small kernels. In the YOLOv2 algorithm the convolutions are 3 × 3 and 1 × 1, so the kernels are small and the Winograd algorithm is well suited to accelerating them. The Winograd algorithm F(m, r) computes an m-dimensional feature-map output of a convolution with kernel size r using m + r − 1 multiplications. Equation (2) shows the convolution computed with the Winograd minimal filtering algorithm for kernel size r = 3 and output dimension m = 2, where d_i denotes the input feature-map data in the image convolution, g_i the convolution kernel data, and m_i the intermediate products.
m_0 = (d_0 − d_2) · g_0
m_1 = (d_1 + d_2) · (g_0 + g_1 + g_2) / 2
m_2 = (d_2 − d_1) · (g_0 − g_1 + g_2) / 2
m_3 = (d_1 − d_3) · g_2
y_0 = m_0 + m_1 + m_2
y_1 = m_1 − m_2 − m_3  (2)
The Winograd algorithm takes m + r − 1 pixels of image data as input and outputs an m-dimensional vector; in equation (2), 4 pixels are input and a 2-dimensional vector is output. The algorithm performs 4 additions on the input data, 3 additions on the convolution kernel, and 4 additions on the products, so the number of additions increases, but the number of multiplications drops from the original 6 to 4. The Winograd algorithm thus replaces multiplications by additions (Liu X, Pool J, Han S, et al. Efficient Sparse-Winograd Convolutional Neural Networks [J]. 2018.), and it is used in the YOLOv2 accelerator as follows:
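The F(2, 3) minimal-filtering identities of equation (2) can be checked directly in a few lines of Python (a sketch for verification only, not accelerator code); `winograd_f2_3` uses 4 multiplications where the direct form uses 6:

```python
def winograd_f2_3(d, g):
    """Winograd minimal filtering F(2, 3): two outputs of a 3-tap filter
    over four inputs, using 4 multiplications instead of the direct 6."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_f2_3(d, g):
    """Reference: direct sliding-window computation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]  # m + r - 1 = 4 input pixels
g = [0.5, -1.0, 2.0]      # 3-tap kernel
```

The kernel-dependent factors (g_0 + g_1 + g_2)/2 and (g_0 − g_1 + g_2)/2 are computed once per kernel, which is why the extra additions cost little in practice.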
C1. Input transform: transform the feature-map data (convolution input In) taken from the buffer. Once m and r are determined, the values of the transformation matrices A, B and G are determined; with In the convolution input, the transformed feature matrix Transform(In) is obtained through equation (3):
Transform(In) = B^T In B  (3)
C2. Filter transform: with F the convolution kernel parameters, the kernel transformation result Transform(F) is obtained through equation (4):
Transform(F) = G^T F G  (4)
C3. Obtain the Winograd convolution result through the inverse transformation: with E the transform-domain product, the convolution calculation result Inverse_Transform(E) is obtained through equation (5):
Inverse_Transform(E) = A^T E A  (5)
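The three transforms of equations (3)-(5) compose into a 2-D tile computation. The sketch below uses the standard F(2×2, 3×3) transform matrices from the Winograd-convolution literature (an assumption, since the patent does not print its A, B, G; note the patent writes the filter transform as G^T F G, whereas the convention below is G f G^T, which differs only in how G is oriented):

```python
import numpy as np

# Standard F(2x2, 3x3) transform matrices; the patent's A, B, G play
# the same roles in equations (3)-(5).
B = np.array([[ 1,  0,  0,  0],
              [ 0,  1, -1,  1],
              [-1,  1,  1,  0],
              [ 0,  0,  0, -1]], dtype=float)
G = np.array([[ 1.0,  0.0, 0.0],
              [ 0.5,  0.5, 0.5],
              [ 0.5, -0.5, 0.5],
              [ 0.0,  0.0, 1.0]])
A = np.array([[ 1,  0],
              [ 1,  1],
              [ 1, -1],
              [ 0, -1]], dtype=float)

def winograd_2d(tile, f):
    """4x4 input tile, 3x3 kernel -> 2x2 output tile:
    U = B^T In B, V = G f G^T, Out = A^T (U . V) A (eqs. (3)-(5))."""
    U = B.T @ tile @ B        # input transform, eq. (3)
    V = G @ f @ G.T           # filter transform, eq. (4)
    return A.T @ (U * V) @ A  # element-wise product + inverse, eq. (5)

def direct_2d(tile, f):
    """Reference: direct 'valid' sliding-window computation."""
    return np.array([[np.sum(tile[i:i+3, j:j+3] * f) for j in range(2)]
                     for i in range(2)])

tile = np.arange(16, dtype=float).reshape(4, 4)
kern = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
```

A full layer would slide 4×4 tiles with stride 2 over the feature map and accumulate the 2×2 outputs over input channels, as the ACC unit of fig. 6 does.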
convolution module design of C4.YOLOv2 network model
C4.1 The flow of reading convolution operation data prepares for the YOLOv2 convolution; the convolution calculation data flow designed by the invention is shown in fig. 6:
The Input Feature Map entering the convolution layer operation is stored in an on-chip buffer, and the model parameter file obtained in step B3 is stored in the convolution buffer. Before the N feature maps enter the Winograd PE operation unit, they are unfolded into feature-map vectors and the vectors are grouped. In the Winograd operation unit, the feature-map vector and the convolution kernel undergo multiply-add operations to obtain the convolution result of each feature map; the accumulator (ACC) unit fuses the features, and the calculation result is stored in the Output Feature Map buffer, waiting to be read by the next convolution operation.
C4.2 construction of Winograd PE (Processing Element arithmetic Unit)
The Winograd PE designed by the invention is divided into three parts, which respectively transform the feature map and the convolution kernel entering the convolution unit and finally perform the operation; the internal design is shown in fig. 3. The process comprises three steps:
C4.2.1 Transform the feature-map data from the buffer: once m and r are determined, the transformation matrices A, B and G are determined, so the transformed input feature matrix U is obtained by equation (3);
C4.2.2 When the feature-map transformation is complete, take the convolution kernel parameters from the buffer and obtain the transformed kernel matrix V using equation (4);
C4.2.3 Pass the U and V matrices obtained above to the PE operation unit and perform the element-wise (dot product) operation to obtain the matrix M and thus the computed output. Here N denotes the number of input feature maps (channels), M denotes the number of output feature maps (channels), and H × H denotes the size of the convolution kernel.
When the feature-map and convolution-kernel data enter the PE operation unit for accelerated operation, they are unfolded and grouped. A conventional convolution is executed as six nested loops; after the Winograd algorithm is added, Loop-5 and Loop-6 (the loops over the kernel window) are eliminated, saving in the FPGA the multiplier consumption those loops would cause.
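The six-loop structure referred to above can be written out explicitly; the sketch below (illustrative Python with assumed shapes, not the HLS source) labels Loop-1 through Loop-6, the last two being the kernel-window loops that the Winograd transform folds away:

```python
import numpy as np

def conv_six_loops(inp, w):
    """The six nested loops of a conventional convolution layer.
    Loop-5 and Loop-6 iterate over the H x H kernel window; these are
    the loops the Winograd transform replaces with tile-level matrix
    operations, saving the multipliers they would consume."""
    N, R, C = inp.shape              # input maps, rows, cols
    M, _, H, _ = w.shape             # output maps, H x H kernels
    out = np.zeros((M, R - H + 1, C - H + 1))
    for m in range(M):                          # Loop-1: output maps
        for n in range(N):                      # Loop-2: input maps
            for r in range(R - H + 1):          # Loop-3: output rows
                for c in range(C - H + 1):      # Loop-4: output cols
                    for kr in range(H):         # Loop-5: kernel rows
                        for kc in range(H):     # Loop-6: kernel cols
                            out[m, r, c] += inp[n, r + kr, c + kc] * w[m, n, kr, kc]
    return out

rng = np.random.default_rng(0)
fmap = rng.normal(size=(2, 4, 4))     # N = 2 input feature maps
kern = rng.normal(size=(3, 2, 3, 3))  # M = 3 output maps, 3x3 kernels
res = conv_six_loops(fmap, kern)
```

In an HLS implementation each iteration of Loop-5/Loop-6 maps to a multiply-add, which is exactly the hardware the Winograd formulation saves.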
Cache optimization and specific time calculation of D.PL cache pipeline
D1. For FPGA acceleration, the invention first proposes the cache pipeline (Buffer Pipeline), which improves a single buffer set into a multi-buffer structure. The specific process is as follows:
D1.1 The logic part of the ZYNQ exchanges data with the CPU through the external DDR DRAM; data exchange between the DDR and the accelerator is controlled by the on-chip AXI bus.
D1.2 A FIFO interface is instantiated behind the AXI bus, ensuring that data flowing into and out of the accelerator operation unit can be transmitted at high speed and high frequency. A Buffer cache cluster is added at the input interface of the accelerator operation unit to hold data awaiting the feature-map and convolution-kernel transformations; the data cache pipeline architecture proposed by the invention is shown in fig. 4.
D1.3 At the data input of the accelerator, the input Buffer cache cluster (sets) is divided into several groups (e.g. Buffer_In1, Buffer_In2 and Buffer_In3), and the output Buffer cluster is divided correspondingly into several groups (e.g. Buffer_Out1, Buffer_Out2 and Buffer_Out3), forming the cache pipeline structure. While guaranteeing normal data interaction and transmission, the pipeline structure makes full use of each Buffer, so the storage capacity of every Buffer can be exploited to the maximum during each bus clock (CLK) period.
Through the steps, winograd YOLOv2 target detection based on FPGA acceleration is realized, and the target in the image to be detected is quickly obtained.
D2. Specific time calculation
D2.1 calculating total time consumption of FPGA for finishing one-time operation
Denote by T_in the time to load data into each Buffer, by T_co the time for the data in a Buffer to be processed once by the PE unit, by T_out the time to read the cache out of the Buffer after the acceleration unit finishes, and by T_task the time to complete one whole task. Let n be the number of tasks completed in the acceleration unit, with, in general, T_in ≠ T_co ≠ T_out (the conclusion is unaffected if the three times happen to be equal). Following the timing of the conventional access structure, the time T_sum to complete all tasks is given by equation (12):
T_sum = n × T_task = n × (T_in + T_co + T_out)  (12)
D2.2 obtaining improved pipeline storage optimization time
The Buffer Pipeline structure proposed by the invention improves the single Buffer set into a three-buffer structure and runs it as a three-stage pipeline (Buffer In, Compute, Buffer Out), so that
T_task = T_in + T_co + T_out
Since the whole task can be divided into three stages, the total time T_BP_sum consumed to complete n tasks is given by equation (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n − (3 − 1)] + max(T_co, T_out) + T_out  (13)
A timing chart of the conventional computation and of the proposed Buffer Pipeline structure is shown in fig. 7 for n = 3 tasks; the time taken to complete the tasks with the conventional method is given by equation (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out)  (14)
When buffering is performed with the Buffer Pipeline, the time taken to complete the whole task is given by equation (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out  (15)
By the properties of the maximum:
max(T_in, T_co) ≤ T_in + T_co,  max(T_co, T_out) ≤ T_co + T_out,  max(T_in, T_co, T_out) ≤ T_in + T_co + T_out  (16)
thus, there is T sum >T BP_sum It can be seen that the time T saved by the method proposed by the invention save As shown in equation (17).
T save =T sum -T BP_sum (17)
=n×(T in +T co +T out )-{T in +max(T in ,T co )+max(T in ,T co ,T out )×[n-(3-1)]+max(T co ,T out )+T out }
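The timing formulas (12), (13) and (17) can be checked numerically; the sketch below (function names invented, stage times assumed for illustration) reproduces the n = 3 case of fig. 7:

```python
def t_sequential(n, t_in, t_co, t_out):
    """Equation (12): n tasks, each running its three stages back to back."""
    return n * (t_in + t_co + t_out)

def t_buffer_pipeline(n, t_in, t_co, t_out):
    """Equation (13) (n >= 2): pipeline fill, n - 2 steady-state steps
    paced by the slowest stage, then pipeline drain."""
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - 2)
            + max(t_co, t_out) + t_out)

# The n = 3 case of fig. 7, with illustrative (assumed) stage times.
n, t_in, t_co, t_out = 3, 2.0, 5.0, 3.0
t_seq = t_sequential(n, t_in, t_co, t_out)      # 3 * (2 + 5 + 3) = 30
t_bp = t_buffer_pipeline(n, t_in, t_co, t_out)  # 2 + 5 + 5 + 5 + 3 = 20
t_save = t_seq - t_bp                           # equation (17): 10 saved
```

With these stage times the pipeline finishes 3 tasks in 20 time units instead of 30; in steady state the throughput is governed only by the slowest of the three stages.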
Compared with the prior art, the invention has the beneficial effects that:
(1) When the FPGA accelerates the YOLO algorithm, the Winograd algorithm is introduced into the YOLOv2 model. Because the YOLOv2 model contains a large number of convolution operations, when the convolutions are implemented with a high-level synthesis (HLS) tool, the many multiplications inside the loops are replaced by the additions implemented through the Winograd algorithm, reducing the multiplier resources consumed in computing convolutions and lowering the FPGA's multiplier usage while achieving a precision of 78.25%.
(2) To improve the efficiency of data caching and processing, the invention proposes a new cache scheduling method, the cache pipeline (Buffer Pipeline), which pipelines the data cache entering each convolution operation of the accelerator; timing analysis shows that it reduces the time required to finish the same computation task.
(3) A YOLOv2 accelerator based on the PYNQ framework is provided. Using the low power consumption and high parallelism of the ZYNQ-family FPGA, the convolution and pooling operations of every layer of YOLOv2 are accelerated; the data are fixed-point quantized, with the 32-bit floating-point weights converted to 16-bit data, reducing power consumption to 2.7 W and addressing the high power consumption of deep-learning target detection and recognition models on embedded devices.
Drawings
FIG. 1 is a flow chart of an accelerated optimization method of a YOLOv2 target detection model based on a PYNQ platform.
FIG. 2 is a schematic diagram of floating-point transformation to fixed-point transformation of model parameters;
wherein (a) is a 32-bits floating point number and (b) is a 16-bits fixed point number.
FIG. 3 is a schematic structural diagram of a YOLOv2 accelerator Winograd PE.
FIG. 4 is a schematic diagram of an internal structure of an accelerator based on cache pipeline optimization.
FIG. 5 is a diagram of network accuracy variation under different fixed-point conditions;
wherein (a) represents the change in size of the model YOLOv2, tiny-YOLO and Sim-YOLO under the 32-bit, 16-bit and 8-bit parameter types, respectively; (b) Representing the variation in precision of the model YOLOv2, tiny-YOLO and Sim-YOLO under the 32-bit, 16-bit and 8-bit parameter types, respectively.
FIG. 6 is a schematic view of the YOLOv2 accelerator data flow.
FIG. 7 shows the timing with and without the cache pipeline added to the accelerator's operation unit; the Buffer Pipeline method saves time when three tasks are executed. Buffer In, Compute and Buffer Out denote the three stages of completing a computation task.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The overall hardware architecture of the accelerator designed by the invention is shown in fig. 1. First, training of the YOLOv2 model is completed on the host computer using the VOC dataset (VOC2007 + 2012), with 16551 pictures randomly selected as the training set and 16492 pictures as the test set. The model fixed-point quantization task is then performed and the edge algorithm completed on the embedded side: the PS end integrates an ARM core running a Linux operating system, with the Python language environment preserved when the operating system is ported; the CPU (Central Processing Unit) controls all interfaces between the PS and the PL; through CPU scheduling, the accelerator loads the feature maps of the YOLO model into the DDR cache and interacts with the peripheral circuits of the operating system over the bus; the CPU reads the operation result of the acceleration circuit over the AXI bus, and image preprocessing and display are carried out at the PS end.
In the PL logic part, the data in the external DDR storage are cached into the on-chip RAM, and the convolution and pooling circuits of the YOLO accelerator are placed and routed in the FPGA. Finally, the hardware design bitstream file (Bitstream) and the design instruction file (Tcl) are passed to the Overlay of the operating system, where the hardware circuit and the IP-core operation circuit of YOLO are parsed, forming the data path of the whole hardware accelerator. FIG. 1 shows the overall architecture of the accelerator.
The invention is further described below by way of example according to the following steps:
1. Training of the YOLOv2 target detection model; Table 1 gives the YOLOv2 model parameter configuration.
Table 1 YOLOv2 model parameter configuration adopted in the embodiments of the present invention
(Table 1 appears as an image in the original document and is not reproduced here.)
In table 1, C represents a convolutional layer; m represents a pooling layer;
2. Perform low-bit fixed-point quantization (Low-Bit Fixed Point) on the YOLOv2 model from step 1, executing the following operations:
2.1 Obtain the best fixed-point quantization for the network by comparing, for each parameter, the difference of the parameter sums of squares before and after quantization to determine the best fixed-point parameter (M_min);
2.2 Obtain the number of layers R of the YOLOv2 network, and repeat the following step for each of the R layers;
2.3 Read the weights of the current layer, quantize the weight and bias respectively, and convert each 32-bit floating point number into a 16-bit fixed-point number comprising: 1 sign bit, M_min integer bits, and (16 − M_min − 1) fractional bits;
2.4 testing the fixed-point model, including the following processes;
2.4.1 Randomly select 16492 pictures from the VOC data set (VOC 2007 + 2012) as the test set.
2.4.2 Load the fixed-point model parameters into the YOLOv2 target detection model, complete the convolution, pooling and other operations, and complete the forward inference of the network.
2.4.3 Calculate the mAP (mean average precision) of the model from the inference results.
Data fixed-pointing also reduces the storage occupied by the network model. Compared with the full-precision model, 16-bit fixed-pointing reduces the size of the YOLOv2 model by 7×, and the sizes of Tiny-YOLO and Sim-YOLO by 4× and 4.2× respectively. With 8-bit fixed-pointing, the size of the YOLOv2 model is reduced by 20×, and the sizes of Tiny-YOLO and Sim-YOLO by 8× and 12× respectively. As can be seen from FIG. 5, 16-bit fixed-pointing preserves the accuracy of the YOLOv2 model while still reducing the model size.
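The fixed-point scheme of steps 2.1–2.3 can be sketched in a few lines of NumPy. This is an illustrative toy, not the patent's exact implementation: it assumes a simple uniform scheme with a 1-bit sign, M integer (order-code) bits and 16 − M − 1 fractional bits, and picks M_min by minimising the sum of squared quantization errors; the names `quantize` and `best_m` are invented for this sketch.

```python
import numpy as np

def quantize(weights, bw=16, m=4):
    """Quantize floats to bw-bit fixed point with m integer (order-code)
    bits and bw - m - 1 fractional bits, then de-quantize back to float."""
    frac_bits = bw - m - 1
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(weights * scale), lo, hi)  # saturating round
    return q / scale

def best_m(weights, bw=16):
    """Choose M_min by minimising the sum of squared quantization errors,
    as in step 2.1 (compare each candidate order-code width)."""
    errors = {m: float(np.sum((weights - quantize(weights, bw, m)) ** 2))
              for m in range(bw - 1)}
    return min(errors, key=errors.get)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=1024).astype(np.float32)  # stand-in layer weights
m_min = best_m(w)  # small weights favour few integer bits, many fractional bits
```

With small weights the search naturally trades integer range for fractional precision, which is why per-layer M_min (rather than one global split) keeps accuracy after 16-bit fixed-pointing.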
3. Designing an FPGA accelerator for YOLOv 2;
the YOLOv2 accelerator data flow is shown in fig. 6 and includes the following processes:
3.1 Input transform: converting the feature map data fetched from the buffer;
3.2, performing convolution kernel conversion (Filter transform) to obtain a convolution kernel conversion result;
3.3 obtaining a convolution result of Winograd through an inverse transformation function;
3.4 Designing the YOLOv2 convolution module, and constructing the Winograd PE operation unit;
3.4.1 flow of reading convolution operation data, to prepare for YOLOv2 convolution;
3.4.2 Transform the feature map data retrieved from the buffer; when m and r are determined, the values of the three transformation matrices A, B and G are determined, as shown in equation (18):
Out=A T [(GFG T )⊙(B T InB)]A (18)
3.4.2.1 Input transform: the U bits are convolved into the input, from which the transformed feature matrix U is obtained by equation (19):
U=B T InB (19)
3.4.2.2 Convolution kernel transform (Filter transform): with F the convolution kernel parameter, the result V of the convolution kernel transform is obtained by equation (20):
V=GFG T (20)
3.4.2.3 Transfer the U and V matrices obtained in steps 3.4.2.1 and 3.4.2.2 to the PE operation unit, and perform the dot-product operation according to equation (18) to obtain the output matrix Out.
And 4, storage optimization and specific time calculation of the PL cache pipeline comprise the following processes:
4.1 storage optimization steps for PL cache pipelines are as follows:
4.1.1 setting data exchange mode: the data exchange is carried out with the CPU through an external memory DDR DRAM, and the DDR is controlled by an on-chip bus AXI when the DDR exchanges data with an accelerator.
4.1.2 Instantiate a FIFO interface behind the AXI bus to ensure that data input to and output from the accelerator operation units can be transferred at high frequency and efficient speed. A Buffer cluster is added at the input interface of the accelerator operation unit to perform data format conversion and absorb waiting time.
4.1.3 In the accelerator input data section, the input buffer cluster (set) is divided into Buffer_In1, Buffer_In2 and Buffer_In3, and the output buffer cluster into Buffer_Out1, Buffer_Out2 and Buffer_Out3. While normal data interaction and transmission are ensured, the pipeline structure can fully exploit each Buffer, and the storage capacity of each Buffer is utilized to the maximum extent during the transition period of the clock bus CLK.
4.2FPGA computation time calculation
4.2.1 obtaining the total time consumption of FPGA to finish one-time operation
4.2.2 Obtain the improved pipeline storage-optimization time: the operations of reading the feature map, performing the convolution calculation, writing the feature map, etc. are pipelined so that several operations complete within the same clock period, where T_sum is the time required before optimization, T_BP_sum is the total time with pipelining, and T_save is the time saved, as shown in fig. 7.
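The timing model behind this buffer pipeline can be sketched directly from the formulas of claim 2: equation (12) for the sequential case, and the three-stage pipeline total for T_BP_sum (valid for n ≥ 2). The stage times below are made up for illustration:

```python
def t_sequential(n, t_in, t_co, t_out):
    # Equation (12): read, compute and write never overlap
    return n * (t_in + t_co + t_out)

def t_pipelined(n, t_in, t_co, t_out):
    # Three-stage pipeline (cf. equations (13)/(17)), n >= 2:
    # after the fill phase, one task finishes per slowest-stage interval
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - 2)
            + max(t_co, t_out) + t_out)

# Illustrative stage times (clock cycles): read 2, compute 5, write 3
n, t_in, t_co, t_out = 100, 2, 5, 3
saved = t_sequential(n, t_in, t_co, t_out) - t_pipelined(n, t_in, t_co, t_out)
# sequential: 1000 cycles; pipelined: 505 cycles; saved: 495 cycles
```

For large n the pipelined time approaches n × max(T_in, T_co, T_out), i.e. throughput is bounded by the slowest stage rather than by the sum of all three, which is exactly the benefit claimed for the multi-buffer structure.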
5. Overall YOLOv2 accelerator performance evaluation
The Winograd algorithm parameter used by the YOLOv2 convolution layers is F(2×2, 3×3). An improved experiment is carried out on the convolution: a YOLO acceleration IP core is generated and debugged in Vivado HLS, the hardware bit file and parameter file are generated in Block Design, and the operating system on the PS schedules the hardware logic and allocates acceleration resources. Data quantization is performed before the model parameters enter the FPGA, with the data uniformly quantized into 16-bit fixed-point data. In the final accelerated-platform test, the average time to process each picture is 124 ms, and the average detection precision is 78.25%.
Comparing the acceleration of the PYNQ-platform-based accelerator provided by the invention with other platforms, as shown in Table 2: relative to a GPU platform, the PYNQ-based accelerator loses no precision while greatly reducing power consumption. Compared with an accelerator implemented on a Zynq UltraScale+ platform, the number of adders increases after the Winograd algorithm is introduced, but the number of DSPs is clearly reduced and overall resource consumption drops. The accuracy improves in the experiment because the YOLOv2 network model selected by the invention is more accurate than simplified YOLO models such as Tiny YOLOv2.
Table 2 Performance of the hardware implementation of the YOLO model herein vs. other methods
[Table 2 appears as an image in the original document]
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of this disclosure and the appended claims. Therefore, the invention should not be limited by the disclosure of the embodiments, but should be defined by the scope of the appended claims.

Claims (8)

1. A Winograd YOLOv2 target detection method based on FPGA acceleration adopts a PYNQ board card, wherein a main control chip of the PYNQ board card comprises a processing system end PS and a programmable logic end PL; the PS end caches the YOLO model and the feature map data of the image to be detected; the PL end caches the parameters of the YOLO model and the image to be detected into on-chip RAM, deploys a YOLO accelerator with the Winograd algorithm, completes the model acceleration operation, forms the data path of the hardware accelerator, and realizes target detection of the image to be detected; the operation result of the acceleration circuit can be read out, and image preprocessing and display are performed;
the method comprises the following steps:
A. training a YOLOv2 target detection network model, and acquiring a weight value weight of the YOLOv2 target detection network model;
B. performing low-bit fixed-pointing on the trained YOLOv2 target detection network model of step A; the specific process is as follows:
B1. obtaining the optimal fixed-point quantization method of the YOLOv2 target detection network model: the optimal fixed-point parameter, namely the tail code M_min, is determined by comparing the differences of the sums of squares of all network parameters before and after quantization;
B2. Acquiring the network layer number R of a YOLOv2 target detection network model;
B3. acquiring the weight of each layer of the YOLOv2 network, and performing fixed-point processing on the weight value weight and the bias parameter value bias to obtain fixed-point model parameters;
B4. testing the current model parameters according to the fixed-point model parameters obtained in the step B3, and verifying the accuracy of the model;
C. designing an FPGA accelerator for YOLOv2, and using a method of replacing multiplication operation with addition by a Winograd algorithm in the accelerator for YOLOv2, wherein the method comprises the following steps:
designing a YOLOv2 convolution kernel with the Winograd algorithm at the PL end, converting a large number of multiplication operations into addition operations realized by the Winograd algorithm during convolution; the convolution operation is accelerated by the Winograd algorithm, wherein for a convolution kernel of size r the Winograd algorithm F(m, r) calculates an m-dimensional feature map output using m + r − 1 multiplications; the input of the Winograd algorithm is image data of m + r − 1 pixels, and the output is an m-dimensional vector; the method of replacing multiplication with addition by the Winograd algorithm is used in the YOLOv2 accelerator, comprising the following steps:
C1. transforming the characteristic diagram data obtained from the buffer by input transformation to obtain a transformed characteristic matrix Transform (In), wherein In is convolution input;
C2. obtaining a convolution kernel conversion result Transform (F) through convolution kernel conversion, wherein F is a convolution kernel parameter;
C3. obtaining a convolution calculation result Inverse _ Transform (E) of Winograd through an Inverse transformation function, wherein E is a convolution output result;
C4. a convolution module for designing a YOLOv2 network model, comprising:
c4.1 designing a convolution calculation data stream, and reading a flow of convolution calculation data;
c4.2, constructing a Winograd PE operation unit; dividing a Winograd PE operation unit into three parts, respectively transforming the characteristic diagram and the convolution kernel entering the convolution unit, and then performing operation; the method comprises the following steps:
c4.2.1 converting the characteristic diagram data from the buffer to obtain a converted characteristic matrix U;
c4.2.2 extracting the convolution kernel parameters stored in the buffer area, and obtaining a converted characteristic matrix V through conversion;
C4.2.3 transmitting the matrices U and V obtained in the above steps to the operation unit for the dot-product operation to obtain the matrix M and the output result, wherein M represents the number of output feature maps, namely the number of channels;
D. optimizing the storage of the PL cache pipeline;
D1. aiming at FPGA acceleration, a cache pipeline method is adopted to improve a single cache set into a multi-cache structure for FPGA acceleration; the process is as follows:
d1.1, in a logic part of ZYNQ, data interaction is carried out with a CPU through an external storage DDR DRAM; when the DDR exchanges data with the accelerator, the DDR is controlled by an on-chip bus AXI;
d1.2 instantiating a FIFO interface behind the AXI bus to enable data input and output to the accelerator operation unit to be transmitted at high speed and high frequency; adding a cache cluster at an input interface of an accelerator operation unit, converting the data into a format and waiting;
D1.3 in the data input part of the accelerator, the input cache cluster is divided into a plurality of parts, and the output cache cluster is correspondingly divided into a plurality of parts, forming a cache pipeline structure; while normal data interaction and transmission are ensured, each cache is fully utilized, and the storage capacity of each cache is utilized to the maximum extent in the transition period of the clock bus;
through the steps, winograd YOLOv2 target detection based on FPGA acceleration is realized, and the target in the image to be detected is quickly obtained.
2. The FPGA acceleration-based Winograd YOLOv2 target detection method as claimed in claim 1, wherein the total time consumed for the FPGA to complete one operation is calculated by the following method:
the input data time of each Buffer is recorded as T_in, the time for the data in a Buffer to enter the PE unit for operation is recorded as T_co, the time for taking the data out of the Buffer after the operation of the acceleration unit is finished is recorded as T_out, and the time for completing the whole task flow is recorded as T_task; let the number of tasks completed in the acceleration unit be n; the time T_sum for completing all tasks in the time sequence of the conventional access structure is represented by formula (12):
T_sum = n × T_task = n × (T_in + T_co + T_out)   (12)
the improved flow memory optimization time is calculated by adopting the following method:
improving the single cache set into a multi-cache structure, and pipelining the structure in three stages;
the total task is divided into three stages, and when n tasks are completed, the total time T_BP_sum consumed is represented by formula (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n − (3 − 1)] + max(T_co, T_out) + T_out   (13)
let the number of tasks n = 3; the time taken to complete the tasks is represented by formula (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out)   (14)
when the buffering process is performed by the cache pipeline, the time taken to complete the entire task is expressed by formula (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out   (15)
T_save = T_sum − T_BP_sum
       = n × (T_in + T_co + T_out) − {T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n − (3 − 1)] + max(T_co, T_out) + T_out}   (17)
the saved time T_save is represented by formula (17).
3. The FPGA-acceleration-based Winograd YOLOv2 target detection method as claimed in claim 1, wherein the operation result of the PL terminal model is read out through an AXI bus of the PS terminal, and image preprocessing and display are performed at the PS terminal.
4. The FPGA-acceleration-based Winograd YOLOv2 target detection method according to claim 1, wherein step B1 obtains the optimal fixed-point quantization method of the YOLOv2 target detection network model, specifically determining the optimal fixed-point quantization parameter, namely the tail code M_min, by comparing the differences of the sums of squares of the network parameters before and after quantization through formula (1):
M_min = argmin_M Σ |W_float − W(bw, M)|²   (1)
wherein W_float represents an arbitrary weight parameter of a layer, i.e. the original floating-point value, and W(bw, M) represents, given a bit width bw and an order code M, W_float converted to fixed point and back to the new floating-point number W'_float.
5. The FPGA-acceleration-based Winograd YOLOv2 target detection method according to claim 4, wherein step B3 reads the weights of the current layer of the YOLOv2 target detection network model and fixed-points the weight values and bias parameter values separately, specifically changing each 32-bit floating-point number into a 16-bit fixed-point number comprising a 1-bit sign bit, M_min order-code bits, and (16 − M_min − 1) tail bits.
6. The FPGA-acceleration-based Winograd YOLOv2 target detection method according to claim 1, wherein the step B4 of testing current model parameters and verifying the accuracy of the model comprises the following steps:
b4.1 randomly selecting 16492 images from the VOC data set as a test set;
b4.2, loading the fixed-point model parameters into a YOLOv2 target detection model, and carrying out forward reasoning on the model;
and B4.3, calculating the average precision of the model according to the inference result.
7. The FPGA-acceleration-based Winograd YOLOv2 target detection method according to claim 1, wherein step C designs an FPGA accelerator for YOLOv2, comprising: designing a YOLOv2 convolution kernel with the Winograd algorithm at the PL end and accelerating the convolution operation with the Winograd algorithm, wherein for a convolution kernel of size r the Winograd algorithm F(m, r) computes an m-dimensional feature map output using μ(F(m, r)) = m + r − 1 multiplications; with a convolution kernel size of 3 and a 2-dimensional output vector, the convolution operation is expressed by the Winograd minimal filtering algorithm as formula (2):
F(2, 3) = [d0 d1 d2; d1 d2 d3] [g0 g1 g2]^T = [m0 + m1 + m2; m1 − m2 − m3]   (2)
m0 = (d0 − d2) g0
m1 = (d1 + d2)(g0 + g1 + g2) / 2
m2 = (d2 − d1)(g0 − g1 + g2) / 2
m3 = (d1 − d3) g2
wherein d is i Representing input feature map data in image convolution operations, d i Representing convolution kernel data, m i Representing the output data; the input of the Winograd algorithm is image data of m + r < -1 > pixels, and the output is a vector of m dimension; in equation (2), 4 pixels of image data are input, and a 2-dimensional vector is output.
8. The FPGA acceleration-based Winograd YOLOv2 object detection method of claim 7, wherein step C1 transforms the feature map data fetched from the buffer by input conversion:
determining values of output conversion matrices A, B and G from the m and r values; specifically, a transformed feature matrix Transform (In) is obtained by formula (3):
Transform(In)=B T InB (3)
In step C2, a convolution kernel transformation result Transform (F) is obtained by formula (4):
Transform(F)=G T FG (4)
Inverse_Transform(E)=A T EA (5)
and step C3, specifically, obtaining a convolution calculation result Inverse _ Transform (E) through an Inverse transformation function by using the formula (5).