CN111459877A - FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method - Google Patents


Info

Publication number
CN111459877A
Authority
CN
China
Prior art keywords
winograd
target detection
data
model
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010254820.9A
Other languages
Chinese (zh)
Other versions
CN111459877B (en)
Inventor
于重重
鲍春
谢涛
常乐
冯文彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Original Assignee
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University, CCTEG China Coal Technology and Engineering Group Corp filed Critical Beijing Technology and Business University
Priority to CN202010254820.9A priority Critical patent/CN111459877B/en
Publication of CN111459877A publication Critical patent/CN111459877A/en
Application granted granted Critical
Publication of CN111459877B publication Critical patent/CN111459877B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a Winograd YOLOv2 target detection model method based on FPGA acceleration, which adopts a PYNQ board whose main control chip comprises a processing-system end (PS) and a programmable-logic end (PL). The PS end caches the YOLO model and the feature-map data of the image to be detected; the PL end caches the YOLO model parameters and the image to be detected in on-chip RAM, and a YOLO accelerator with the Winograd algorithm is deployed to complete the model acceleration operation, forming the data path of a hardware accelerator and realizing target detection of the image to be detected. The operation result of the acceleration circuit can be read out, and image preprocessing and display are carried out.

Description

FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method
Technical Field
The invention belongs to the technical field of computer vision and edge computing, and relates to a design method of an FPGA accelerator for a target detection model.
Background
Representative models are the single-shot multibox detector (SSD), Faster R-CNN, and the you-only-look-once (YOLO) network model series, among which the YOLO algorithm has the advantage of being both faster and accurate.
Most target detection and recognition models based on deep learning networks run on graphics processing units (GPUs); because a GPU contains a large number of parallel computing units, its performance advantage is most prominent in convolutional neural networks with many repeated multiply-add operations. However, edge computing must run on small, fast, low-power devices, requirements a GPU can hardly satisfy. Application-specific integrated circuits (ASICs) and FPGAs stand out in meeting edge-computing requirements, and FPGAs have three advantages: 1) high flexibility: an FPGA can execute any logic function an ASIC can execute, with the special advantage that the chip function can be changed at any time; 2) short development time: an FPGA can be programmed directly without tape-out; 3) low cost: since no tape-out is needed, an FPGA is more suitable than an ASIC for small-scale use.
Suda et al. proposed a fixed-point convolutional neural network acceleration design using the OpenCL framework, together with a systematic method to minimize execution time under given FPGA resource constraints (Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks [C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016: 16-25.)
The OpenCL acceleration system designed by Aydonat et al. greatly improves performance by caching all intermediate features on chip and using the Winograd algorithm to reduce the multiply-accumulate operations of convolution (Ling A C, Aydonat U, O'Connell S, et al. Creating High Performance Applications with Intel's FPGA OpenCL SDK [C]. The 5th International Workshop on OpenCL. ACM, 2017.)
There have been many studies and results on FPGA acceleration of the YOLO model. Nguyen et al. used an RTL circuit to accelerate the YOLOv2 model, binarized the weight parameters in the network to reduce DSP consumption in FPGA acceleration, and reduced DRAM accesses through data multiplexing and dynamic random access, lowering power consumption (Nguyen D T, Nguyen T N, Kim H, et al. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019: 1-13.); Nakahara et al. combined a binary network and a support vector machine in a lightweight YOLOv2 model and achieved a good FPGA acceleration effect.
FPGA acceleration methods based on YOLO solve problems such as the high power consumption and low speed of target detection on edge computing devices, but on-chip resources, bandwidth, and power consumption remain the biggest challenges for the FPGA. When the Winograd algorithm is introduced into FPGA acceleration, on-chip resources and bandwidth are well utilized while lower power consumption is guaranteed.
The FPGA accelerator design method for deep learning target detection models is a hot topic in edge computing. However, existing accelerator design methods suffer from problems such as unreasonable on-chip resource allocation and high power consumption, so realizing high-efficiency, low-power inference of a target detection model on an FPGA is a very challenging technical task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a Winograd YOLOv2 target detection model method based on FPGA acceleration. An FPGA accelerator is designed for a Winograd-based YOLOv2 model: on the basis of existing YOLOv2 model acceleration and Winograd algorithm acceleration (the Winograd algorithm is used as a convolution optimization on the convolution kernel to reduce the amount of calculation), the invention realizes an FPGA accelerator design for the YOLO model, provides an FPGA acceleration method based on Winograd YOLO that reduces the computational complexity of the YOLO algorithm, and provides an FPGA accelerator storage optimization algorithm that shortens the computation time of the FPGA when accelerating the YOLO algorithm, thereby accelerating target detection and effectively improving target detection performance.
The invention adopts the Winograd algorithm as a convolution-kernel optimization to reduce the amount of calculation, and provides a new cache scheduling method, a cache pipeline (Buffer Pipeline), to reduce the model inference time.
The technical scheme of the invention is as follows:
An FPGA-accelerated Winograd YOLOv2 target detection method is characterized in that a PYNQ (Python productivity for Zynq) board produced by XILINX is adopted to cache the YOLO network model and image feature-map data and form the data path of a hardware accelerator, so that target detection of the image to be detected is realized; the operation result of the acceleration circuit can be read out, and image preprocessing and display are performed;
the PYNQ board main control chip ZYNQ7020 comprises two parts, namely a PS (Processing System) end and a P L (Programmable logic) end, wherein the PS end controls to cache a YO L O model and an image to be detected, then at the P L end, parameters of the YO L O model and the image to be detected are cached in an on-chip RAM (random Access memory) of the PYNQ board, a YO L O accelerator with a Winograd algorithm is designed and deployed, a scheduling strategy adopts a cache pipeline to finish the model acceleration operation and form a data path of the whole hardware accelerator, and finally, the operation result of the model at the P L end is read out by utilizing an AXI (Advanced eXtensible Interface) at the PS end, and the image is preprocessed and displayed at the PS end;
The Winograd YOLOv2 target detection model method based on FPGA acceleration specifically comprises the following steps:
A. training a target detection network model:
A YOLOv2 target detection network model (Molchanov V S, Vishnyakov B V, Vizilter Y V, et al. Pedestrian detection in video surveillance using fully convolutional YOLO neural network [C]//SPIE Optical Metrology. 2017: 103340Q. DOI: 10.1117/12.2270326) is selected and trained to obtain the weights of the YOLOv2 target detection network model.
B. Perform low-bit fixed-point quantization (Low-Bit Fixed Point) on the YOLOv2 target detection network model trained in step A;
As shown in fig. 2, data in a computer are mostly stored as 32-bit floating point numbers, which comprise a sign bit (S), order-code bits (the exponent, representing the integer scale of the floating point number), and tail bits (the mantissa, representing the fractional part). A fixed-point number differs from a floating point number in that its decimal point is fixed, which greatly reduces the storage space in the FPGA and reduces the amount of calculation. The specific process is as follows:
B1. Obtain the optimal fixed-point quantization for the YOLOv2 target detection network model:
The optimal fixed-point parameter (order code M_min) is determined by comparing the sums of squared differences of the network parameters before and after quantization, as shown in equation (1):
M_min = argmin_M Σ (W_float - W(bw, M))^2 (1)
wherein W_float denotes the original floating-point value of an arbitrary weight parameter of a certain layer of the YOLOv2 target detection network model, and W(bw, M) denotes the new floating-point value W'_float recovered from W_float after fixed-point conversion under the given bit width bw and order code M. The quantization of the bias parameter is similar and is not described in more detail here.
B2. Obtain the number of layers R of the YOLOv2 network, then execute step B3 and repeat it R times.
B3. Read the weights of the current layer of the YOLOv2 network and quantize the weight and bias parameters respectively to obtain the fixed-point model parameters; specifically, each 32-bit floating point number is changed into a 16-bit fixed point number (1 sign bit, M order-code bits, and 16-M-1 tail bits).
B4. Test the current model with the fixed-point model parameters obtained in step B3 and verify the accuracy of the model.
B4.1 16492 images were randomly selected as a test set from the PASCAL VOC0712 (PASCAL: Pattern Analysis, Statistical Modelling and Computational Learning; VOC: Visual Object Classes) data set.
B4.2 Load the fixed-point model parameters into the YOLOv2 target detection model and carry out forward model inference.
B4.3 Calculate the mAP (mean average precision) of the model from the inference results.
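As an illustrative sketch of step B (not the patent's actual implementation), the search of equation (1) can be written in Python; the function names, the fixed-point layout (1 sign bit, M order-code bits, the rest tail bits), and the Gaussian test weights are assumptions for this sketch:

```python
import numpy as np

def quantize(w, bw=16, M=4):
    """Round w onto a bw-bit fixed-point grid (1 sign bit, M order-code
    bits, bw-1-M tail bits) and return the recovered floating-point value."""
    frac_bits = bw - 1 - M
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(np.asarray(w) * scale), lo, hi)
    return q / scale

def best_M(weights, bw=16):
    """Equation (1): pick the order code M minimising the sum of squared
    differences between the weights before and after quantization."""
    errors = {M: float(np.sum((weights - quantize(weights, bw, M)) ** 2))
              for M in range(bw - 1)}
    return min(errors, key=errors.get)

# Hypothetical layer weights standing in for one YOLOv2 layer
w = np.random.default_rng(0).normal(0.0, 0.5, 1000)
M_min = best_M(w)
```

In practice the same search would be repeated per layer (step B2/B3), once for the weights and once for the biases.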
C. Design an FPGA accelerator for YOLOv2.
The convolution layers involve complex calculations on large amounts of data, so their computation time is long and they consume many computing resources. Therefore, a YOLOv2 convolution kernel with the Winograd algorithm is designed at the PL end: during the convolution operation, a large number of multiplications are replaced by additions realized by the Winograd algorithm, reducing the multiplier resources consumed in computing convolutions and lowering the multiplier utilization of the FPGA while maintaining high precision.
For the YOLOv2 algorithm, the convolutions used are all 3 × 3 and 1 × 1; the convolution kernels are small, so the Winograd algorithm is suitable for accelerating the convolution operation. A Winograd minimal filtering algorithm F(m, r) computes an m-dimensional feature-map output of a convolution kernel of size r using only m + r - 1 multiplications. Equation (2) shows the Winograd minimal filtering convolution for a 3-dimensional convolution kernel and a 2-dimensional output vector, where d_i denotes the input feature-map data in the image convolution operation, g_i denotes the convolution kernel data, and m_i denotes the output data.
F(2, 3): [y0, y1] = [m0 + m1 + m2, m1 - m2 - m3] (2)
wherein
m0 = (d0 - d2)g0
m1 = (d1 + d2)(g0 + g1 + g2)/2
m2 = (d2 - d1)(g0 - g1 + g2)/2
m3 = (d1 - d3)g2
In equation (2) the Winograd algorithm takes m + r - 1 pixels of image data as input and outputs an m-dimensional vector; here it takes 4 pixels of image data as input and outputs a 2-dimensional vector. The algorithm performs 4 additions on the input data, 3 additions on the convolution kernel, and 4 additions to combine the products, so the number of additions increases, but the number of multiplications is reduced from the original 6 to 4. It can be seen that the Winograd algorithm replaces multiplications by additions (Liu X, Pool J, Han S, et al.)
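Equation (2) can be transcribed directly and checked against the ordinary sliding-window convolution; the function name below is invented for this sketch:

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2, 3): the 2 outputs of a 1-D
    convolution of 4 inputs d with a 3-tap filter g, using only
    4 multiplications instead of 6, as in equation (2)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * (g0 + g1 + g2) / 2
    m2 = (d2 - d1) * (g0 - g1 + g2) / 2
    m3 = (d1 - d3) * g2
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Expanding the four products shows they reproduce y0 = d0·g0 + d1·g1 + d2·g2 and y1 = d1·g0 + d2·g1 + d3·g2 exactly; the extra additions on g can be precomputed once per kernel, which is why the savings matter on an FPGA.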
C1. Input transform: transform the feature-map data (convolution input In) taken from the buffer. Once m and r are determined, the values of the transformation matrices A, B and G are all determined, so the transformed feature matrix Transform(In) is obtained by equation (3):
Transform(In) = B^T In B (3)
C2. Convolution kernel transform (Filter transform), where F is the convolution kernel parameter; the convolution kernel transform result Transform(F) is obtained by equation (4):
Transform(F) = G F G^T (4)
C3. Obtain the convolution result of Winograd through the inverse transform, where E is the element-wise product of the transformed matrices; the convolution calculation result Inverse_Transform(E) is obtained by equation (5):
Inverse_Transform(E) = A^T E A (5)
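The chain of equations (3)-(5) can be checked with a short sketch. The F(2×2, 3×3) matrices below are the standard Winograd transform matrices, which the patent leaves implicit, and the function name is invented here:

```python
import numpy as np

# Standard transform matrices for F(2x2, 3x3); with m and r fixed,
# A, B and G are constants, as the text notes.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(inp_tile, filt):
    """One 2x2 output tile of a 3x3 convolution over a 4x4 input tile:
    16 multiplications in the element-wise product instead of 36."""
    V = B_T @ inp_tile @ B_T.T      # input transform, equation (3)
    U = G @ filt @ G.T              # filter transform, equation (4)
    return A_T @ (U * V) @ A_T.T    # inverse transform, equation (5)
```

The result matches the direct 3×3 sliding-window sum over the same 4×4 tile.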
C4. Convolution module design of the YOLOv2 network model
C4.1 Flow of reading convolution operation data, in preparation for the YOLOv2 convolution; the convolution calculation data flow designed by the invention is shown in FIG. 6:
The input feature map (Input Feature Map) entering the convolutional layer operation is stored in the on-chip cache (On-chip buffer), and the model parameter file obtained in step B3 is stored in the convolution cache. Before the N feature maps enter the Winograd PE operation unit, they are unfolded into feature-map vectors and the vectors are grouped. In the Winograd operation unit, the feature-map vectors and the convolution kernel undergo multiply-add operations to finally obtain the convolution result of each feature map; the features are fused by the accumulator (ACC) unit, and the calculation result is stored in the output feature map (Output Feature Map) cache region, waiting to be read by the convolution operation of the next stage.
C4.2 Construction of the Winograd PE (Processing Element operation unit)
The Winograd PE designed by the invention is divided into three parts, which respectively transform the feature map and the convolution kernel entering the convolution unit and finally carry out the operation; the internal design is shown in figure 3. The process can be divided into three steps:
C4.2.1 Transform the feature-map data obtained from the buffer; when m and r are determined, the values of the transformation matrices A, B and G are all determined, so equation (3) yields the transformed input feature matrix U;
C4.2.2 When the feature-map transform is completed, take out the convolution kernel parameters stored in the buffer and obtain the transformed convolution-kernel feature matrix V using equation (4);
C4.2.3 Transmit the U and V matrices obtained in the above steps to the PE operation unit; after the dot-product operation the M matrix is obtained, and finally the calculated output result, where N denotes the number of input feature maps (channels), M denotes the number of output feature maps (channels), and H × H denotes the size of the convolution kernel.
When the feature-map and convolution-kernel data enter the PE operation unit for accelerated operation, the feature-map data and convolution-kernel data are unfolded and grouped. A conventional convolution operation executes 6 nested loops; after the Winograd algorithm is added, Loop-5 and Loop-6 can be eliminated, saving in the FPGA the multiplier consumption brought by these loop operations.
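To make the loop count concrete, here is a sketch of conventional convolution as six nested loops; the loop labels follow the description above, while the function name and array shapes are assumptions of this sketch. Winograd tiling replaces the innermost kernel loops (Loop-5, Loop-6) with transformed element-wise products:

```python
import numpy as np

def conv_six_loops(inp, w):
    """Conventional convolution as six nested loops.
    inp: (N, H_in, W_in) input feature maps; w: (M, N, H, H) kernels."""
    M, N, H = w.shape[0], w.shape[1], w.shape[2]
    R = inp.shape[1] - H + 1
    C = inp.shape[2] - H + 1
    out = np.zeros((M, R, C))
    for m in range(M):                      # Loop-1: output feature maps
        for n in range(N):                  # Loop-2: input feature maps
            for r in range(R):              # Loop-3: output rows
                for c in range(C):          # Loop-4: output columns
                    for i in range(H):      # Loop-5: kernel rows (removed by Winograd)
                        for j in range(H):  # Loop-6: kernel cols (removed by Winograd)
                            out[m, r, c] += inp[n, r + i, c + j] * w[m, n, i, j]
    return out
```

Every iteration of Loop-5/Loop-6 costs one multiply; removing them is where the FPGA multiplier savings come from.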
D. Cache optimization and specific time calculation of the PL cache pipeline
D1. Aiming at FPGA acceleration, the invention first proposes a Buffer Pipeline caching method (a single buffer set is improved into a multi-buffer structure). The specific process is as follows:
D1.1 In the logic part of the ZYNQ, data are exchanged with the CPU through the external DDR DRAM, and the DDR is controlled by the on-chip AXI bus when exchanging data with the accelerator.
D1.2 Instantiate a FIFO interface behind the AXI bus to ensure that data input to and output from the accelerator operation unit can be transmitted at high speed and high frequency. A Buffer cache cluster is added at the input interface of the accelerator operation unit to await the feature-map and convolution-kernel transform operations; the data cache pipeline architecture proposed by the invention is shown in fig. 4.
D1.3 In the accelerator input-data part, divide the input Buffer cache cluster (sets) into several groups (such as Buffer_In1, Buffer_In2 and Buffer_In3) and correspondingly divide the output Buffer cluster into several groups (such as Buffer_Out1, Buffer_Out2 and Buffer_Out3), forming the cache pipeline structure.
Through the above steps, Winograd YOLOv2 target detection based on FPGA acceleration is realized, and the targets in the image to be detected are quickly obtained.
D2. Specific time calculation
D2.1 Calculate the total time for the FPGA to finish one pass of operations
The time to load data into each Buffer is recorded as T_in, the time for the data in a Buffer to enter the PE unit and be operated on is recorded as T_co, the time to take the data out of the Buffer after the acceleration unit finishes its operation is recorded as T_out, and the time to complete one whole task is recorded as T_task. Let the number of tasks completed in the acceleration unit be n, with T_in ≠ T_co ≠ T_out (if the three operation times were equal, the result would not be affected). With the timing of the conventional access structure, the time T_sum to complete all tasks is as shown in equation (12):
T_sum = n × T_task = n × (T_in + T_co + T_out) (12)
D2.2 Obtain the improved pipeline storage optimization time
The Buffer Pipeline structure proposed by the invention improves the single buffer set into a three-buffer structure and runs it as a three-stage pipeline. Since the total task can be divided into three stages, the total time T_BP_sum consumed to complete n tasks is as shown in equation (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out (13)
A timing chart of the conventional calculation and the proposed Buffer Pipeline structure is shown in fig. 7, where the number of tasks n is 3. The time taken to complete the tasks with the conventional calculation method is shown in equation (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out) (14)
When the buffering process uses the Buffer Pipeline, the time taken to complete the whole task is as shown in equation (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out (15)
By the properties of inequalities:
max(T_in, T_co) ≤ T_in + T_co, max(T_in, T_co, T_out) ≤ T_in + T_co + T_out, max(T_co, T_out) ≤ T_co + T_out (16)
Thus T_sum > T_BP_sum, and the time T_save saved by the method proposed by the invention is as shown in equation (17):
T_save = T_sum - T_BP_sum = n × (T_in + T_co + T_out) - {T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out} (17)
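Equations (12) and (13) are easy to check numerically; the sketch below transcribes them directly (function names are invented, and n ≥ 2 is assumed so the steady-state term is non-negative):

```python
def t_sum(n, t_in, t_co, t_out):
    """Equation (12): conventional schedule, the n tasks run strictly
    one after another with no overlap between stages."""
    return n * (t_in + t_co + t_out)

def t_bp_sum(n, t_in, t_co, t_out):
    """Equation (13): three-stage Buffer Pipeline schedule with a fill
    phase, n - (3 - 1) overlapped steady-state tasks, and a drain phase."""
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - (3 - 1))
            + max(t_co, t_out) + t_out)
```

For example, with n = 3 and (T_in, T_co, T_out) = (1, 2, 3) the conventional schedule takes 18 time units and the pipelined one 12, matching the T_sum > T_BP_sum conclusion of equation (16).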
Compared with the prior art, the invention has the beneficial effects that:
(1) When the FPGA accelerates the YOLO algorithm, the Winograd algorithm is introduced into the YOLOv2 model. Because a large number of convolution operations exist in the YOLOv2 model, when the convolution operation is realized with a high-level synthesis (HLS) tool, the many multiplications inside the loops are replaced by additions realized by the Winograd algorithm, reducing the multiplier resources consumed in computing convolutions and lowering the multiplier utilization of the FPGA while achieving a precision of 78.25%.
(2) In order to improve the efficiency of data caching and processing, the invention provides a new cache scheduling method, the cache pipeline (Buffer Pipeline), which applies pipeline optimization to the data cache entering the accelerator's convolution operation each time; timing analysis shows that it reduces the time required to finish the same calculation task.
(3) A YOLOv2 accelerator based on the PYNQ framework is provided. Using the low power consumption and high parallelism of the ZYNQ-family FPGA, the convolution and pooling operations of each layer of YOLOv2 are accelerated; the data are converted to fixed point, the 32-bit floating-point weights are quantized to 16-bit data, and the power consumption is reduced to 2.7 W, solving the problem of the high power consumption of deep learning target detection and recognition models on embedded devices.
Drawings
FIG. 1 is a flow chart of the accelerated optimization method of the YOLOv2 target detection model based on the PYNQ platform.
FIG. 2 is a schematic diagram of floating-point transformation to fixed-point transformation of model parameters;
wherein (a) is a 32-bits floating point number and (b) is a 16-bits fixed point number.
FIG. 3 is a schematic structural diagram of the YOLOv2 accelerator Winograd PE.
FIG. 4 is a schematic diagram of an internal structure of an accelerator based on cache pipeline optimization.
FIG. 5 is a diagram of network accuracy variation under different fixed-point conditions;
wherein (a) shows the size changes of the YOLOv2, Tiny-YOLO and Sim-YOLO models under the 32-bit, 16-bit and 8-bit parameter types respectively, and (b) shows the precision changes of the YOLOv2, Tiny-YOLO and Sim-YOLO models under the 32-bit, 16-bit and 8-bit parameter types respectively.
FIG. 6 is a schematic diagram of the YOLOv2 accelerator data flow.
FIG. 7 shows the timing difference between the accelerator operation unit without and with the cache pipeline, where the Buffer Pipeline method saves time when three tasks are executed; Buffer In, Compute and Buffer Out represent the three stages of completing a computational task.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The overall hardware architecture of the accelerator designed by the invention is shown in fig. 1. First, the training of the YOLOv2 model is completed on the host computer using the VOC data set (VOC2007+2012), with 16551 pictures randomly selected as the training set and 16492 pictures as the test set; then the model fixed-point task is performed, and the edge algorithm is completed on the embedded end. An ARM core is integrated at the PS end and runs a Linux operating system; the Python language environment is preserved when the operating system is ported, and the CPU controls all interfaces between the PS and the PL. Through CPU scheduling, the accelerator reads the feature maps of the YOLO model into the DDR cache, which interacts with the peripheral circuits of the operating system through the bus; the CPU reads the operation result of the acceleration circuit through the AXI bus, and image preprocessing and display are performed at the PS end.
In the PL logic part, the data in the external DDR storage are cached in the on-chip RAM, and the convolution and pooling circuits of the YOLO accelerator are placed and routed in the FPGA. Finally, the hardware design bitstream file (Bitstream) and the design instruction file (Tcl) are transmitted to the Overlay of the operating system, where the hardware circuit and the IP-core operation circuit of the YOLO are parsed, finally forming the data path of the whole hardware accelerator.
The invention is further described below by way of example according to the following steps:
1. Training of the YOLOv2 target detection model; Table 1 shows the parameter configuration of the YOLOv2 model.
Table 1 YOLOv2 model parameter configuration used in embodiments of the invention
[Table 1 is reproduced as an image in the original publication.]
In Table 1, C represents a convolutional layer and M represents a pooling layer.
2. Perform low-bit fixed-point quantization (Low-Bit Fixed Point) on the YOLOv2 model of step 1, executing the following operations:
2.1 Obtain the best fixed-point quantization of the network by comparing the difference between the sums of squares of each network parameter before and after quantization, determining the best fixed-point parameter (order code M_min);
2.2 Obtain the number of layers R of the YOLOv2 network and repeat step 2.3 R times;
2.3 Read the weights of the current layer and quantize the weight and bias respectively, changing each 32-bit floating point number into a 16-bit fixed point number comprising: 1 sign bit, M_min order-code bits, and (16 - M_min - 1) tail bits;
2.4 Test the fixed-point model, including the following processes:
2.4.1 16492 pictures were randomly selected from the VOC data set (VOC2007+2012) as the test set.
2.4.2 Load the fixed-point model parameters into the YOLOv2 target detection model, complete operations such as convolution and pooling, and complete the forward inference of the network.
2.4.3 Calculate the mAP (mean average precision) of the model from the inference results.
Fixed-point conversion also reduces the storage occupied by the network model. Compared with the full-precision model, under 16-bit quantization the size of the YOLOv2 model is reduced by 7×, and under 8-bit quantization it is reduced by 20×; the Tiny-YOLO and Sim-YOLO models are reduced by 8× and 12× respectively. As can be seen from FIG. 5, 16-bit quantization preserves the precision of the YOLOv2 model while also reducing the model size.
3. Design the FPGA accelerator for YOLOv2;
The YOLOv2 accelerator data flow, shown in fig. 6, comprises the following processes:
3.1 Input transform: converting the feature map data fetched from the buffer;
3.2 obtaining a convolution kernel conversion result by convolution kernel conversion (Filter transform);
3.3 obtaining a convolution result of Winograd through an inverse transformation function;
3.4 Design the YOLOv2 convolution module and construct the Winograd PE operation unit;
3.4.1 Flow of reading convolution operation data, in preparation for the YOLOv2 convolution;
3.4.2 Transforming the feature map data retrieved from the buffer, wherein the values of the transformation matrices A, B and G are all determined once m and r are fixed, as shown in equation (18):
Out = A^T [(G F G^T) ⊙ (B^T In B)] A    (18)
3.4.2.1 Input transform: where In is the convolution input, the transformed feature matrix U is obtained by equation (19):
U = B^T In B    (19)
3.4.2.2 Convolution kernel transform (Filter transform): where F is the convolution kernel parameter, the convolution kernel transform result V is obtained by equation (20):
V = G F G^T    (20)
3.4.2.3 Transmitting the U and V matrices obtained in steps 3.4.2.1 and 3.4.2.2 to the PE operation unit, where the element-wise product and inverse transform of equation (18) are carried out to obtain the output matrix Out.
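The Winograd F(2×2, 3×3) data path of steps 3.4.2.1-3.4.2.3 can be checked numerically with the standard minimal-filtering transform matrices. This is a NumPy sketch for verification only, not the accelerator's HLS code:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices; matches Eq. (18).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> 2x2 output tile.
    U, V and the element-wise product follow steps 3.4.2.1-3.4.2.3."""
    U = BT @ d @ BT.T           # input transform   (Eq. 19)
    V = G @ g @ G.T             # filter transform  (Eq. 20)
    return AT @ (U * V) @ AT.T  # inverse transform (Eq. 18)

def direct_conv_valid(d, g):
    """Reference 'valid' correlation for checking the Winograd result."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i+3, j:j+3] * g)
    return out
```

The element-wise product U * V uses 16 multiplications per 2×2 output tile, versus 36 for the direct method, which is the source of the DSP savings on the PL side.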
4. The storage optimization and timing calculation of the PL cache pipeline comprise the following processes:
4.1 The storage optimization steps for the PL cache pipeline are as follows:
4.1.1 Setting the data exchange mode: data is exchanged with the CPU through external DDR DRAM, and the DDR is controlled by the on-chip AXI bus when it exchanges data with the accelerator.
4.1.2 Instantiating a FIFO interface behind the AXI bus so that data entering and leaving the accelerator operation units can be transferred at high frequency and speed, and adding a Buffer cluster at the input interface of the accelerator operation unit for data format conversion and timing alignment.
4.1.3 In the accelerator input data part, dividing the input Buffer clusters (sets) into Buffer_In1, Buffer_In2 and Buffer_In3, and dividing the output Buffer clusters into Buffer_Out1, Buffer_Out2 and Buffer_Out3.
4.2 FPGA computation time calculation
4.2.1 Obtaining the total time consumed by the FPGA to complete one operation.
4.2.2 Obtaining the improved pipeline storage optimization time: pipelining the operations of reading the feature map, performing the convolution calculation, and writing the feature map, so that multiple operations complete within the same clock period, where T_sum is the time required before optimization, T_BP_sum is the total time with the pipeline, and T_save is the time saved, as shown in FIG. 7.
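The sequential and pipelined task times described in 4.2.1-4.2.2 can be sketched as a small timing model. This is an illustration of equations (12) and (17) of the claims under the three-stage (read / compute / write) pipeline assumption; the function names are hypothetical:

```python
def t_sum(n, t_in, t_co, t_out):
    """Sequential schedule (Eq. 12): every task pays all three stages."""
    return n * (t_in + t_co + t_out)

def t_bp_sum(n, t_in, t_co, t_out):
    """Three-stage buffered pipeline (cf. the braced term of Eq. 17):
    fill, then n-2 steady-state steps bounded by the slowest stage,
    then drain."""
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - 2)
            + max(t_co, t_out) + t_out)
```

For example, with n = 3 tasks and stage times T_in = 2, T_co = 5, T_out = 1, the sequential schedule needs 24 time units while the pipeline needs 18, saving 6; the saving grows linearly in n because the steady state pays only the slowest stage per task.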
5. Overall YOLOv2 accelerator performance evaluation
The Winograd algorithm parameter used by the convolutional layers of YOLOv2 is F(2 × 2, 3 × 3), and the convolution is improved accordingly. A YOLO acceleration IP core is generated in Vivado HLS, and the hardware bitstream and parameter files are generated in Block Design; the operating system on the PS schedules the hardware logic and allocates acceleration resources. Before the model parameters enter the FPGA, the data are quantized into 16-bit fixed-point values. In the final acceleration platform test, the average time to process each picture is 124 ms, and the mean average precision is 78.25%.
Compared with the acceleration of other platforms, as shown in Table 2, the PYNQ-based accelerator provided by the invention loses no accuracy relative to a GPU platform while consuming far less power.
TABLE 2 Performance of the hardware implementation of the YOLO model herein vs. other methods
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (8)

1. A Winograd YOLOv2 target detection method based on FPGA acceleration, adopting a PYNQ board whose main control chip comprises a processing system (PS) side and a programmable logic (PL) side, wherein the PS side caches the YOLO model and the feature map data of the image to be detected; the PL side caches the YOLO model parameters and the image to be detected in on-chip RAM, and deploys a YOLO accelerator with the Winograd algorithm to complete the model acceleration operation, forming the data path of the hardware accelerator and realizing target detection of the image to be detected; the PS side can also read the operation results of the acceleration circuit and perform image preprocessing and display;
the method comprises the following steps:
A. training a YOLOv2 target detection network model, and obtaining the weight values of the YOLOv2 target detection network model;
B. performing low-bit fixed-point quantization on the YOLOv2 target detection network model trained in step A, the specific process being as follows:
B1. obtaining the optimal fixed-point quantization scheme for the YOLOv2 target detection network model: comparing the difference of the sums of squares of each network parameter before and after quantization to determine the optimal fixed-point parameter, namely the tail code M_min;
B2. acquiring the number of network layers R of the YOLOv2 target detection network model;
B3. acquiring the weights of each layer of the YOLOv2 network, and performing fixed-point quantization on the weight values and the bias parameter values to obtain the fixed-point model parameters;
B4. testing the current model with the fixed-point model parameters obtained in B3, and verifying the accuracy of the model;
C. designing an FPGA accelerator for YOLOv2, in which the Winograd algorithm replaces multiplication operations with additions, comprising the following steps:
designing the YOLOv2 convolution kernel with the Winograd algorithm at the PL side, converting a large number of multiplication operations in the convolution into addition operations realized by the Winograd algorithm, and accelerating the convolution operation with the Winograd algorithm; the Winograd algorithm computes the m-dimensional feature map output of the convolution F(m, r) with kernel size r using m + r - 1 multiplications, i.e., it takes image data of m + r - 1 pixels as input and outputs an m-dimensional vector; using the Winograd algorithm in the YOLOv2 accelerator comprises the following steps:
C1. transforming the feature map data obtained from the buffer by the input transform to obtain the transformed feature matrix Transform(In), wherein In is the convolution input;
C2. obtaining the convolution kernel conversion result Transform(F) through the convolution kernel transform, wherein F is the convolution kernel parameter;
C3. obtaining the Winograd convolution calculation result Inverse_Transform(E) through the inverse transform function, wherein E is the convolution output result;
C4. designing the convolution module of the YOLOv2 network model, comprising:
C4.1 designing the convolution calculation data stream and the flow for reading convolution calculation data;
C4.2 constructing the Winograd PE operation unit, divided into three parts that respectively transform the feature map and the convolution kernel entering the convolution unit and then perform the operation, comprising:
C4.2.1 converting the feature map data obtained from the buffer to obtain the transformed feature matrix U;
C4.2.2 taking out the convolution kernel parameters stored in the buffer and obtaining the transformed matrix V;
C4.2.3 transmitting the matrices U and V obtained in the above steps to the operation unit for the dot product operation to obtain the matrix M and the output result, wherein M represents the number of output feature maps or channels;
D. PL cache pipeline storage optimization;
D1. for FPGA acceleration, adopting a cache pipeline method to improve the single cache set into a multi-cache structure; the process is as follows:
D1.1 in the logic part of the ZYNQ, carrying out data interaction with the CPU through external DDR DRAM; the DDR is controlled by the on-chip AXI bus when exchanging data with the accelerator;
D1.2 instantiating a FIFO interface behind the AXI bus so that data entering and leaving the accelerator operation unit is transmitted at high speed and high frequency; adding a cache cluster at the input interface of the accelerator operation unit for data format conversion and timing alignment;
D1.3 in the data input part of the accelerator, dividing the input cache cluster into several parts and correspondingly dividing the output cache cluster into several parts to form the cache pipeline structure; while ensuring normal data interaction and transmission, each cache is fully utilized, and the storage capacity of each cache is used to the maximum extent within each clock bus cycle;
through the above steps, Winograd YOLOv2 target detection based on FPGA acceleration is realized, and the target in the image to be detected is quickly obtained.
2. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein the total time consumed for the FPGA to complete one operation is calculated by the following method:
the input data time of each Buffer is recorded as T_in, the time for the data in the Buffer to enter the PE unit for operation is recorded as T_co, the time for taking the data out of the Buffer after the operation of the acceleration unit is finished is recorded as T_out, and the time for completing the whole task flow is recorded as T_task; setting the number of tasks completed in the acceleration unit as n, the time to complete all tasks in the conventional sequential access structure, T_sum, is represented by formula (12):
T_sum = n × T_task = n × (T_in + T_co + T_out)    (12)
the improved pipeline storage optimization time is calculated by the following method:
the single cache set is improved into a multi-cache structure, on which a three-stage pipeline is run; the total task is divided into three stages, and when n tasks are completed, the total time consumed, T_BP_sum, is represented by formula (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out    (13)
let the number of tasks n be 3; the time taken to complete the task sequentially is represented by formula (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out)    (14)
when the cache pipeline is used, the time taken to complete the entire task is represented by formula (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out    (15)
the time saved, T_save, is represented by formula (17):
T_save = T_sum - T_BP_sum = n × (T_in + T_co + T_out) - {T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out}    (17)
3. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein the operation result of the PL-side model is read out through the AXI bus of the PS side, and image preprocessing and display are performed at the PS side.
4. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step B1 obtains the optimal fixed-point quantization scheme of the YOLOv2 target detection network model; specifically, the optimal fixed-point quantization parameter, namely the tail code M_min, is determined by comparing the difference of the sums of squares of each network parameter before and after quantization through formula (1):
M_min = argmin_M Σ (W_float - W(bw, M))^2    (1)
wherein W_float represents any weight parameter of a layer, i.e. the original floating-point value, and W(bw, M) represents the new floating-point number W'_float obtained by converting W_float to fixed point and back to floating point, given the bit width bw and the order code M.
5. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 4, wherein step B3 reads the weights of the current layer of the YOLOv2 target detection network model and quantizes the weight values and the bias parameter values separately; specifically, each 32-bit floating-point number is changed into a 16-bit fixed-point number comprising a 1-bit sign bit, M_min order-code bits, and (16 - M_min - 1) tail bits.
6. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step B4 tests the current model parameters to verify the accuracy of the model, comprising the steps of:
B4.1 randomly selecting 16492 images from the VOC data set as the test set;
B4.2 loading the fixed-point model parameters into the YOLOv2 target detection model and carrying out forward model inference;
B4.3 calculating the mean average precision of the model from the inference results.
7. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step C designs the FPGA accelerator for YOLOv2 by designing the YOLOv2 convolution kernel with the Winograd algorithm at the PL side and accelerating the convolution operation with the Winograd algorithm; the Winograd algorithm computes the m-dimensional feature map output of the convolution F(m, r) with kernel size r using μ(F(m, r)) = m + r - 1 multiplications; for a 2-dimensional output vector and a 3-dimensional convolution kernel, the Winograd minimal filtering algorithm is expressed as formula (2):
F(2, 3) = [d0 d1 d2; d1 d2 d3] [g0; g1; g2] = [m0 + m1 + m2; m1 - m2 - m3]    (2)
m0 = (d0 - d2) g0
m1 = (d1 + d2)(g0 + g1 + g2) / 2
m2 = (d2 - d1)(g0 - g1 + g2) / 2
m3 = (d1 - d3) g2
wherein d_i represents the input feature map data in the image convolution operation, g_i represents the convolution kernel data, and m_i represents the intermediate output data; the input of the Winograd algorithm is image data of m + r - 1 pixels, and the output is an m-dimensional vector; in formula (2), 4 pixels of image data are input and a 2-dimensional vector is output.
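The scalar F(2, 3) equations of claim 7 can be checked directly: 4 multiplications (one per m_i) produce the 2 outputs that a direct 3-tap convolution would compute with 6 multiplications. A minimal sketch for verification (the function name is hypothetical):

```python
def winograd_f2_3(d, g):
    """F(2,3): 4 input pixels d and a 3-tap kernel g -> 2 outputs,
    using only the 4 multiplications m0..m3 of formula (2)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * (g0 + g1 + g2) / 2
    m2 = (d2 - d1) * (g0 - g1 + g2) / 2
    m3 = (d1 - d3) * g2
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Expanding the two outputs recovers the direct convolution, d0·g0 + d1·g1 + d2·g2 and d1·g0 + d2·g1 + d3·g2, confirming the substitution of additions for multiplications.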
8. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 7, wherein step C1 transforms the feature map data fetched from the buffer through the input transform:
determining the values of the transformation matrices A, B and G from the values of m and r; specifically, the transformed feature matrix Transform(In) is obtained by formula (3):
Transform(In) = B^T In B    (3)
step C2 specifically obtains the convolution kernel conversion result Transform(F) by formula (4):
Transform(F) = G F G^T    (4)
step C3 specifically obtains the convolution calculation result Inverse_Transform(E) through the inverse transform function of formula (5):
Inverse_Transform(E) = A^T E A    (5)
CN202010254820.9A 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration Active CN111459877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010254820.9A CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010254820.9A CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Publications (2)

Publication Number Publication Date
CN111459877A true CN111459877A (en) 2020-07-28
CN111459877B CN111459877B (en) 2023-03-24

Family

ID=71684367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010254820.9A Active CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Country Status (1)

Country Link
CN (1) CN111459877B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162942A (en) * 2020-09-30 2021-01-01 南京蕴智科技有限公司 Multi-modal image processing hardware acceleration system
CN112330524A (en) * 2020-10-26 2021-02-05 沈阳上博智像科技有限公司 Device and method for quickly realizing convolution in image tracking system
CN112418248A (en) * 2020-11-19 2021-02-26 江苏禹空间科技有限公司 Target detection method and system based on FPGA accelerator
CN113128831A (en) * 2021-03-11 2021-07-16 特斯联科技集团有限公司 People flow guiding method and device based on edge calculation, computer equipment and storage medium
CN113139519A (en) * 2021-05-14 2021-07-20 陕西科技大学 Target detection system based on fully programmable system on chip
CN113269726A (en) * 2021-04-29 2021-08-17 中国电子科技集团公司信息科学研究院 Hyperspectral image target detection method and device
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113392973A (en) * 2021-06-25 2021-09-14 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113592702A (en) * 2021-08-06 2021-11-02 厘壮信息科技(苏州)有限公司 Image algorithm accelerator, system and method based on deep convolutional neural network
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN113762483A (en) * 2021-09-16 2021-12-07 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113837054A (en) * 2021-09-18 2021-12-24 兰州大学 Railway crossing train recognition early warning system based on monocular vision
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN114662681A (en) * 2022-01-19 2022-06-24 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly
CN115392168A (en) * 2022-09-01 2022-11-25 北京工商大学 Boxing method for FPGA (field programmable Gate array) chips
CN115457363A (en) * 2022-08-10 2022-12-09 暨南大学 Image target detection method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Jinsheng et al.: "Traffic sign recognition algorithm based on depthwise separable convolution", Chinese Journal of Liquid Crystals and Displays *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162942A (en) * 2020-09-30 2021-01-01 南京蕴智科技有限公司 Multi-modal image processing hardware acceleration system
CN112330524A (en) * 2020-10-26 2021-02-05 沈阳上博智像科技有限公司 Device and method for quickly realizing convolution in image tracking system
CN112418248A (en) * 2020-11-19 2021-02-26 江苏禹空间科技有限公司 Target detection method and system based on FPGA accelerator
CN112418248B (en) * 2020-11-19 2024-02-09 无锡禹空间智能科技有限公司 Target detection method and system based on FPGA accelerator
CN113128831A (en) * 2021-03-11 2021-07-16 特斯联科技集团有限公司 People flow guiding method and device based on edge calculation, computer equipment and storage medium
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113269726A (en) * 2021-04-29 2021-08-17 中国电子科技集团公司信息科学研究院 Hyperspectral image target detection method and device
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113392963B (en) * 2021-05-08 2023-12-19 北京化工大学 FPGA-based CNN hardware acceleration system design method
CN113139519A (en) * 2021-05-14 2021-07-20 陕西科技大学 Target detection system based on fully programmable system on chip
CN113139519B (en) * 2021-05-14 2023-12-22 陕西科技大学 Target detection system based on fully programmable system-on-chip
CN113392973A (en) * 2021-06-25 2021-09-14 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113392973B (en) * 2021-06-25 2023-01-13 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113592702A (en) * 2021-08-06 2021-11-02 厘壮信息科技(苏州)有限公司 Image algorithm accelerator, system and method based on deep convolutional neural network
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN113744220B (en) * 2021-08-25 2024-03-26 中国科学院国家空间科学中心 PYNQ-based detection system without preselection frame
CN113762483A (en) * 2021-09-16 2021-12-07 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113762483B (en) * 2021-09-16 2024-02-09 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113837054A (en) * 2021-09-18 2021-12-24 兰州大学 Railway crossing train recognition early warning system based on monocular vision
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN113962361B (en) * 2021-10-09 2024-04-05 西安交通大学 Winograd-based CNN accelerator system data conflict-free scheduling method
CN114662681A (en) * 2022-01-19 2022-06-24 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly
CN114662681B (en) * 2022-01-19 2024-05-28 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed
CN115457363B (en) * 2022-08-10 2023-08-04 暨南大学 Image target detection method and system
CN115457363A (en) * 2022-08-10 2022-12-09 暨南大学 Image target detection method and system
CN115392168A (en) * 2022-09-01 2022-11-25 北京工商大学 Boxing method for FPGA (field programmable Gate array) chips

Also Published As

Publication number Publication date
CN111459877B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN110163353B (en) Computing device and method
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN113792621B (en) FPGA-based target detection accelerator design method
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN113361695A (en) Convolutional neural network accelerator
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN114925780A (en) Optimization and acceleration method of lightweight CNN classifier based on FPGA
CN112001492A (en) Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
Yang et al. A Parallel Processing CNN Accelerator on Embedded Devices Based on Optimized MobileNet
Cheng Design and implementation of convolutional neural network accelerator based on fpga
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN111047024A (en) Computing device and related product
CN113704172B (en) Transposed convolution and convolution accelerator chip design method based on systolic array
Liu et al. A Convolutional Computing Design Using Pulsating Arrays
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant