CN111459877A - FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method - Google Patents


Info

Publication number
CN111459877A
Authority
CN
China
Prior art keywords
winograd
target detection
data
model
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010254820.9A
Other languages
Chinese (zh)
Other versions
CN111459877B (en)
Inventor
于重重
鲍春
谢涛
常乐
冯文彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Original Assignee
Beijing Technology and Business University
CCTEG China Coal Technology and Engineering Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University, CCTEG China Coal Technology and Engineering Group Corp filed Critical Beijing Technology and Business University
Priority to CN202010254820.9A priority Critical patent/CN111459877B/en
Publication of CN111459877A publication Critical patent/CN111459877A/en
Application granted granted Critical
Publication of CN111459877B publication Critical patent/CN111459877B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a Winograd YOLOv2 target detection model method based on FPGA acceleration, which adopts a PYNQ board whose main control chip comprises a processing-system end (PS) and a programmable-logic end (PL). The PS end caches the YOLO model and the feature-map data of the image to be detected; the PL end caches the YOLO model parameters and the image to be detected in on-chip RAM, and a YOLO accelerator with the Winograd algorithm is deployed to complete the model acceleration operation, forming the data path of a hardware accelerator and realizing target detection of the image to be detected. The operation result of the acceleration circuit can be read out, and image preprocessing and display are carried out.

Description

FPGA (field programmable gate array) acceleration-based Winograd YOLOv2 target detection model method
Technical Field
The invention belongs to the technical field of computer vision and edge computing, and relates to a design method of an FPGA accelerator for a target detection model.
Background
Representative models are the single-shot multibox detector (SSD), Faster R-CNN, and the you-only-look-once (YOLO) network model series, among which the YOLO algorithm has the advantage of being both faster and accurate.
Most target detection and recognition models based on deep learning networks run on graphics processing units (GPUs); because a GPU contains a large number of parallel computing units, its performance advantage is most prominent in convolutional neural networks with many repeated multiply-add operations. However, edge computing must run on small, fast, low-power devices, requirements a GPU can hardly satisfy. Application-specific integrated circuits (ASICs) and FPGAs stand out in meeting edge-computing requirements, and FPGAs have three advantages: 1) high flexibility: an FPGA can execute any logic function an ASIC can execute, with the special advantage that the chip function can be changed at any time; 2) short development time: an FPGA can be programmed directly without tape-out; 3) low cost: since no tape-out is needed, an FPGA is more suitable than an ASIC for small-scale use.
Suda et al. proposed a fixed-point convolutional neural network acceleration design using the OpenCL framework, together with a systematic method to minimize execution time under given FPGA resource constraints (Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks [C]. Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016: 16-25.)
The OpenCL acceleration system designed by Aydonat et al. greatly improves performance by caching all intermediate features on chip and using the Winograd algorithm to reduce the multiply-accumulate operations of convolution (Ling A C, Aydonat U, O'Connell S, et al. Creating High Performance Applications with Intel's FPGA OpenCL SDK [C]. The 5th International Workshop on OpenCL. ACM, 2017.)
There have been many studies and results on FPGA acceleration of the YOLO model. Nguyen et al. used an RTL circuit to accelerate the YOLOv2 model, binarized the weight parameters in the network to reduce DSP consumption in FPGA acceleration, and reduced DRAM accesses through data multiplexing and dynamic random access, lowering power consumption (Nguyen D T, Nguyen T N, Kim H, et al. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019: 1-13.); Nakahara et al. combined a binary network and a support vector machine in a lightweight YOLOv2 model and achieved a good FPGA acceleration effect.
FPGA acceleration methods based on YOLO solve problems such as the high power consumption and low speed of target detection on edge computing devices, but on-chip resources, bandwidth, and power consumption remain the biggest challenges for the FPGA. When the Winograd algorithm is introduced into FPGA acceleration, on-chip resources and bandwidth are well utilized while lower power consumption is guaranteed.
The FPGA accelerator design method for deep learning target detection models is a hot topic in edge computing. However, existing accelerator design methods suffer from problems such as unreasonable on-chip resource allocation and high power consumption, so realizing high-efficiency, low-power inference of a target detection model on an FPGA is a very challenging technical task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a Winograd YOLOv2 target detection model method based on FPGA acceleration. An FPGA accelerator is designed for a Winograd-based YOLOv2 model: on the basis of existing YOLOv2 model acceleration and Winograd algorithm acceleration (the Winograd algorithm is used as a convolution optimization on the convolution kernel to reduce the amount of calculation), the invention realizes an FPGA accelerator design for the YOLO model, provides an FPGA acceleration method based on Winograd YOLO that reduces the computational complexity of the YOLO algorithm, and provides an FPGA accelerator storage optimization algorithm that shortens the computation time of the FPGA when accelerating the YOLO algorithm, thereby accelerating target detection and effectively improving target detection performance.
The invention adopts the Winograd algorithm as a convolution-kernel optimization to reduce the amount of calculation, and provides a new cache scheduling method, a cache pipeline (Buffer Pipeline), to reduce the model inference time.
The technical scheme of the invention is as follows:
An FPGA-accelerated Winograd YOLOv2 target detection method is characterized in that a PYNQ (Python productivity for Zynq) board produced by XILINX is adopted to cache the YOLO network model and image feature-map data and form the data path of a hardware accelerator, so that target detection of the image to be detected is realized; the operation result of the acceleration circuit can be read out, and image preprocessing and display are performed;
the PYNQ board main control chip ZYNQ7020 comprises two parts, namely a PS (Processing System) end and a P L (Programmable logic) end, wherein the PS end controls to cache a YO L O model and an image to be detected, then at the P L end, parameters of the YO L O model and the image to be detected are cached in an on-chip RAM (random Access memory) of the PYNQ board, a YO L O accelerator with a Winograd algorithm is designed and deployed, a scheduling strategy adopts a cache pipeline to finish the model acceleration operation and form a data path of the whole hardware accelerator, and finally, the operation result of the model at the P L end is read out by utilizing an AXI (Advanced eXtensible Interface) at the PS end, and the image is preprocessed and displayed at the PS end;
The Winograd YOLOv2 target detection model method based on FPGA acceleration specifically comprises the following steps:
A. training a target detection network model:
A YOLOv2 target detection network model (Molchanov V S, Vishnyakov B V, Vizilter Y V, et al. Pedestrian detection in video surveillance using fully convolutional YOLO neural network [C]//SPIE Optical Metrology. 2017: 103340Q. DOI: 10.1117/12.2270326) is selected and trained to obtain the weights of the YOLOv2 target detection network model.
B. Perform low-bit fixed-point quantization (Low-Bit Fixed Point) on the YOLOv2 target detection network model trained in step A;
As shown in fig. 2, data in a computer are mostly stored as 32-bit floating point numbers, which comprise a sign bit (S), order-code bits (the exponent, representing the integer scale of the floating point number), and tail bits (the mantissa, representing the fractional part). A fixed-point number differs from a floating point number in that its decimal point is fixed, which greatly reduces the storage space in the FPGA and reduces the amount of calculation. The specific process is as follows:
B1. Obtain the optimal fixed-point quantization for the YOLOv2 target detection network model:
The optimal fixed-point parameter (order code M_min) is determined by comparing the sums of squared differences of the network parameters before and after quantization, as shown in equation (1):
M_min = argmin_M Σ (W_float - W(bw, M))^2 (1)
wherein W_float denotes the original floating-point value of an arbitrary weight parameter of a certain layer of the YOLOv2 target detection network model, and W(bw, M) denotes the new floating-point value W'_float recovered from W_float after fixed-point conversion under the given bit width bw and order code M. The quantization of the bias parameter is similar and is not described in more detail here.
B2. Obtain the number of layers R of the YOLOv2 network, then execute step B3 and repeat it R times.
B3. Read the weights of the current layer of the YOLOv2 network and quantize the weight and bias parameters respectively to obtain the fixed-point model parameters; specifically, each 32-bit floating point number is changed into a 16-bit fixed point number (1 sign bit, M order-code bits, and 16-M-1 tail bits).
B4. Test the current model with the fixed-point model parameters obtained in step B3 and verify the accuracy of the model.
B4.1 16492 images were randomly selected as a test set from the PASCAL VOC0712 (PASCAL: Pattern Analysis, Statistical Modelling and Computational Learning; VOC: Visual Object Classes) data set.
B4.2 Load the fixed-point model parameters into the YOLOv2 target detection model and carry out forward model inference.
B4.3 Calculate the mAP (mean average precision) of the model from the inference results.
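As an illustrative sketch of step B (not the patent's actual implementation), the search of equation (1) can be written in Python; the function names, the fixed-point layout (1 sign bit, M order-code bits, the rest tail bits), and the Gaussian test weights are assumptions for this sketch:

```python
import numpy as np

def quantize(w, bw=16, M=4):
    """Round w onto a bw-bit fixed-point grid (1 sign bit, M order-code
    bits, bw-1-M tail bits) and return the recovered floating-point value."""
    frac_bits = bw - 1 - M
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bw - 1)), 2 ** (bw - 1) - 1
    q = np.clip(np.round(np.asarray(w) * scale), lo, hi)
    return q / scale

def best_M(weights, bw=16):
    """Equation (1): pick the order code M minimising the sum of squared
    differences between the weights before and after quantization."""
    errors = {M: float(np.sum((weights - quantize(weights, bw, M)) ** 2))
              for M in range(bw - 1)}
    return min(errors, key=errors.get)

# Hypothetical layer weights standing in for one YOLOv2 layer
w = np.random.default_rng(0).normal(0.0, 0.5, 1000)
M_min = best_M(w)
```

In practice the same search would be repeated per layer (step B2/B3), once for the weights and once for the biases.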
C. Design an FPGA accelerator for YOLOv2.
The convolution layers involve complex calculations on large amounts of data, so their computation time is long and they consume many computing resources. Therefore, a YOLOv2 convolution kernel with the Winograd algorithm is designed at the PL end: during the convolution operation, a large number of multiplications are replaced by additions realized by the Winograd algorithm, reducing the multiplier resources consumed in computing convolutions and lowering the multiplier utilization of the FPGA while maintaining high precision.
For the YOLOv2 algorithm, the convolutions used are all 3 × 3 and 1 × 1; the convolution kernels are small, so the Winograd algorithm is suitable for accelerating the convolution operation. A Winograd minimal filtering algorithm F(m, r) computes an m-dimensional feature-map output of a convolution kernel of size r using only m + r - 1 multiplications. Equation (2) shows the Winograd minimal filtering convolution for a 3-dimensional convolution kernel and a 2-dimensional output vector, where d_i denotes the input feature-map data in the image convolution operation, g_i denotes the convolution kernel data, and m_i denotes the output data.
F(2, 3): [y0, y1] = [m0 + m1 + m2, m1 - m2 - m3] (2)
wherein
m0 = (d0 - d2)g0
m1 = (d1 + d2)(g0 + g1 + g2)/2
m2 = (d2 - d1)(g0 - g1 + g2)/2
m3 = (d1 - d3)g2
In equation (2) the Winograd algorithm takes m + r - 1 pixels of image data as input and outputs an m-dimensional vector; here it takes 4 pixels of image data as input and outputs a 2-dimensional vector. The algorithm performs 4 additions on the input data, 3 additions on the convolution kernel, and 4 additions to combine the products, so the number of additions increases, but the number of multiplications is reduced from the original 6 to 4. It can be seen that the Winograd algorithm replaces multiplications by additions (Liu X, Pool J, Han S, et al.)
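Equation (2) can be transcribed directly and checked against the ordinary sliding-window convolution; the function name below is invented for this sketch:

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2, 3): the 2 outputs of a 1-D
    convolution of 4 inputs d with a 3-tap filter g, using only
    4 multiplications instead of 6, as in equation (2)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * (g0 + g1 + g2) / 2
    m2 = (d2 - d1) * (g0 - g1 + g2) / 2
    m3 = (d1 - d3) * g2
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Expanding the four products shows they reproduce y0 = d0·g0 + d1·g1 + d2·g2 and y1 = d1·g0 + d2·g1 + d3·g2 exactly; the extra additions on g can be precomputed once per kernel, which is why the savings matter on an FPGA.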
C1. Input transform: transform the feature-map data (convolution input In) taken from the buffer. Once m and r are determined, the values of the transformation matrices A, B and G are all determined, so the transformed feature matrix Transform(In) is obtained by equation (3):
Transform(In) = B^T In B (3)
C2. Convolution kernel transform (Filter transform), where F is the convolution kernel parameter; the convolution kernel transform result Transform(F) is obtained by equation (4):
Transform(F) = G F G^T (4)
C3. Obtain the convolution result of Winograd through the inverse transform, where E is the element-wise product of the transformed matrices; the convolution calculation result Inverse_Transform(E) is obtained by equation (5):
Inverse_Transform(E) = A^T E A (5)
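The chain of equations (3)-(5) can be checked with a short sketch. The F(2×2, 3×3) matrices below are the standard Winograd transform matrices, which the patent leaves implicit, and the function name is invented here:

```python
import numpy as np

# Standard transform matrices for F(2x2, 3x3); with m and r fixed,
# A, B and G are constants, as the text notes.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(inp_tile, filt):
    """One 2x2 output tile of a 3x3 convolution over a 4x4 input tile:
    16 multiplications in the element-wise product instead of 36."""
    V = B_T @ inp_tile @ B_T.T      # input transform, equation (3)
    U = G @ filt @ G.T              # filter transform, equation (4)
    return A_T @ (U * V) @ A_T.T    # inverse transform, equation (5)
```

The result matches the direct 3×3 sliding-window sum over the same 4×4 tile.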
C4. Convolution module design of the YOLOv2 network model
C4.1 Flow of reading convolution operation data, in preparation for the YOLOv2 convolution; the convolution calculation data flow designed by the invention is shown in FIG. 6:
The input feature map (Input Feature Map) entering the convolutional layer operation is stored in the on-chip cache (On-chip buffer), and the model parameter file obtained in step B3 is stored in the convolution cache. Before the N feature maps enter the Winograd PE operation unit, they are unfolded into feature-map vectors and the vectors are grouped. In the Winograd operation unit, the feature-map vectors and the convolution kernel undergo multiply-add operations to finally obtain the convolution result of each feature map; the features are fused by the accumulator (ACC) unit, and the calculation result is stored in the output feature map (Output Feature Map) cache region, waiting to be read by the convolution operation of the next stage.
C4.2 Construction of the Winograd PE (Processing Element operation unit)
The Winograd PE designed by the invention is divided into three parts, which respectively transform the feature map and the convolution kernel entering the convolution unit and finally carry out the operation; the internal design is shown in figure 3. The process can be divided into three steps:
C4.2.1 Transform the feature-map data obtained from the buffer; when m and r are determined, the values of the transformation matrices A, B and G are all determined, so equation (3) yields the transformed input feature matrix U;
C4.2.2 When the feature-map transform is completed, take out the convolution kernel parameters stored in the buffer and obtain the transformed convolution-kernel feature matrix V using equation (4);
C4.2.3 Transmit the U and V matrices obtained in the above steps to the PE operation unit; after the dot-product operation the M matrix is obtained, and finally the calculated output result, where N denotes the number of input feature maps (channels), M denotes the number of output feature maps (channels), and H × H denotes the size of the convolution kernel.
When the feature-map and convolution-kernel data enter the PE operation unit for accelerated operation, the feature-map data and convolution-kernel data are unfolded and grouped. A conventional convolution operation executes 6 nested loops; after the Winograd algorithm is added, Loop-5 and Loop-6 can be eliminated, saving in the FPGA the multiplier consumption brought by these loop operations.
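To make the loop count concrete, here is a sketch of conventional convolution as six nested loops; the loop labels follow the description above, while the function name and array shapes are assumptions of this sketch. Winograd tiling replaces the innermost kernel loops (Loop-5, Loop-6) with transformed element-wise products:

```python
import numpy as np

def conv_six_loops(inp, w):
    """Conventional convolution as six nested loops.
    inp: (N, H_in, W_in) input feature maps; w: (M, N, H, H) kernels."""
    M, N, H = w.shape[0], w.shape[1], w.shape[2]
    R = inp.shape[1] - H + 1
    C = inp.shape[2] - H + 1
    out = np.zeros((M, R, C))
    for m in range(M):                      # Loop-1: output feature maps
        for n in range(N):                  # Loop-2: input feature maps
            for r in range(R):              # Loop-3: output rows
                for c in range(C):          # Loop-4: output columns
                    for i in range(H):      # Loop-5: kernel rows (removed by Winograd)
                        for j in range(H):  # Loop-6: kernel cols (removed by Winograd)
                            out[m, r, c] += inp[n, r + i, c + j] * w[m, n, i, j]
    return out
```

Every iteration of Loop-5/Loop-6 costs one multiply; removing them is where the FPGA multiplier savings come from.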
D. Cache optimization and specific time calculation of the PL cache pipeline
D1. Aiming at FPGA acceleration, the invention first proposes a Buffer Pipeline caching method (a single buffer set is improved into a multi-buffer structure). The specific process is as follows:
D1.1 In the logic part of the ZYNQ, data are exchanged with the CPU through the external DDR DRAM, and the DDR is controlled by the on-chip AXI bus when exchanging data with the accelerator.
D1.2 Instantiate a FIFO interface behind the AXI bus to ensure that data input to and output from the accelerator operation unit can be transmitted at high speed and high frequency. A Buffer cache cluster is added at the input interface of the accelerator operation unit to await the feature-map and convolution-kernel transform operations; the data cache pipeline architecture proposed by the invention is shown in fig. 4.
D1.3 In the accelerator input-data part, divide the input Buffer cache cluster (sets) into several groups (such as Buffer_In1, Buffer_In2 and Buffer_In3) and correspondingly divide the output Buffer cluster into several groups (such as Buffer_Out1, Buffer_Out2 and Buffer_Out3), forming the cache pipeline structure.
Through the above steps, Winograd YOLOv2 target detection based on FPGA acceleration is realized, and the targets in the image to be detected are quickly obtained.
D2. Specific time calculation
D2.1 Calculate the total time for the FPGA to finish one pass of operations
The time to load data into each Buffer is recorded as T_in, the time for the data in a Buffer to enter the PE unit and be operated on is recorded as T_co, the time to take the data out of the Buffer after the acceleration unit finishes its operation is recorded as T_out, and the time to complete one whole task is recorded as T_task. Let the number of tasks completed in the acceleration unit be n, with T_in ≠ T_co ≠ T_out (if the three operation times were equal, the result would not be affected). With the timing of the conventional access structure, the time T_sum to complete all tasks is as shown in equation (12):
T_sum = n × T_task = n × (T_in + T_co + T_out) (12)
D2.2 Obtain the improved pipeline storage optimization time
The Buffer Pipeline structure proposed by the invention improves the single buffer set into a three-buffer structure and runs it as a three-stage pipeline. Since the total task can be divided into three stages, the total time T_BP_sum consumed to complete n tasks is as shown in equation (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out (13)
A timing chart of the conventional calculation and the proposed Buffer Pipeline structure is shown in fig. 7, where the number of tasks n is 3. The time taken to complete the tasks with the conventional calculation method is shown in equation (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out) (14)
When the buffering process uses the Buffer Pipeline, the time taken to complete the whole task is as shown in equation (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out (15)
By the properties of inequalities:
max(T_in, T_co) ≤ T_in + T_co, max(T_in, T_co, T_out) ≤ T_in + T_co + T_out, max(T_co, T_out) ≤ T_co + T_out (16)
Thus T_sum > T_BP_sum, and the time T_save saved by the method proposed by the invention is as shown in equation (17):
T_save = T_sum - T_BP_sum = n × (T_in + T_co + T_out) - {T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out} (17)
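Equations (12) and (13) are easy to check numerically; the sketch below transcribes them directly (function names are invented, and n ≥ 2 is assumed so the steady-state term is non-negative):

```python
def t_sum(n, t_in, t_co, t_out):
    """Equation (12): conventional schedule, the n tasks run strictly
    one after another with no overlap between stages."""
    return n * (t_in + t_co + t_out)

def t_bp_sum(n, t_in, t_co, t_out):
    """Equation (13): three-stage Buffer Pipeline schedule with a fill
    phase, n - (3 - 1) overlapped steady-state tasks, and a drain phase."""
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - (3 - 1))
            + max(t_co, t_out) + t_out)
```

For example, with n = 3 and (T_in, T_co, T_out) = (1, 2, 3) the conventional schedule takes 18 time units and the pipelined one 12, matching the T_sum > T_BP_sum conclusion of equation (16).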
Compared with the prior art, the invention has the beneficial effects that:
(1) When the FPGA accelerates the YOLO algorithm, the Winograd algorithm is introduced into the YOLOv2 model. Because a large number of convolution operations exist in the YOLOv2 model, when the convolution operation is realized with a high-level synthesis (HLS) tool, the many multiplications inside the loops are replaced by additions realized by the Winograd algorithm, reducing the multiplier resources consumed in computing convolutions and lowering the multiplier utilization of the FPGA while achieving a precision of 78.25%.
(2) In order to improve the efficiency of data caching and processing, the invention provides a new cache scheduling method, the cache pipeline (Buffer Pipeline), which applies pipeline optimization to the data cache entering the accelerator's convolution operation each time; timing analysis shows that it reduces the time required to finish the same calculation task.
(3) A YOLOv2 accelerator based on the PYNQ framework is provided. Using the low power consumption and high parallelism of the ZYNQ-family FPGA, the convolution and pooling operations of each layer of YOLOv2 are accelerated; the data are converted to fixed point, the 32-bit floating-point weights are quantized to 16-bit data, and the power consumption is reduced to 2.7 W, solving the problem of the high power consumption of deep learning target detection and recognition models on embedded devices.
Drawings
FIG. 1 is a flow chart of the accelerated optimization method of the YOLOv2 target detection model based on the PYNQ platform.
FIG. 2 is a schematic diagram of floating-point transformation to fixed-point transformation of model parameters;
wherein (a) is a 32-bits floating point number and (b) is a 16-bits fixed point number.
FIG. 3 is a schematic structural diagram of the YOLOv2 accelerator Winograd PE.
FIG. 4 is a schematic diagram of an internal structure of an accelerator based on cache pipeline optimization.
FIG. 5 is a diagram of network accuracy variation under different fixed-point conditions;
wherein (a) shows the size changes of the YOLOv2, Tiny-YOLO and Sim-YOLO models under the 32-bit, 16-bit and 8-bit parameter types respectively, and (b) shows the precision changes of the YOLOv2, Tiny-YOLO and Sim-YOLO models under the 32-bit, 16-bit and 8-bit parameter types respectively.
FIG. 6 is a schematic diagram of the YOLOv2 accelerator data flow.
FIG. 7 shows the timing difference between the accelerator operation unit without and with the cache pipeline, where the Buffer Pipeline method saves time when three tasks are executed; Buffer In, Compute and Buffer Out represent the three stages of completing a computational task.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The overall hardware architecture of the accelerator designed by the invention is shown in fig. 1. First, the training of the YOLOv2 model is completed on the host computer using the VOC data set (VOC2007+2012), with 16551 pictures randomly selected as the training set and 16492 pictures as the test set; then the model fixed-point task is performed, and the edge algorithm is completed on the embedded end. An ARM core is integrated at the PS end and runs a Linux operating system; the Python language environment is preserved when the operating system is ported, and the CPU controls all interfaces between the PS and the PL. Through CPU scheduling, the accelerator reads the feature maps of the YOLO model into the DDR cache, which interacts with the peripheral circuits of the operating system through the bus; the CPU reads the operation result of the acceleration circuit through the AXI bus, and image preprocessing and display are performed at the PS end.
In the PL logic part, the data in the external DDR storage are cached in the on-chip RAM, and the convolution and pooling circuits of the YOLO accelerator are placed and routed in the FPGA. Finally, the hardware design bitstream file (Bitstream) and the design instruction file (Tcl) are transmitted to the Overlay of the operating system, where the hardware circuit and the IP-core operation circuit of the YOLO are parsed, finally forming the data path of the whole hardware accelerator.
The invention is further described below by way of example according to the following steps:
1. Training of the YOLOv2 target detection model; Table 1 shows the parameter configuration of the YOLOv2 model.
Table 1 YOLOv2 model parameter configuration used in embodiments of the invention
[Table 1 is reproduced as an image in the original publication.]
In Table 1, C represents a convolutional layer and M represents a pooling layer.
2. Perform low-bit fixed-point quantization (Low-Bit Fixed Point) on the YOLOv2 model of step 1, executing the following operations:
2.1 Obtain the best fixed-point quantization of the network by comparing the difference between the sums of squares of each network parameter before and after quantization, determining the best fixed-point parameter (order code M_min);
2.2 Obtain the number of layers R of the YOLOv2 network and repeat step 2.3 R times;
2.3 Read the weights of the current layer and quantize the weight and bias respectively, changing each 32-bit floating point number into a 16-bit fixed point number comprising: 1 sign bit, M_min order-code bits, and (16 - M_min - 1) tail bits;
2.4 Test the fixed-point model, including the following processes:
2.4.1 16492 pictures were randomly selected from the VOC data set (VOC2007+2012) as the test set.
2.4.2 Load the fixed-point model parameters into the YOLOv2 target detection model, complete operations such as convolution and pooling, and complete the forward inference of the network.
2.4.3 Calculate the mAP (mean average precision) of the model from the inference results.
Fixed-point conversion also reduces the storage occupied by the network model. Compared with the full-precision model, under 16-bit quantization the size of the YOLOv2 model is reduced by 7×, and under 8-bit quantization it is reduced by 20×; the Tiny-YOLO and Sim-YOLO models are reduced by 8× and 12× respectively. As can be seen from FIG. 5, 16-bit quantization preserves the precision of the YOLOv2 model while also reducing the model size.
3. Design the FPGA accelerator for YOLOv2;
The YOLOv2 accelerator data flow, shown in fig. 6, comprises the following processes:
3.1 Input transform: converting the feature map data fetched from the buffer;
3.2 obtaining a convolution kernel conversion result by convolution kernel conversion (Filter transform);
3.3 obtaining a convolution result of Winograd through an inverse transformation function;
3.4 Design the YOLOv2 convolution module and construct the Winograd PE operation unit;
3.4.1 Flow of reading convolution operation data, in preparation for the YOLOv2 convolution;
3.4.2 Transforming the feature map data retrieved from the buffer, wherein the values of the transformation matrices A, B and G are all determined once m and r are fixed, as shown in equation (18):
Out = A^T [(G F G^T) ⊙ (B^T In B)] A    (18)
3.4.2.1 Input transform: where In is the convolution input, the transformed feature matrix U is obtained by equation (19):
U = B^T In B    (19)
3.4.2.2 Convolution kernel transform (Filter transform): where F is the convolution kernel parameter, the convolution kernel transform result V is obtained by equation (20):
V = G F G^T    (20)
3.4.2.3 Transmitting the U and V matrices obtained in steps 3.4.2.1 and 3.4.2.2 to the PE operation unit, where the element-wise product and inverse transform of equation (18) are carried out to obtain the output matrix Out.
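The Winograd F(2×2, 3×3) data path of steps 3.4.2.1-3.4.2.3 can be checked numerically with the standard minimal-filtering transform matrices. This is a NumPy sketch for verification only, not the accelerator's HLS code:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices; matches Eq. (18).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> 2x2 output tile.
    U, V and the element-wise product follow steps 3.4.2.1-3.4.2.3."""
    U = BT @ d @ BT.T           # input transform   (Eq. 19)
    V = G @ g @ G.T             # filter transform  (Eq. 20)
    return AT @ (U * V) @ AT.T  # inverse transform (Eq. 18)

def direct_conv_valid(d, g):
    """Reference 'valid' correlation for checking the Winograd result."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i+3, j:j+3] * g)
    return out
```

The element-wise product U * V uses 16 multiplications per 2×2 output tile, versus 36 for the direct method, which is the source of the DSP savings on the PL side.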
4. The storage optimization and timing calculation of the PL cache pipeline comprise the following processes:
4.1 The storage optimization steps for the PL cache pipeline are as follows:
4.1.1 Setting the data exchange mode: data is exchanged with the CPU through external DDR DRAM, and the DDR is controlled by the on-chip AXI bus when it exchanges data with the accelerator.
4.1.2 Instantiating a FIFO interface behind the AXI bus so that data entering and leaving the accelerator operation units can be transferred at high frequency and speed, and adding a Buffer cluster at the input interface of the accelerator operation unit for data format conversion and timing alignment.
4.1.3 In the accelerator input data part, dividing the input Buffer clusters (sets) into Buffer_In1, Buffer_In2 and Buffer_In3, and dividing the output Buffer clusters into Buffer_Out1, Buffer_Out2 and Buffer_Out3.
4.2 FPGA computation time calculation
4.2.1 Obtaining the total time consumed by the FPGA to complete one operation.
4.2.2 Obtaining the improved pipeline storage optimization time: pipelining the operations of reading the feature map, performing the convolution calculation, and writing the feature map, so that multiple operations complete within the same clock period, where T_sum is the time required before optimization, T_BP_sum is the total time with the pipeline, and T_save is the time saved, as shown in FIG. 7.
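The sequential and pipelined task times described in 4.2.1-4.2.2 can be sketched as a small timing model. This is an illustration of equations (12) and (17) of the claims under the three-stage (read / compute / write) pipeline assumption; the function names are hypothetical:

```python
def t_sum(n, t_in, t_co, t_out):
    """Sequential schedule (Eq. 12): every task pays all three stages."""
    return n * (t_in + t_co + t_out)

def t_bp_sum(n, t_in, t_co, t_out):
    """Three-stage buffered pipeline (cf. the braced term of Eq. 17):
    fill, then n-2 steady-state steps bounded by the slowest stage,
    then drain."""
    return (t_in + max(t_in, t_co)
            + max(t_in, t_co, t_out) * (n - 2)
            + max(t_co, t_out) + t_out)
```

For example, with n = 3 tasks and stage times T_in = 2, T_co = 5, T_out = 1, the sequential schedule needs 24 time units while the pipeline needs 18, saving 6; the saving grows linearly in n because the steady state pays only the slowest stage per task.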
5. Overall YOLOv2 accelerator performance evaluation
The Winograd algorithm parameter used by the convolutional layers of YOLOv2 is F(2 × 2, 3 × 3), and the convolution is improved accordingly. A YOLO acceleration IP core is generated in Vivado HLS, and the hardware bitstream and parameter files are generated in Block Design; the operating system on the PS schedules the hardware logic and allocates acceleration resources. Before the model parameters enter the FPGA, the data are quantized into 16-bit fixed-point values. In the final acceleration platform test, the average time to process each picture is 124 ms, and the mean average precision is 78.25%.
Compared with the acceleration of other platforms, as shown in Table 2, the PYNQ-based accelerator provided by the invention loses no accuracy relative to a GPU platform while consuming far less power.
TABLE 2 Performance of the hardware implementation of the YOLO model herein vs. other methods
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (8)

1. A Winograd YOLOv2 target detection method based on FPGA acceleration, adopting a PYNQ board whose main control chip comprises a processing system (PS) side and a programmable logic (PL) side, wherein the PS side caches the YOLO model and the feature map data of the image to be detected; the PL side caches the YOLO model parameters and the image to be detected in on-chip RAM, and deploys a YOLO accelerator with the Winograd algorithm to complete the model acceleration operation, forming the data path of the hardware accelerator and realizing target detection of the image to be detected; the PS side can also read the operation results of the acceleration circuit and perform image preprocessing and display;
the method comprises the following steps:
A. training a YOLOv2 target detection network model, and obtaining the weight values of the YOLOv2 target detection network model;
B. performing low-bit fixed-point quantization on the YOLOv2 target detection network model trained in step A, the specific process being as follows:
B1. obtaining the optimal fixed-point quantization scheme for the YOLOv2 target detection network model: comparing the difference of the sums of squares of each network parameter before and after quantization to determine the optimal fixed-point parameter, namely the tail code M_min;
B2. acquiring the number of network layers R of the YOLOv2 target detection network model;
B3. acquiring the weights of each layer of the YOLOv2 network, and performing fixed-point quantization on the weight values and the bias parameter values to obtain the fixed-point model parameters;
B4. testing the current model with the fixed-point model parameters obtained in B3, and verifying the accuracy of the model;
C. designing an FPGA accelerator for YOLOv2, in which the Winograd algorithm replaces multiplication operations with additions, comprising the following steps:
designing the YOLOv2 convolution kernel with the Winograd algorithm at the PL side, converting a large number of multiplication operations in the convolution into addition operations realized by the Winograd algorithm, and accelerating the convolution operation with the Winograd algorithm; the Winograd algorithm computes the m-dimensional feature map output of the convolution F(m, r) with kernel size r using m + r - 1 multiplications, i.e., it takes image data of m + r - 1 pixels as input and outputs an m-dimensional vector; using the Winograd algorithm in the YOLOv2 accelerator comprises the following steps:
C1. transforming the feature map data obtained from the buffer by the input transform to obtain the transformed feature matrix Transform(In), wherein In is the convolution input;
C2. obtaining the convolution kernel conversion result Transform(F) through the convolution kernel transform, wherein F is the convolution kernel parameter;
C3. obtaining the Winograd convolution calculation result Inverse_Transform(E) through the inverse transform function, wherein E is the convolution output result;
C4. designing the convolution module of the YOLOv2 network model, comprising:
C4.1 designing the convolution calculation data stream and the flow for reading convolution calculation data;
C4.2 constructing the Winograd PE operation unit, divided into three parts that respectively transform the feature map and the convolution kernel entering the convolution unit and then perform the operation, comprising:
C4.2.1 converting the feature map data obtained from the buffer to obtain the transformed feature matrix U;
C4.2.2 taking out the convolution kernel parameters stored in the buffer and obtaining the transformed matrix V;
C4.2.3 transmitting the matrices U and V obtained in the above steps to the operation unit for the dot product operation to obtain the matrix M and the output result, wherein M represents the number of output feature maps or channels;
D. PL cache pipeline storage optimization;
D1. for FPGA acceleration, adopting a cache pipeline method to improve the single cache set into a multi-cache structure; the process is as follows:
D1.1 in the logic part of the ZYNQ, carrying out data interaction with the CPU through external DDR DRAM; the DDR is controlled by the on-chip AXI bus when exchanging data with the accelerator;
D1.2 instantiating a FIFO interface behind the AXI bus so that data entering and leaving the accelerator operation unit is transmitted at high speed and high frequency; adding a cache cluster at the input interface of the accelerator operation unit for data format conversion and timing alignment;
D1.3 in the data input part of the accelerator, dividing the input cache cluster into several parts and correspondingly dividing the output cache cluster into several parts to form the cache pipeline structure; while ensuring normal data interaction and transmission, each cache is fully utilized, and the storage capacity of each cache is used to the maximum extent within each clock bus cycle;
through the above steps, Winograd YOLOv2 target detection based on FPGA acceleration is realized, and the target in the image to be detected is quickly obtained.
2. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein the total time consumed for the FPGA to complete one operation is calculated by the following method:
the input data time of each Buffer is recorded as T_in, the time for the data in the Buffer to enter the PE unit for operation is recorded as T_co, the time for taking the data out of the Buffer after the operation of the acceleration unit is finished is recorded as T_out, and the time for completing the whole task flow is recorded as T_task; setting the number of tasks completed in the acceleration unit as n, the time to complete all tasks in the conventional sequential access structure, T_sum, is represented by formula (12):
T_sum = n × T_task = n × (T_in + T_co + T_out)    (12)
the improved pipeline storage optimization time is calculated by the following method:
the single cache set is improved into a multi-cache structure, on which a three-stage pipeline is run; the total task is divided into three stages, and when n tasks are completed, the total time consumed, T_BP_sum, is represented by formula (13):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out    (13)
let the number of tasks n be 3; the time taken to complete the task sequentially is represented by formula (14):
T_sum = 3 × T_task = 3 × (T_in + T_co + T_out)    (14)
when the cache pipeline is used, the time taken to complete the entire task is represented by formula (15):
T_BP_sum = T_in + max(T_in, T_co) + max(T_in, T_co, T_out) + max(T_co, T_out) + T_out    (15)
the time saved, T_save, is represented by formula (17):
T_save = T_sum - T_BP_sum = n × (T_in + T_co + T_out) - {T_in + max(T_in, T_co) + max(T_in, T_co, T_out) × [n - (3 - 1)] + max(T_co, T_out) + T_out}    (17)
3. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein the operation result of the PL-side model is read out through the AXI bus of the PS side, and image preprocessing and display are performed at the PS side.
4. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step B1 obtains the optimal fixed-point quantization scheme of the YOLOv2 target detection network model; specifically, the optimal fixed-point quantization parameter, namely the tail code M_min, is determined by comparing the difference of the sums of squares of each network parameter before and after quantization through formula (1):
M_min = argmin_M Σ (W_float - W(bw, M))^2    (1)
wherein W_float represents any weight parameter of a layer, i.e. the original floating-point value, and W(bw, M) represents the new floating-point number W'_float obtained by converting W_float to fixed point and back to floating point, given the bit width bw and the order code M.
5. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 4, wherein step B3 reads the weights of the current layer of the YOLOv2 target detection network model and quantizes the weight values and the bias parameter values separately; specifically, each 32-bit floating-point number is changed into a 16-bit fixed-point number comprising a 1-bit sign bit, M_min order-code bits, and (16 - M_min - 1) tail bits.
6. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step B4 tests the current model parameters to verify the accuracy of the model, comprising the steps of:
B4.1 randomly selecting 16492 images from the VOC data set as the test set;
B4.2 loading the fixed-point model parameters into the YOLOv2 target detection model and carrying out forward model inference;
B4.3 calculating the mean average precision of the model from the inference results.
7. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 1, wherein step C designs the FPGA accelerator for YOLOv2 by designing the YOLOv2 convolution kernel with the Winograd algorithm at the PL side and accelerating the convolution operation with the Winograd algorithm; the Winograd algorithm computes the m-dimensional feature map output of the convolution F(m, r) with kernel size r using μ(F(m, r)) = m + r - 1 multiplications; for a 2-dimensional output vector and a 3-dimensional convolution kernel, the Winograd minimal filtering algorithm is expressed as formula (2):
F(2, 3) = [d0 d1 d2; d1 d2 d3] [g0; g1; g2] = [m0 + m1 + m2; m1 - m2 - m3]    (2)
m0 = (d0 - d2) g0
m1 = (d1 + d2)(g0 + g1 + g2) / 2
m2 = (d2 - d1)(g0 - g1 + g2) / 2
m3 = (d1 - d3) g2
wherein d_i represents the input feature map data in the image convolution operation, g_i represents the convolution kernel data, and m_i represents the intermediate output data; the input of the Winograd algorithm is image data of m + r - 1 pixels, and the output is an m-dimensional vector; in formula (2), 4 pixels of image data are input and a 2-dimensional vector is output.
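The scalar F(2, 3) equations of claim 7 can be checked directly: 4 multiplications (one per m_i) produce the 2 outputs that a direct 3-tap convolution would compute with 6 multiplications. A minimal sketch for verification (the function name is hypothetical):

```python
def winograd_f2_3(d, g):
    """F(2,3): 4 input pixels d and a 3-tap kernel g -> 2 outputs,
    using only the 4 multiplications m0..m3 of formula (2)."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m0 = (d0 - d2) * g0
    m1 = (d1 + d2) * (g0 + g1 + g2) / 2
    m2 = (d2 - d1) * (g0 - g1 + g2) / 2
    m3 = (d1 - d3) * g2
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Expanding the two outputs recovers the direct convolution, d0·g0 + d1·g1 + d2·g2 and d1·g0 + d2·g1 + d3·g2, confirming the substitution of additions for multiplications.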
8. The FPGA-acceleration-based Winograd YOLOv2 target detection method of claim 7, wherein step C1 transforms the feature map data fetched from the buffer through the input transform:
determining the values of the transformation matrices A, B and G from the values of m and r; specifically, the transformed feature matrix Transform(In) is obtained by formula (3):
Transform(In) = B^T In B    (3)
step C2 specifically obtains the convolution kernel conversion result Transform(F) by formula (4):
Transform(F) = G F G^T    (4)
step C3 specifically obtains the convolution calculation result Inverse_Transform(E) through the inverse transform function of formula (5):
Inverse_Transform(E) = A^T E A    (5)
CN202010254820.9A 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration Active CN111459877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010254820.9A CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010254820.9A CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Publications (2)

Publication Number Publication Date
CN111459877A true CN111459877A (en) 2020-07-28
CN111459877B CN111459877B (en) 2023-03-24

Family

ID=71684367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010254820.9A Active CN111459877B (en) 2020-04-02 2020-04-02 Winograd YOLOv2 target detection model method based on FPGA acceleration

Country Status (1)

Country Link
CN (1) CN111459877B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162942A (en) * 2020-09-30 2021-01-01 南京蕴智科技有限公司 Multi-modal image processing hardware acceleration system
CN112330524A (en) * 2020-10-26 2021-02-05 沈阳上博智像科技有限公司 Device and method for quickly realizing convolution in image tracking system
CN112418248A (en) * 2020-11-19 2021-02-26 江苏禹空间科技有限公司 Target detection method and system based on FPGA accelerator
CN113128831A (en) * 2021-03-11 2021-07-16 特斯联科技集团有限公司 People flow guiding method and device based on edge calculation, computer equipment and storage medium
CN113139519A (en) * 2021-05-14 2021-07-20 陕西科技大学 Target detection system based on fully programmable system on chip
CN113269726A (en) * 2021-04-29 2021-08-17 中国电子科技集团公司信息科学研究院 Hyperspectral image target detection method and device
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113392973A (en) * 2021-06-25 2021-09-14 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113592702A (en) * 2021-08-06 2021-11-02 厘壮信息科技(苏州)有限公司 Image algorithm accelerator, system and method based on deep convolutional neural network
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN113762483A (en) * 2021-09-16 2021-12-07 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113837054A (en) * 2021-09-18 2021-12-24 兰州大学 Railway crossing train recognition early warning system based on monocular vision
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN114662681A (en) * 2022-01-19 2022-06-24 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly
CN115392168A (en) * 2022-09-01 2022-11-25 北京工商大学 Boxing method for FPGA (field programmable Gate array) chips
CN115457363A (en) * 2022-08-10 2022-12-09 暨南大学 Image target detection method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN107993186A (en) * 2017-12-14 2018-05-04 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN110175670A (en) * 2019-04-09 2019-08-27 华中科技大学 A kind of method and system for realizing YOLOv2 detection network based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Jinsheng et al.: "Traffic sign recognition algorithm based on depthwise separable convolution", Chinese Journal of Liquid Crystals and Displays *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162942A (en) * 2020-09-30 2021-01-01 南京蕴智科技有限公司 Multi-modal image processing hardware acceleration system
CN112330524A (en) * 2020-10-26 2021-02-05 沈阳上博智像科技有限公司 Device and method for quickly realizing convolution in image tracking system
CN112418248A (en) * 2020-11-19 2021-02-26 江苏禹空间科技有限公司 Target detection method and system based on FPGA accelerator
CN112418248B (en) * 2020-11-19 2024-02-09 无锡禹空间智能科技有限公司 Target detection method and system based on FPGA accelerator
CN113128831A (en) * 2021-03-11 2021-07-16 特斯联科技集团有限公司 People flow guiding method and device based on edge calculation, computer equipment and storage medium
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113269726A (en) * 2021-04-29 2021-08-17 中国电子科技集团公司信息科学研究院 Hyperspectral image target detection method and device
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113392963B (en) * 2021-05-08 2023-12-19 北京化工大学 FPGA-based CNN hardware acceleration system design method
CN113139519A (en) * 2021-05-14 2021-07-20 陕西科技大学 Target detection system based on fully programmable system on chip
CN113139519B (en) * 2021-05-14 2023-12-22 陕西科技大学 Target detection system based on fully programmable system-on-chip
CN113392973A (en) * 2021-06-25 2021-09-14 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113392973B (en) * 2021-06-25 2023-01-13 广东工业大学 AI chip neural network acceleration method based on FPGA
CN113592702A (en) * 2021-08-06 2021-11-02 厘壮信息科技(苏州)有限公司 Image algorithm accelerator, system and method based on deep convolutional neural network
CN113744220A (en) * 2021-08-25 2021-12-03 中国科学院国家空间科学中心 PYNQ-based preselection-frame-free detection system
CN113744220B (en) * 2021-08-25 2024-03-26 中国科学院国家空间科学中心 PYNQ-based detection system without preselection frame
CN113762483A (en) * 2021-09-16 2021-12-07 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113762483B (en) * 2021-09-16 2024-02-09 华中科技大学 1D U-net neural network processor for electrocardiosignal segmentation
CN113837054A (en) * 2021-09-18 2021-12-24 兰州大学 Railway crossing train recognition early warning system based on monocular vision
CN113962361A (en) * 2021-10-09 2022-01-21 西安交通大学 Winograd-based data conflict-free scheduling method for CNN accelerator system
CN113962361B (en) * 2021-10-09 2024-04-05 西安交通大学 Winograd-based CNN accelerator system data conflict-free scheduling method
CN114662681A (en) * 2022-01-19 2022-06-24 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being deployed rapidly
CN114662681B (en) * 2022-01-19 2024-05-28 北京工业大学 YOLO algorithm-oriented general hardware accelerator system platform capable of being rapidly deployed
CN115457363B (en) * 2022-08-10 2023-08-04 暨南大学 Image target detection method and system
CN115457363A (en) * 2022-08-10 2022-12-09 暨南大学 Image target detection method and system
CN115392168A (en) * 2022-09-01 2022-11-25 北京工商大学 Boxing method for FPGA (field programmable Gate array) chips

Also Published As

Publication number Publication date
CN111459877B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN110163353B (en) Computing device and method
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN113792621B (en) FPGA-based target detection accelerator design method
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN113361695A (en) Convolutional neural network accelerator
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN114925780A (en) Optimization and acceleration method of lightweight CNN classifier based on FPGA
CN112001492A (en) Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
Yang et al. A Parallel Processing CNN Accelerator on Embedded Devices Based on Optimized MobileNet
Cheng Design and implementation of convolutional neural network accelerator based on fpga
Chen et al. Edge FPGA-based Onsite Neural Network Training
CN111047024A (en) Computing device and related product
CN113704172B (en) Transposed convolution and convolution accelerator chip design method based on systolic array
Liu et al. A Convolutional Computing Design Using Pulsating Arrays
CN113673704B (en) Relational network reasoning optimization method based on software and hardware cooperative acceleration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant