CN115329951B - FPGA architecture for convolutional neural network fast convolutional operation - Google Patents

FPGA architecture for convolutional neural network fast convolutional operation

Info

Publication number: CN115329951B
Application number: CN202211112093.8A
Authority: CN (China)
Prior art keywords: conversion module, winograd, fpga, output, module
Legal status: Active (granted)
Other versions: CN115329951A
Other languages: Chinese (zh)
Inventors: 李皓辰, 余乐, 关文洋, 于重重
Assignee (original and current): Beijing Technology and Business University
Filing date: 2022-09-13 (application CN202211112093.8A filed by Beijing Technology and Business University)
Publication of CN115329951A: 2022-11-11
Application granted; publication of CN115329951B: 2023-09-15

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods


Abstract

The invention relates to an FPGA architecture for fast convolution operations of convolutional neural networks, and belongs to the technical field of FPGA architectures. The architecture comprises a plurality of Winograd hard-core computing units arranged in the FPGA in a loose layout. Each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers, and an output conversion module. The input ends of the weight conversion module and the image data conversion module receive data; their output ends feed the dot-multiplication module; the output end of the dot-multiplication module feeds the input end of the output conversion module; and the output conversion module outputs the result. In the loose layout, logic blocks (LBs) of the FPGA are placed between the Winograd hard-core computing units as spacing. By designing Winograd hard-core computing units and adding them to the FPGA, the invention differs from implementing the Winograd algorithm directly with existing FPGA resources: it reduces the dependence on LBs, DSPs and the FPGA interconnect during computation and raises the maximum clock frequency.

Description

FPGA architecture for convolutional neural network fast convolutional operation
Technical Field
The invention relates to an FPGA architecture for fast convolution operation of a convolution neural network, and belongs to the technical field of FPGA architectures.
Background
Over the past decade, hardware designs and architectures for accelerating machine learning (ML) algorithms, such as Horizon Robotics' Brain Processing Unit (BPU), IBM TrueNorth, DianNaoYu and Alibaba's Hanguang 800, have emerged in large numbers, offering ever-increasing computing resources and memory bandwidth. Beyond these, many FPGA-based solutions have also been proposed. Baidu's AI cloud computing chip XPU is a cloud acceleration chip based on the FPGA. Xilinx's xNN and Intel's DLA are so-called overlay processors, which map systolic-array-based matrix multipliers onto a generic FPGA. These FPGA-based solutions do not modify the architecture of the FPGA itself; they are implemented with the programmable logic already present on current FPGAs, such as Logic Blocks (LBs), DSP multipliers (Digital Signal Processing, DSP) and block memory units (Random Access Memory, RAM).
Unlike the above designs, another research direction is to change the architecture of the FPGA itself to accelerate ML algorithms. Eldafrawy et al. modified the architecture of the configurable logic blocks (Configurable Logic Block, CLB) to reduce the area consumed by multiplications and additions implemented in soft logic. Aman et al. added variable-precision tensor units to existing FPGAs for ML acceleration. Other works change the architecture of the DSP to improve performance. PIR-DSP modifies the architecture of the DSP48E2 and adds registers to better meet the computational requirements of low-precision deep neural networks. Yuan Dai et al. propose APIR-DSP, an improvement on PIR-DSP that increases computation speed while reducing area consumption.
Approximately 80% to 90% of the operations in a convolutional neural network are convolution calculations, and the Winograd algorithm has been widely shown to accelerate convolution effectively by reducing the number of multiplications. Patent CN111459877A uses Winograd to accelerate the YOLO v2 convolutional neural network. Patent CN113283587A splits convolution kernels whose size is not 3×3 and then accelerates the resulting convolutions with Winograd.
In addition, for the vast majority of neural network applications, fixed-point input data achieves good experimental results while improving speed and reducing power consumption. In general, for networks and scenarios with low precision requirements, an 8-bit data width is sufficient to meet the accuracy requirements.
The tensor units mentioned above are suitable for matrix-multiplication calculations, but are less effective than the Winograd algorithm for computing convolutions. Although patent CN113283587A splits convolution kernels of shapes other than 3×3, the split kernels can still be computed with Winograd of size F(2×2, 3×3). The Winograd algorithm optimizes convolution efficiently by reducing the number of multiplications; for F(2×2, 3×3) it needs 16 multiplications where direct convolution needs 36. However, Winograd designs built from soft logic (LBs and interconnect on the FPGA) are slow and area-inefficient. Since Winograd computation removes the accumulation operations of direct convolution, using a DSP only for multiplication also wastes area. Besides the core domain-conversion modules and the multipliers, a design contains some control logic; building that logic from LBs and FPGA interconnect also slows the overall design. Together these factors make Winograd convolution on an FPGA much slower than a dedicated ASIC.
Disclosure of Invention
The technical problem the invention aims to solve is: how to make the convolution acceleration obtained from the Winograd algorithm on an FPGA, which is currently implemented with the existing LBs, DSPs and interconnect resources, reach and exceed the acceleration achievable with an ASIC.
To solve the above technical problem, the technical solution provided by the invention is as follows: an FPGA architecture for fast convolution operations of convolutional neural networks, comprising a plurality of Winograd hard-core computing units, the Winograd hard-core computing units being distributed in the FPGA in a loose layout;
each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output;
the image data conversion module, the weight conversion module and the output conversion module are based on the fast multiplier and are realized through displacement and addition operation;
the point multiplication module is realized by a base 4-Booth encoder and a Wallace tree;
arranging in a loose mode, wherein LBs of the FPGA is arranged between the Winograd hard core computing units for spacing.
A further improvement of the scheme is: the two 8-bit numbers received at the input end of the dot-multiplication module are the multiplicand and the multiplier; the multiplier is encoded by a radix-4 Booth encoder, the encoded multiplier and the multiplicand generate 4 partial products, the partial products are fed into a Wallace tree for 4:2 compression, and the compressed result is summed by a carry-lookahead adder to obtain the final result, which is output from the output end to the output conversion module.
The beneficial effects brought by the invention are as follows: by designing Winograd hard-core computing units and adding them to the FPGA, the invention differs from implementing the Winograd algorithm directly with the resources already on the FPGA; it reduces the dependence on LBs, DSPs and the FPGA interconnect during computation and raises the maximum clock frequency.
In addition, circuits with specific functions on an FPGA are built by connecting the various modules through interconnect resources; the loose topology reduces the number of routing channels and the switch-box size required by the Winograd algorithm when computing convolutions, and thereby reduces the area-delay product.
Drawings
The invention is further described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a Winograd hard-core computing unit of an FPGA architecture for fast convolution operations of a convolutional neural network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the dot-multiplication module of an FPGA architecture for fast convolution operations of a convolutional neural network according to an embodiment of the present invention.
Fig. 3 illustrates the fast-multiplier encoding scheme according to an embodiment of the present invention.
FIG. 4 compares the areas of a systolic-array design, an implementation using the FPGA's existing resources, and an FPGA design containing Winograd hard cores; the area is broken down into logic-block area and routing area.
FIG. 5 shows the percentage reduction in design area of the FPGA implementation containing Winograd hard cores compared with the implementation using the FPGA's existing resources, in an embodiment of the present invention.
FIG. 6 shows a frequency comparison of a systolic array, an implementation using the FPGA's existing resources, and an FPGA design containing Winograd hard cores, in an embodiment of the present invention.
FIG. 7 shows the percentage improvement in design frequency of the FPGA implementation containing Winograd hard cores compared with the implementation using the FPGA's existing resources.
Fig. 8 is a partial schematic diagram (only the lower-left corner of the overall architecture) of the three topologies loose, columnar and dense in an embodiment of the invention.
Fig. 9 shows the design area consumption of the loose, columnar and dense topology implementations in an embodiment of the present invention.
Fig. 10 shows the percentage reduction in area achieved by the loose and columnar topologies compared with the dense topology in the embodiment of the invention.
FIG. 11 shows the number of routing channels required by the designs in the three topologies loose, columnar and dense in an embodiment of the invention.
FIG. 12 shows the percentage reduction in the number of routing channels achieved by the loose and columnar topologies compared with the dense topology.
Fig. 13 shows the design frequencies achieved in the three topologies loose, columnar and dense in an embodiment of the invention.
Fig. 14 shows the percentage increase in frequency of the loose and columnar topologies compared with the dense topology in the embodiment of the invention.
FIG. 15 shows the area-delay products of the three topologies loose, columnar and dense, expressed as the percentage reduction relative to the dense topology.
Detailed Description
Examples
The FPGA architecture for convolutional neural network fast convolution operations of this embodiment comprises a plurality of Winograd hard-core computing units, the Winograd hard-core computing units being arranged in the FPGA in a loose layout.
Each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output.
The image data conversion module, the weight conversion module and the output conversion module are realized through shift and addition operations, without multipliers.
The dot-multiplication module is realized with a radix-4 Booth encoder and a Wallace tree.
In the loose layout, LBs of the FPGA are placed between the Winograd hard-core computing units as spacing.
The Winograd hard-core computing unit is based on the two-dimensional Winograd algorithm formula:
Y = A^T[(G g G^T) ⊙ (B^T d B)]A
where ⊙ denotes the matrix element-wise (Hadamard) product, Y is the convolution result, G is the convolution-kernel transform matrix, B^T is the input-image transform matrix, A^T is the output transform matrix, g is a 3×3 convolution kernel and d is a 4×4 tile of input image data.
For a two-dimensional convolution, let the output size be m×m and the convolution-kernel size be r×r; the two-dimensional Winograd convolution is then denoted F(m×m, r×r). The matrices G, B^T and A^T of the F(2×2, 3×3) Winograd algorithm employed in the present invention are shown below:
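A commonly used choice for these transform matrices in the Winograd convolution literature, consistent with the shift-and-add transform circuits described below, is (taken from the standard F(2×2, 3×3) construction rather than from the patent itself):

$$
B^T=\begin{bmatrix}1&0&-1&0\\0&1&1&0\\0&-1&1&0\\0&1&0&-1\end{bmatrix},\qquad
G=\begin{bmatrix}1&0&0\\\tfrac{1}{2}&\tfrac{1}{2}&\tfrac{1}{2}\\\tfrac{1}{2}&-\tfrac{1}{2}&\tfrac{1}{2}\\0&0&1\end{bmatrix},\qquad
A^T=\begin{bmatrix}1&1&1&0\\0&1&-1&-1\end{bmatrix}
$$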
the designed Winograd hard kernel computing unit is shown in figure 1 and comprises an image data transformation module, a weight transformation module, a dot multiplication module and an output transformation module.
All the transform circuits of Winograd of size F(2×2, 3×3) can be realized with additions and shifts; no multiplication is needed, which effectively reduces resource consumption, as the input-transform sketch below illustrates.
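A minimal behavioural sketch of the input transform B^T d B, assuming the commonly used transform matrices given above; the function and variable names are illustrative and not taken from the patent. Every operation is an addition or subtraction, matching the shift-and-add claim:

```python
def winograd_input_transform(d):
    """Compute B^T * d * B for a 4x4 input tile d using only add/subtract.

    Assumes B^T = [[1,0,-1,0],[0,1,1,0],[0,-1,1,0],[0,1,0,-1]].
    """
    # Left-multiply by B^T: each output row is a sum/difference of rows of d.
    t = [
        [d[0][j] - d[2][j] for j in range(4)],
        [d[1][j] + d[2][j] for j in range(4)],
        [d[2][j] - d[1][j] for j in range(4)],
        [d[1][j] - d[3][j] for j in range(4)],
    ]
    # Right-multiply by B: apply the same pattern to the columns of t.
    return [
        [t[i][0] - t[i][2], t[i][1] + t[i][2], t[i][2] - t[i][1], t[i][1] - t[i][3]]
        for i in range(4)
    ]

# Example: with identical rows, only the second row of the transformed tile is non-zero.
tile = [[1, 2, 3, 4] for _ in range(4)]
print(winograd_input_transform(tile))
```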
The matrix dot-product circuit consists of 16 8-bit fast multipliers. The multiplier structure is shown in FIG. 2, where A is the multiplicand and B is the multiplier; it comprises a radix-4 Booth encoding module (Booth Enc), a partial-product generation unit (Gen Prod), the partial products (Partial Prod), a 4:2 compressor (4:2 CSA), the carry data (Carry), the pseudo sum (Sum), and a carry-lookahead adder (LCA). The calculation proceeds as follows: the multiplier B is encoded by the radix-4 Booth encoder, the encoded result is combined with the multiplicand to generate 4 partial products, the partial products are sent to the 4:2 compressor for compression, and the compression result is summed by the LCA to obtain the final multiplication result.
The 8-bit multiplier B is encoded (Encode, Enc) as shown in FIG. 3: an auxiliary bit 0 is appended on the rightmost side, and groups of 3 adjacent bits are taken successively from low to high, with neighbouring groups overlapping by one bit. Enc_1 to Enc_4 are the 4 groups thus obtained, and the three bits of each group are B_(i+1), B_i and B_(i-1). Each group is encoded according to the formula below to obtain 4 encoding results, which are sent to the partial-product generator to produce, together with the multiplicand A, the 4 partial products P_1 to P_4. Because the Enc encoding results can only be -2, -1, 0, 1 or 2, the multiplication with A is achieved by shifting alone.
Enc = -2·B_(i+1) + B_i + B_(i-1)
PartialProd = Enc · A
The 4:2 CSA has 4 input ports, which receive the 4 partial products P_1, P_2, P_3 and P_4, and 2 output ports, which produce the carry data Carry and the pseudo sum Sum.
The final 16-bit carry-lookahead adder is formed by cascading 4-bit carry-lookahead adders.
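The recoding and partial-product arithmetic described above can be sketched in Python as follows; this is a behavioural model of the radix-4 Booth recoding and of the summation that the 4:2 compressor tree and carry-lookahead adder perform in hardware, with illustrative names that are not taken from the patent:

```python
def booth_radix4_encode(b: int):
    """Radix-4 Booth recoding of an 8-bit two's-complement multiplier.

    An auxiliary 0 bit is appended on the right, then overlapping 3-bit groups
    are scanned from low to high, giving Enc1..Enc4 in {-2, -1, 0, 1, 2}.
    """
    bits = (b & 0xFF) << 1                       # append auxiliary bit 0
    digits = []
    for k in range(4):                           # Enc1 .. Enc4
        group = (bits >> (2 * k)) & 0b111        # bits B(i+1), B(i), B(i-1)
        b_im1 = group & 1
        b_i = (group >> 1) & 1
        b_ip1 = (group >> 2) & 1
        digits.append(-2 * b_ip1 + b_i + b_im1)  # Enc = -2*B(i+1) + B(i) + B(i-1)
    return digits

def booth_multiply(a: int, b: int) -> int:
    """Multiply signed 8-bit a and b by summing the 4 Booth partial products.

    Each partial product Enc_k * a needs only a shift and possibly a negation;
    the hardware sums them with the 4:2 compressor and carry-lookahead adder,
    modelled here by the plain Python sum.
    """
    partials = [enc * a << (2 * k) for k, enc in enumerate(booth_radix4_encode(b))]
    return sum(partials)

assert booth_multiply(-113, 87) == -113 * 87     # behaves like ordinary multiplication
```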
DC synthesis is performed on the Verilog code of the Winograd computing unit using the FreePDK45 library, and the resulting delay and area are scaled to the 20 nm process of the FPGA architecture used for the architecture design. The hard core is given the same rectangular shape as the DSP block so that it can be placed in the FPGA architecture. The input/output pins of the hard core are defined as evenly distributed for better routability.
The Winograd hard-core computing unit of this embodiment is placed on the FPGA through an architecture designed with VTR. The FPGA architecture in VTR is represented by an XML architecture description file.
The XML architecture description file has <architecture> as its top layer, below which are the modules <models>, <tiles>, <layout>, <device>, <switchlist>, <segmentlist>, <directlist> and <complexblocklist>.
The FPGA architecture is modified on the basis of the Stratix IV XML architecture description file; adding the Winograd hard-core computing unit requires modifying the modelling of 4 modules under the top layer.
①<models>
<model> is used to declare the model names used in the blif netlist file. To instantiate a Winograd hard-core computing unit in the netlist, the model name and pin names must be declared in the XML.
When defining the pins, combinational_sink_ports must also be used to model the pin timing dependencies, establishing the dependency between inputs and outputs.
②<complexblocklist>
The <pb_type> under <complexblocklist> describes the internal structure of the Winograd hard-core computing unit, modelling its interior.
The modelling in <pb_type> has two layers: a top-level module and modelling primitives.
The top-level module must declare the module name and all port information.
Modelling primitives are the lowest layer in the hierarchy. Primitives correspond to elements that appear in the user netlist before the packing stage, and the models in the blif file must be described as modelling primitives inside <pb_type>. The Winograd hard-core computing unit is described as a black box, meaning that the interior of the hard core is not described in detail; only its ports and internal delays are given. The circuit critical-path delay obtained from DC synthesis is set using the delay_constant attribute in the modelling primitive.
Interconnect within the module is also required between the top-level module and the primitive module. The ports and pins used in the primitive module are connected to the ports declared by the top-level module through interconnect elements. <interconnect> is at the same level as the modelling primitives.
③<tiles>
<tiles> describes the external structure of the Winograd hard-core computing unit. The <tile> contains the name, length, width, area, number of input/output pins, pin positions, and how many wires each pin connects to. The circuit area obtained from DC synthesis is set in <tile>.
④<layout>
<layout> defines the layout information of the FPGA architecture; the various physical blocks in <tiles> are arranged in the grid in a prescribed order. Cells within the grid have priorities, with a high priority overriding a low one. The outermost ring of I/O has the highest priority, the interior is filled with LBs of the lowest priority, and the hard cores, with the next-highest priority, are then placed in the grid. The Winograd hard-core computing units must be arranged in the FPGA architecture according to a fixed rule.
To verify the effect of the Winograd hard-core computing unit, the test circuit is given the same convolution computing capability as a systolic array, i.e. the number of convolution frames that can be computed in the same number of clock cycles is identical. The Winograd circuit is implemented both on an FPGA containing only DSPs, LBs and input/output ports (IO) and on an FPGA containing Winograd hard-core computing units; the comparison covers the area consumed on the FPGA and the critical-path delay.
The channel width is the minimum number of routing channels the FPGA needs without affecting the circuit's maximum frequency; the minimum number of channels is found with a binary search. Two boundary variables delimit the search range and a third variable holds the midpoint of the two boundaries; the midpoint is compared against the target, and, on the premise that the maximum clock frequency at the right boundary stays unchanged, the boundary values are updated so that the search interval shrinks and a new midpoint is determined. This is repeated until the channel width is the minimum at which the maximum clock frequency is unchanged, at which point the loop exits (a sketch of this search is given below).
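A minimal sketch of this channel-width search, assuming a hypothetical route_and_report(width) helper that runs place-and-route at the given channel width and returns the achieved maximum clock frequency; the helper and all names are illustrative and not part of the patent or of any tool's API:

```python
def minimum_channel_width(route_and_report, low: int, high: int) -> int:
    """Binary-search the smallest channel width that preserves the max frequency.

    route_and_report(width) -> maximum clock frequency achieved at that width.
    `high` must be a width already known to reach the reference frequency.
    """
    target_fmax = route_and_report(high)   # frequency at the right boundary
    while low < high:
        mid = (low + high) // 2
        if route_and_report(mid) >= target_fmax:
            high = mid                     # mid is wide enough: shrink from the right
        else:
            low = mid + 1                  # too narrow: shrink from the left
    return high
```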
Figures 4, 5, 6 and 7 show the results for Winograd designs given the same convolution computing capability as systolic arrays, i.e. the frequency and area changes when convolutions are computed with systolic arrays of different sizes and with F(2×2, 3×3) Winograd hard cores (using the loose placement strategy). For a Winograd design with the same convolution computing capability as a 32×32 systolic array, using F(2×2, 3×3) Winograd hard cores reduces the total area by 53% and raises the clock frequency by 72% compared with the soft-logic implementation. As can be seen in FIG. 4, most of the area saved by the Winograd hard cores is area that was consumed by the interconnect.
FIG. 8 shows the three different topologies of Winograd hard-core computing units on the FPGA: loose, columnar and dense, from left to right. As shown in Figs. 9 and 10, the dense architecture has the largest total area consumption, while the loose architecture has the lowest of the three topologies, 43% less than the dense architecture. As shown in Figs. 11 and 12, the columnar architecture needs a larger channel width than the loose architecture: because the Winograd hard cores carry a large amount of data, packing them close together in columns causes heavier routing congestion, and the larger channel width also increases area consumption; at the same time the shorter paths reduce delay, so the maximum clock frequency is 28% higher than that of the dense architecture, as shown in Figs. 13 and 14. As shown in Fig. 15, taking the area-delay product as the evaluation criterion, the loose architecture has the smallest area-delay product.

Claims (2)

1. An FPGA architecture for fast convolution operations of a convolutional neural network, characterized in that: the FPGA architecture comprises a plurality of Winograd hard-core computing units, and the Winograd hard-core computing units are arranged in the FPGA in a loose layout;
each Winograd hard-core computing unit comprises an image data conversion module, a weight conversion module, a dot-multiplication module based on fast multipliers and an output conversion module; the input ends of the weight conversion module and the image data conversion module receive data, their output ends feed the dot-multiplication module, the output end of the dot-multiplication module feeds the input end of the output conversion module, and the output end of the output conversion module drives the unit's output;
the image data conversion module, the weight conversion module and the output conversion module are realized through shift and addition operations;
the dot-multiplication module is realized with a radix-4 Booth encoder and a Wallace tree;
in the loose layout, the Winograd hard cores are rectangular, 3 LBs are placed between a Winograd hard core and a DSP, 1 LB is placed between adjacent Winograd hard cores, and 1 LB is placed between a Winograd hard core and the I/O.
2. The FPGA architecture for convolutional neural network fast convolution operations of claim 1, characterized in that: the two 8-bit numbers received at the input end of the dot-multiplication module are the multiplicand and the multiplier; the multiplier is encoded by a radix-4 Booth encoder, the encoded multiplier and the multiplicand generate 4 partial products, the partial products are fed into a Wallace tree for 4:2 CSA compression, and the compressed result is summed by a carry-lookahead adder to obtain the final result, which is output from the output end to the output conversion module.
CN202211112093.8A 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation Active CN115329951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211112093.8A CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211112093.8A CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Publications (2)

Publication Number Publication Date
CN115329951A CN115329951A (en) 2022-11-11
CN115329951B true CN115329951B (en) 2023-09-15

Family

ID=83930414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211112093.8A Active CN115329951B (en) 2022-09-13 2022-09-13 FPGA architecture for convolutional neural network fast convolutional operation

Country Status (1)

Country Link
CN (1) CN115329951B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355232A (en) * 2011-07-29 2012-02-15 北京航空航天大学 FPGA (field-programmable gate array)-based high-speed FIR (finite impulse response) digital filter
CN109447241A (en) * 2018-09-29 2019-03-08 西安交通大学 A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN110766128A (en) * 2018-07-26 2020-02-07 北京深鉴智能科技有限公司 Convolution calculation unit, calculation method and neural network calculation platform
WO2021067230A1 (en) * 2019-09-30 2021-04-08 Board Of Regents, The University Of Texas System Field programmable gate array architecture optimized for machine learning applications
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113283587A (en) * 2021-05-28 2021-08-20 西安交通大学 Winograd convolution operation acceleration method and acceleration module
CN114399036A (en) * 2022-01-12 2022-04-26 电子科技大学 Efficient convolution calculation unit based on one-dimensional Winograd algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chi Wai Yu, et al., "Routing Optimization For Hybrid FPGAs", 2009 International Conference on Field-Programmable Technology, pp. 419-422 *
潘明海 et al., "An FPGA-based FFT structure" (一种基于FPGA实现的FFT结构), 微计算机信息 (Microcomputer Information), No. 16, pp. 156-158 *

Also Published As

Publication number Publication date
CN115329951A (en) 2022-11-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant