WO2020215124A1 - Improved hardware primitive for implementations of deep neural networks - Google Patents

Improved hardware primitive for implementations of deep neural networks

Info

Publication number
WO2020215124A1
WO2020215124A1 (PCT/AU2020/050395)
Authority
WO
WIPO (PCT)
Prior art keywords
dsp
precision
data
low
multiplier
Prior art date
Application number
PCT/AU2020/050395
Other languages
English (en)
Inventor
SeyedRamin Rasoulinezhad
Philip Leong
Original Assignee
The University Of Sydney
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2019901416A
Application filed by The University Of Sydney
Publication of WO2020215124A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48 Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802 Special implementations
    • G06F2207/4818 Threshold devices
    • G06F2207/4824 Neural networks

Definitions

  • The present invention provides systems and methods for improvements in deep neural network architectures.
  • DNNs: deep neural networks
  • AlexNet, proposed in 2012, requires 724M floating-point operations (FLOPs) on 61M parameters in a 5-layer network to achieve a 15.3% top-5 error rate on ImageNet
  • ResNet-152, a state-of-the-art convolutional neural network (CNN), uses 11.3B FLOPs over 152 layers to improve the top-5 error to 3.6% [3]
  • Modern accelerators have strived to decrease the memory footprint and computation requirements of CNNs with minimal compromise in accuracy by using low-precision arithmetic operations, particularly for inference [4], [5], [6], [7]
  • Reference [1] compared the implementation of multiply-accumulate (MAC) units with different wordlengths on Xilinx and Intel FPGAs. They reported that by using fixed-point 8×8-bit operations instead of single-precision floating point, logic resources are reduced by a factor of 10-50. This idea has been taken to its conclusion with ternary and binary operations, which achieve extremely high speed and low energy on FPGA platforms [8], [9]
  • DSP: hard digital signal processing block
  • a flexible-precision, run-time decomposable multiplier system for FPGA architectures, which preferably can include run-time precision control.
  • the decomposition can be provided by divide-and-conquer (partitioning) and recursive twin-precision techniques.
  • the system can further include a DSP to DSP interconnect, including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.
  • the chaining capability preferably can include allowing a 1D DSP column to be operated in a semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when streaming data.
  • the data can be forwarded from one DSP to at least two DSPs in data chaining.
  • Predetermined register files within the FPGA architectures are preferably also configured as FIFO data structures.
  • the system implements a series of convolution layers of a Deep Neural Network architecture.
  • a DSP to DSP interconnect including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.
  • the chaining capability preferably can include allowing a 1D DSP column to be operated in a semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when streaming data.
  • an FPGA architecture configuring register files within the FPGA architecture as dual-use FIFO data structures.
  • Fig. 1 illustrates a Xilinx DSP48E2 schematic
  • FIG. 2 is a schematic block diagram of a modified DSP48E2 device
  • FIG. 3 is a schematic block diagram of the divide and conquer technique used in the embodiments.
  • FIG. 5 illustrates conventional and modified processing elements used in the embodiments, showing: (a) the 2D processing unit architecture of [11]; (b) a 3×3 convolution layer implementation on the 2D architecture; (c) our semi-2D DSP arrangement; (d) a conventional FPGA column-based arrangement
  • Fig. 6 illustrates the proposed implementation for standard and depth-wise (DW) convolution layers (left), and point-wise (PW) convolution layers (right);
  • Fig. 7 illustrates an implementation approach for different DW convolution kernel sizes.
  • the embodiments provide a novel precision-, interconnect- and reuse-optimised DSP block (PIR-DSP), which is optimised for implementing area-efficient DNNs.
  • PIR-DSP: the optimised DSP block
  • Interconnect: a DSP interconnection scheme which provides support for semi-2D connections and low-precision streaming.
  • PIR-DSP is implemented as a parameterized module generator which can target FPGAs or ASICs.
  • K is the D_K×D_K×M depth-wise kernel, and the m-th filter of K is applied to the m-th channel of F to produce the m-th channel of G.
  • Linear combinations of the M depth-wise layer outputs are then used to form the N outputs, these being called 1×1 point-wise convolutions.
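As a concrete reference for this depth-wise/point-wise decomposition, the following minimal NumPy sketch (the shapes, names, and valid-padding/stride-1 choices are ours, not the patent's) applies a depth-wise stage followed by 1×1 point-wise channel mixing:

```python
import numpy as np

def depthwise_separable_conv(F, K_dw, K_pw):
    """Depth-wise separable convolution (valid padding, stride 1).

    F:    input feature map, shape (H, W, M)
    K_dw: depth-wise kernel,  shape (Dk, Dk, M) -- the m-th filter is applied
          to the m-th channel of F to give the m-th channel of G
    K_pw: point-wise kernel,  shape (M, N) -- 1x1 convolutions mixing channels
    """
    H, W, M = F.shape
    Dk = K_dw.shape[0]
    Ho, Wo = H - Dk + 1, W - Dk + 1

    # Depth-wise stage: each channel convolved with its own Dk x Dk filter.
    G = np.zeros((Ho, Wo, M))
    for m in range(M):
        for i in range(Ho):
            for j in range(Wo):
                G[i, j, m] = np.sum(F[i:i + Dk, j:j + Dk, m] * K_dw[:, :, m])

    # Point-wise stage: the N outputs are linear combinations of M channels.
    return G @ K_pw  # shape (Ho, Wo, N)
```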
  • Modern GPUs are presently the most popular solution for high-performance DNN implementation, and Google's Tensor Processing Unit is an application-specific integrated circuit (ASIC) for accelerating DNNs [34]
  • ASIC: application-specific integrated circuit
  • FPGA architectures are more customizable and can support arbitrary precision MAC operations using fine-grained logic resources [8], [9], [35], [36], [37]
  • FPGA systems are able to efficiently implement a range of parallel and sequential computations. They allow the data path to be better customized for an application, enabling designs to be more highly optimized, particularly in inference for processing single input feature maps (to minimize latency) and to support low precision. Datapaths are most efficient when operations can be implemented using hard DSP resources.
  • Xilinx DSP48E2: The Xilinx DSP48E2 DSP [40] in the UltraScale architecture can perform 27×18 MAC operations and is illustrated in Fig. 1. It includes a 27-bit pre-adder, 48-bit accumulator, and 48-bit arithmetic logic unit (ALU). Dual SIMD 24-bit or quad 12-bit ADD/SUB operations can be computed in the ALU, and other DSP48E2 features include pattern matching and 1D unidirectional chaining connections.
  • the DSPs can be cascaded to form a higher precision multiplier, and optional pipeline registers are present. In the DSP48E2, the SIMD wordlength can be changed at run-time.
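To make the SIMD behaviour concrete, here is a small software model of lane-partitioned addition (our own illustrative sketch, not Xilinx code): the wide ALU word is split into independent lanes and carry propagation is blocked at lane boundaries, as in the quad 12-bit mode.

```python
def simd_add(a, b, total_bits=48, lane_bits=12):
    """Add two packed words lane by lane, blocking carries between lanes."""
    mask = (1 << lane_bits) - 1
    result = 0
    for lane in range(total_bits // lane_bits):
        la = (a >> (lane * lane_bits)) & mask
        lb = (b >> (lane * lane_bits)) & mask
        result |= ((la + lb) & mask) << (lane * lane_bits)  # carry stays in-lane
    return result

# Quad 12-bit mode: 0xFFF + 0x001 wraps within its lane instead of carrying out.
assert simd_add(0xFFF, 0x001) == 0x000
```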
  • Intel DSPs: The Intel DSP [41] supports one 27×27 or two 18×18 multiplications.
  • Precision is compile-time rather than run-time configurable and there is no pattern matching unit.
  • a pre-adder is implemented, as well as two read-only register files (RFs) which can be initialized at compile time and jointly operated as a higher-precision RF.
  • RF: register file
  • the multiplier decomposition strategy is based on two approaches: Divide-and-Conquer and Recursive Twin-Precision.
  • a signed 2's complement number can be represented as the sum of one signed (the most significant part) and an unsigned term, A = A_s · 2^k + A_u, where the k-th bit is the dividing point and A_s and A_u are respectively the signed and unsigned portions.
  • Equation 4 is applied to an N×M-bit multiplier with chopping size C, where N, M, and C are respectively 27, 18, and 9.
  • In Fig. 3(a), standard multiplication is done by adding six partial results with appropriate shifts.
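A rough reconstruction of the chopped product may help here (Equation 4 itself is not reproduced in this extract, so the indexing convention below is ours): splitting the operands into C-bit chunks gives

```latex
A \times B \;=\; \sum_{i=0}^{N/C-1} \; \sum_{j=0}^{M/C-1} A_i \, B_j \, 2^{C(i+j)},
\qquad N = 27,\; M = 18,\; C = 9,
```

which yields (27/9)(18/9) = 6 partial products, matching the six shifted terms of Fig. 3(a).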
  • Fig. 3(b) shows that by controlling the shift steps for the first, fourth and fifth partial results, the summation can be arranged into two separate columns, where each column calculates a three-C×C-bit MAC operation with separate carry-in signals.
  • the multiplier is parameterized by chopping factors (separately for each of the two inputs) and the depth.
  • chopping factors: the numbers of times we chop M and N
  • k: the recursive twin-precision depth factor
  • a generator can be developed which uses these techniques to convert any size multiplier to a MAC-IP.
  • a sign-magnitude format is used so each operand can be signed or unsigned, this being controllable at run-time.
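The following Python sketch illustrates the divide-and-conquer decomposition for the unsigned case (chunk handling and names are ours; the patent additionally handles signedness via the twin-precision split and run-time sign-magnitude control):

```python
def chop(x, chunks, width):
    """Split x into `chunks` unsigned fields of `width` bits, LSB first."""
    return [(x >> (i * width)) & ((1 << width) - 1) for i in range(chunks)]

def divide_and_conquer_mul(a, b, n_chunks=3, m_chunks=2, c=9):
    """Rebuild a 27x18 product from 9x9 partial products.

    The full product is the shifted sum of (N/C)*(M/C) = 6 small products;
    the decomposed DSP can instead route these into independent
    low-precision MAC columns.
    """
    total = 0
    for i, ai in enumerate(chop(a, n_chunks, c)):
        for j, bj in enumerate(chop(b, m_chunks, c)):
            total += (ai * bj) << (c * (i + j))
    return total

a = 0x5A5A5A5 & ((1 << 27) - 1)  # arbitrary 27-bit operand
b = 0x2BDBD & ((1 << 18) - 1)    # arbitrary 18-bit operand
assert divide_and_conquer_mul(a, b) == a * b
```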
  • FIG. 5(a) shows a 2D PE architecture, proposed in reference [11], which is an N×M mesh network of PEs with unidirectional communications occurring in the horizontal, vertical and diagonal directions.
  • In Figure 5(b), a 3×3 convolutional layer is assigned to three rows of the PEs. By rearranging this three-row architecture as shown in Figure 5(c), we organize them as a column.
  • Figure 5(d) shows a column-based connection which is capable of forwarding the data/result to the next DSP block. This addresses the difficulty of implementing a 2D interconnection on a 1D array by supporting data forwarding to two DSPs instead of a single one. This is particularly effective for the case where one dimension is small (e.g. 3 elements for 3×3 convolutional layers).
  • each input/parameter takes part in many MAC operations, so it is important to cache fetched data. Since data movement contributes more to energy consumption than computation, this leads to higher performance and reduced energy [11], [12]
  • Xilinx DSP-blocks do not support caching of data (this is done using the fine-grained resources or hard memory blocks).
  • Intel DSPs do include a small embedded memory for each 18-bit multiplier, but they cannot be configured at run-time and hence can only be used efficiently for fixed coefficients, making them unsuitable for buffering of data for practical sized DNNs.
  • a small and flexible first-in-first-out register file can be provided to enhance data reuse.
  • This is a wide shift register that can be loaded sequentially and read by two standard read ports.
  • the two read port address signals can be provided from outside the DSP-block.
  • the first is used inside the DSP and supplies the requested word and the next word to the multiplier and multiplexer units (two 27-bit read ports are needed to feed the multiplier).
  • the other port is used to select the data for DSP-DSP chaining connections.
  • Since the RFs are mostly used to buffer a chunk of data inside the DSP, writes always occur as a burst.
  • we arrange the RF as a flexible FIFO. By adjusting the FIFO length, systolic arrays with different buffering patterns can be realized.
  • The schematic of our implemented FIFO/RF is given in Figure 2; it operates on input A.
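A behavioural model of this dual-use RF/FIFO may clarify the idea (the class name, depth, and API below are our own assumptions, not the patent's interface): words shift in sequentially, while two independently addressed read ports serve the multiplier and the DSP-DSP chain.

```python
from collections import deque

class FifoRegisterFile:
    """Wide shift register readable through two standard read ports."""

    def __init__(self, depth=8, width_bits=27):
        self.mask = (1 << width_bits) - 1
        self.regs = deque([0] * depth, maxlen=depth)

    def shift_in(self, word):
        # Sequential, burst-friendly load; the oldest entry falls off the end.
        self.regs.appendleft(word & self.mask)

    def read(self, addr_mult, addr_chain):
        # Two externally addressed read ports: one feeds the multiplier,
        # the other selects data for the DSP-DSP chaining connection.
        return self.regs[addr_mult], self.regs[addr_chain]

rf = FifoRegisterFile(depth=8)
for w in (11, 22, 33):
    rf.shift_in(w)
print(rf.read(0, 2))  # (33, 11): the newest word and a reused older word
```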
  • The DSP48E2 is the most recent version, including three major architectural upgrades: a wider multiplier unit (27×18 instead of 25×18), a pre-adder module, and a wide XOR circuit [49]
  • the baseline DSP48E2 multiplier produces two temporary results, and these are added using the ALU to produce the final MAC output.
  • Modifications to the ALU also required replacing the DSP48E2 12/24/48-bit SIMD add/sub operations with a 4/8/18/48-bit SIMD, which leads to smaller, width-variant ALUs since they must be aligned with the carry-propagation blocking points, as shown in Figure 2.
  • Table III shows the area and performance results for different PIR-DSP variations.
  • By upgrading the multiplier to a 27×18C32D2 MAC-IP, improvements in MAC capability of 6×, 12× and 24× for 9-, 4- and 2-bit MAC operations respectively are gained, at the cost of a 14% increase in area.
  • Configurations # 1 to #3 in Table III show the synthesis results obtained by simply replacing the multiplier and ALU units.
  • Configuration #4 is achieved by modifying the multiplier (in the 27×18C32D2 configuration) and including the interconnect optimization.
  • Configuration #5 is the final implementation of PIR-DSP which includes all three modifications.
  • The RF width and size are selected, respectively, to fully feed the multiplier/pre-adder in high/low precision and to be similar to the Intel DSP-block read-only RFs, which are configured as two 8×18-bit memories per DSP.
  • To estimate the energy of input B, which operates as a shift register (SR) and as a normal register, we used results for high-performance [53] and low-energy [54] flip-flops (FFs) to obtain estimates of 180 fJ and 90 fJ respectively.
  • The energy required to transfer data from DSP to DSP was obtained from reference [55] and scaled to 65nm technology, giving 2 pJ per byte.
  • The energy consumption of 9/4/2-bit MAC operations is 89/44/22× that of a 9-bit register.
  • Table V summarizes the estimated energy ratios for data movement. We further assume that all elements (except the MAC) scale linearly with word length.
  • Each filter element and each input element are respectively used F_h×F_w and K_h×K_w times.
  • the average energy for the described data flow, where E_MAC is the energy consumption of the MAC computation, is given by the expression below.
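The equation is not reproduced in this extract; a plausible form consistent with the stated reuse counts (our reconstruction, not the patent's verbatim formula) amortizes each access over its reuses:

```latex
E_{\text{avg}} \;=\; E_{\text{MAC}}
\;+\; \frac{E_{\text{filter access}}}{F_h \times F_w}
\;+\; \frac{E_{\text{input access}}}{K_h \times K_w}
```

since each fetched filter element serves F_h×F_w MACs and each input element serves K_h×K_w MACs.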
  • each PIR-DSP can compute 2/4/8 sets of three-MAC operations for 9/4/2-bit precision.
  • Each three-MAC operation can be used for a row of a 3x3 DW kernel.
  • By cascading three PIR-DSPs, we can sum the partial outputs to produce the final output feature map elements.
  • each PIR-DSP receives two streams of 9-bit data (as each PIR-DSP can compute two parallel three-MAC operations).
  • the three-cascaded PIR-DSPs can forward two of their streams to the next three-cascaded PIR-DSP over the DSP-DSP chains, and we can implement K rows of 2/4/8 channels of the output matrix for 9/4/2-bit precision using a column of 3K PIR-DSPs.
  • E_input then becomes a function of NoF, the number of forwardings over chains for each input stream (2 in our case, as each row of the input stream is involved in three rows of the output feature map).
  • a kernel tiling approach with tile sizes of 3×3, 2×3, and 1×3, which are respectively the computation capabilities of a three-cascaded group, a two-cascaded group, and a single PIR-DSP.
  • implementing a 5×5 kernel can be done with 2× three-cascaded and 2× two-cascaded DSP groups, where NoF is 6.
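A short sketch of this tiling (the traversal order and remainder handling are our guesses) splits the kernel into row groups of at most three, each served by a correspondingly deep cascade:

```python
def tile_kernel(kh, kw):
    """Tile a Kh x Kw kernel into blocks of at most 3 rows x 3 columns.

    A tile with r rows maps onto an r-cascaded PIR-DSP group (3x3, 2x3
    and 1x3 tiles match three-, two- and one-DSP groups respectively).
    """
    tiles = []
    r = 0
    while r < kh:
        rows = min(3, kh - r)  # prefer the deepest available cascade
        c = 0
        while c < kw:
            cols = min(3, kw - c)
            tiles.append((rows, cols))  # (cascade depth, columns covered)
            c += cols
        r += rows
    return tiles

# A 5x5 kernel: two tiles served by three-cascaded groups and two by
# two-cascaded groups, matching the 2x + 2x grouping described above.
print(tile_kernel(5, 5))  # [(3, 3), (3, 2), (2, 3), (2, 2)]
```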
  • E_input can be reduced by a factor of RFsize, according to the last line of Table VI.
  • the calculated access energy ratio in the last column indicates that PIR-DSP uses 31% of the data access energy for a middle bottleneck layer of MobileNetV2 [14], which applies 192 depth-wise 3×3 filters to an input feature map of shape 56²×192.
  • each input channel can be streamed into a DSP to be multiplied by the corresponding weight parameter, producing a partial result which is cascaded and summed to produce an entry of the output feature map.
  • In a PIR-DSP implementation, we assign three channels of the input and the three corresponding channels of 2/4/8 PW kernels to a PIR-DSP, depending on the operation precision.
  • Using 2, 4, or 8 three-MAC operations, the PIR-DSP computes in parallel the partial results of applying each filter to the same input stream (the stream carries one element from each of three channels of the input feature map per cycle). By cascading, this becomes 2, 4, or 8 six-MAC operations (computing six elements of the PW kernels).
  • each two-cascaded PIR-DSP group can forward its streams to the next two-cascaded group, which leads to the energy reduction summarized in Table VI.
  • The PIR-DSP uses saved weights and performs a MAC with the 2/4/8 three-channel weight parameters, which are stored in two 27-bit registers.
  • the RF improves input data reuse.
  • BitFusion is an ASIC DNN accelerator supporting multi-precision MACs. The reported area is for a computation unit including 16 Bit-bricks and supporting 8×8 multipliers, in 45nm technology. This unit is similar to our 27×18C32D2 MAC-IP (Table II), although BitFusion is more flexible as it supports more variations, including 2×4, 2×8 and 4×8. Table VIII compares performance per area (PPA).
  • Boutros et al. proposed improvements to the Intel DSP block [45]; their design is capable of 27×27 and reduced-precision MACs down to 4 bits.
  • PIR-DSP is a flexible module generator, can support precisions down to 2 bits, and has better performance per area at 8×8 bits and lower, but is worse at 16×16 and higher. It is not possible to compare energy directly, but we would expect the Boutros design to be similar to the baseline case in Table VI, with PIR-DSP having significant advantages due to its interconnect and reuse optimizations.
  • WP486: Deep Learning with INT8 Optimization on Xilinx Devices, Xilinx Inc.
  • any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others.
  • the term comprising, when used in the claims should not be interpreted as being limitative to the means or elements or steps listed thereafter.
  • the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B.
  • Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.
  • the term “exemplary” is used in the sense of providing examples, as opposed to indicating quality. That is, an “exemplary embodiment” is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.
  • Coupled when used in the claims, should not be interpreted as being limited to direct connections only.
  • the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other.
  • the scope of the expression a device A coupled to a device B should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means.
  • Coupled may mean that two or more elements are either in direct physical or electrical contact, or that two or more elements are not in direct contact with each other but yet still co-operate or interact with each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Quantization is a key optimization strategy for improving the performance of floating-point deep neural network (DNN) accelerators. FPGA-based accelerators usually fall back on fine-grained resources such as look-up tables (LUTs), because the digital signal processing (DSP) blocks available on FPGAs are not used efficiently when applied to low-precision computations. This problem is addressed for the most significant computations in embedded DNN accelerators, namely the standard, depth-wise, and point-wise convolution layers, through three modifications to the Xilinx DSP48E2 DSP blocks. First, the invention provides a flexible-precision, run-time decomposable multiplier architecture for CNN implementations. Second, a significant upgrade to the DSP-DSP interconnect is proposed, providing a semi-2D low-precision chaining capability which supports our low-precision multiplier. This allows a 1D DSP column to be operated in a semi-2D mesh arrangement, reducing data read access energy by avoiding off-DSP interconnections when streaming data. The invention also provides for data reuse via a register file which can also be configured as a FIFO.
PCT/AU2020/050395 2019-04-26 2020-04-24 Improved hardware primitive for implementations of deep neural networks WO2020215124A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2019901416A AU2019901416A0 (en) 2019-04-26 An improved hardware primitive for implementations of Deep Neural Networks
AU2019901416 2019-04-26

Publications (1)

Publication Number Publication Date
WO2020215124A1 (fr)

Family

ID=72940554

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2020/050395 WO2020215124A1 (fr) 2019-04-26 2020-04-24 Improved hardware primitive for implementations of deep neural networks

Country Status (1)

Country Link
WO (1) WO2020215124A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734020A (zh) * 2020-12-28 2021-04-30 中国电子科技集团公司第十五研究所 Convolution multiply-accumulate hardware acceleration device, system and method for convolutional neural networks
CN112783531A (zh) * 2021-01-29 2021-05-11 湖北三江航天红峰控制有限公司 Method for upgrading a DSP program over Ethernet under an FPGA-plus-DSP architecture
CN113033794A (zh) * 2021-03-29 2021-06-25 重庆大学 Lightweight neural network hardware accelerator based on depthwise separable convolution
CN113568597A (zh) * 2021-07-15 2021-10-29 上海交通大学 DSP packed-word multiplication method and system for convolutional neural networks
CN113610222A (zh) * 2021-07-07 2021-11-05 绍兴埃瓦科技有限公司 Method, system and hardware device for computing neural network convolution operations
CN116882467A (zh) * 2023-09-01 2023-10-13 中国科学院长春光学精密机械与物理研究所 Edge-oriented multi-mode configurable neural network accelerator circuit structure

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US949539A (en) * 1909-09-22 1910-02-15 Oscar Katzenberger Vaginal syringe.
US20130135008A1 (en) * 2009-12-01 2013-05-30 Trustees Of Princeton University Method and system for a run-time reconfigurable computer architecture
US8468335B2 (en) * 2009-01-21 2013-06-18 Shanghai Xin Hao Micro Electronics Co. Ltd. Reconfigurable system having plurality of basic function units with each unit having a plurality of multiplexers and other logics for performing at least one of a logic operation or arithmetic operation
US8495122B2 (en) * 2003-12-29 2013-07-23 Xilinx, Inc. Programmable device with dynamic DSP architecture
US8583569B2 (en) * 2007-04-19 2013-11-12 Microsoft Corporation Field-programmable gate array based accelerator system
WO2017003887A1 (fr) * 2015-06-29 2017-01-05 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US949539A (en) * 1909-09-22 1910-02-15 Oscar Katzenberger Vaginal syringe.
US8495122B2 (en) * 2003-12-29 2013-07-23 Xilinx, Inc. Programmable device with dynamic DSP architecture
US8583569B2 (en) * 2007-04-19 2013-11-12 Microsoft Corporation Field-programmable gate array based accelerator system
US8468335B2 (en) * 2009-01-21 2013-06-18 Shanghai Xin Hao Micro Electronics Co. Ltd. Reconfigurable system having plurality of basic function units with each unit having a plurality of multiplexers and other logics for performing at least one of a logic operation or arithmetic operation
US20130135008A1 (en) * 2009-12-01 2013-05-30 Trustees Of Princeton University Method and system for a run-time reconfigurable computer architecture
WO2017003887A1 (fr) * 2015-06-29 2017-01-05 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LANGHAMMER, M. ET AL.: "High Density and Performance Multiplication for FPGA", IEEE 25TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 25 June 2018 (2018-06-25), pages 5 - 12, XP033400124, DOI: 10.1109/ARITH.2018.8464695 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734020A (zh) * 2020-12-28 2021-04-30 中国电子科技集团公司第十五研究所 Convolution multiply-accumulate hardware acceleration device, system and method for convolutional neural networks
CN112783531A (zh) * 2021-01-29 2021-05-11 湖北三江航天红峰控制有限公司 Method for upgrading a DSP program over Ethernet under an FPGA-plus-DSP architecture
CN113033794A (zh) * 2021-03-29 2021-06-25 重庆大学 Lightweight neural network hardware accelerator based on depthwise separable convolution
CN113033794B (zh) * 2021-03-29 2023-02-28 重庆大学 Lightweight neural network hardware accelerator based on depthwise separable convolution
CN113610222A (zh) * 2021-07-07 2021-11-05 绍兴埃瓦科技有限公司 Method, system and hardware device for computing neural network convolution operations
CN113610222B (zh) * 2021-07-07 2024-02-27 绍兴埃瓦科技有限公司 Method, system and hardware device for computing neural network convolution operations
CN113568597A (zh) * 2021-07-15 2021-10-29 上海交通大学 DSP packed-word multiplication method and system for convolutional neural networks
CN116882467A (zh) * 2023-09-01 2023-10-13 中国科学院长春光学精密机械与物理研究所 Edge-oriented multi-mode configurable neural network accelerator circuit structure
CN116882467B (zh) * 2023-09-01 2023-11-21 中国科学院长春光学精密机械与物理研究所 Edge-oriented multi-mode configurable neural network accelerator circuit structure

Similar Documents

Publication Publication Date Title
WO2020215124A1 (fr) Improved hardware primitive for implementations of deep neural networks
Rasoulinezhad et al. PIR-DSP: An FPGA DSP block architecture for multi-precision deep neural networks
Boutros et al. Embracing diversity: Enhanced DSP blocks for low-precision deep learning on FPGAs
Ma et al. Multiplier policies for digital signal processing
US7167890B2 (en) Multiplier-based processor-in-memory architectures for image and graphics processing
JP2024045315A (ja) 再構成可能プロセッサ回路アーキテクチャ
Jaberipur et al. Improving the speed of parallel decimal multiplication
Abdelgawad et al. High speed and area-efficient multiply accumulate (MAC) unit for digital signal prossing applications
Farrukh et al. Power efficient tiny yolo cnn using reduced hardware resources based on booth multiplier and wallace tree adders
Tu et al. Power-efficient pipelined reconfigurable fixed-width Baugh-Wooley multipliers
Perri et al. A high-performance fully reconfigurable FPGA-based 2D convolution processor
Irmak et al. Increasing flexibility of FPGA-based CNN accelerators with dynamic partial reconfiguration
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
Lu et al. ETA: An efficient training accelerator for DNNs based on hardware-algorithm co-optimization
CN113055060B Coarse-grained reconfigurable architecture system for massive MIMO signal detection
Krishna et al. Design of wallace tree multiplier using compressors
Zolfagharinejad et al. Posit process element for using in energy-efficient DNN accelerators
Tsai et al. An on-chip fully connected neural network training hardware accelerator based on brain float point and sparsity awareness
Kuang et al. Energy-efficient multiple-precision floating-point multiplier for embedded applications
US5935202A (en) Compressor circuit in a data processor and method therefor
Haghi et al. O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices
Shao et al. An FPGA-based reconfigurable accelerator for low-bit DNN training
Jadhav et al. A novel high speed FPGA architecture for FIR filter design
Andrews A systolic SBNR adaptive signal processor
Singh et al. Modified booth multiplier with carry select adder using 3-stage pipelining technique

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20794639

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20794639

Country of ref document: EP

Kind code of ref document: A1