CN115080503A - Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping - Google Patents

Info

Publication number
CN115080503A
Authority
CN
China
Prior art keywords
processing unit
reconfigurable
fft
reconfigurable processing
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210894357.3A
Other languages
Chinese (zh)
Inventor
徐安林
张强
刘念
梁小虎
郝万宏
陈昊
杨欢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
63921 Troops of PLA
Original Assignee
63921 Troops of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 63921 Troops of PLA
Priority to CN202210894357.3A
Publication of CN115080503A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871 Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm

Abstract

The invention relates to a systolic array reconfigurable processor for FFT (fast Fourier transform) base module mapping, comprising a reconfigurable processing unit array, a shared memory, a main controller and an on-chip memory. The reconfigurable processing unit array performs the FFT operation and comprises m × n reconfigurable processing units. The main controller parses the configuration packet and writes configuration information into a configuration memory in each reconfigurable processing unit; the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow. Each reconfigurable processing unit, and the interconnection among the units, can be configured independently, so the array can be dynamically divided into sub-arrays for algorithm-level parallel processing to achieve acceleration. The shared memory consists of a plurality of groups of memories and has two main functions: it exchanges data with the on-chip memory, and it stores the intermediate data generated by each stage of the FFT operation. The on-chip memory stores programs, configuration information and data.

Description

Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping
Technical Field
The invention relates to the field of computer systems, and in particular to a systolic array reconfigurable processor with a multi-level storage structure for FFT (fast Fourier transform) base module mapping.
Background
With the rapid development of information technology, the demand for signal processing capability in computation-intensive fields such as computing, communications and consumer electronics keeps increasing. The Fast Fourier Transform (FFT) is widely used as an important means of analyzing and processing digital signals. However, the FFT algorithm is computationally expensive and time-consuming. In fields such as scientific computing and image processing in particular, fixed-point data cannot meet the precision requirement and a floating-point format is required, so the large number of floating-point complex multiplications imposes a heavy computational burden. In the era of the Internet of Everything, computational efficiency is one of the important criteria for measuring system performance, and insufficient computational efficiency forces the system design to compromise on precision, real-time performance and other aspects. Moreover, new application scenarios and requirements keep emerging, the number of FFT points they require differs, and this places higher demands on system flexibility. Realizing an FFT accelerator that is both computationally efficient and flexible is therefore of great significance.
The existing FFT acceleration methods are mainly divided into two categories:
(1) Methods based on software optimization
Software-optimization methods are generally implemented on general-purpose platforms such as CPUs and GPUs and rely on a deep understanding of the pipeline mechanism and memory architecture of the target platform. Although such methods have been highly optimized on their target platforms, they are limited by the platforms' inherent memory access patterns, and their computational efficiency remains low.
(2) Methods based on dedicated hardware design
Hardware-based methods are typically implemented on FPGAs or as ASICs. Because the storage architecture can be designed specifically for the task, hardware-based methods can achieve higher performance. Owing to its parallelism, the FPGA was at first considered the most promising solution, but its high energy consumption cannot meet the requirements of power-sensitive applications. ASIC-based schemes offer high area efficiency and energy efficiency, but because the circuit function is fixed they support only a single application, lack flexibility, have a high design cost, and cannot keep pace with the iteration speed of emerging applications.
In summary, none of the above solutions can simultaneously satisfy the requirements of computational efficiency, area and energy efficiency, real-time performance, and flexibility.
Disclosure of Invention
To solve this problem, the invention provides a systolic array reconfigurable processor with a multi-level storage structure for FFT base module mapping, and uses a dynamically reconfigurable processor architecture, the coarse-grained reconfigurable architecture (CGRA), to accelerate the FFT. The CGRA tool chain uses a high-level language (such as C or C++), which shortens the development cycle. The reconfigurable units give the CGRA multiple levels of flexibility and parallelism. In addition, the CGRA is superior to the fine-grained FPGA in both energy efficiency and area efficiency.
The technical scheme of the invention is as follows: a systolic array reconfigurable processor for FFT base module mapping, comprising:
a reconfigurable processing unit array, a main controller, a shared memory and an on-chip memory;
the reconfigurable processing unit array is responsible for the FFT operation and comprises m × n reconfigurable processing units, where m is the number of rows and n is the number of columns;
the main controller parses the configuration packet and writes configuration information into a configuration memory in each reconfigurable processing unit; the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow; each reconfigurable processing unit and the interconnection among the reconfigurable processing units can be configured independently; and the reconfigurable processing unit array can be dynamically divided into sub-arrays for algorithm-level parallel processing to achieve acceleration;
the shared memory comprises a plurality of groups of memories and is used for exchanging data with the on-chip memory and for storing the intermediate data generated by each stage of the FFT operation;
the on-chip memory includes global and local registers and is used for storing programs, configuration information and data.
In another aspect, the method by which the systolic array processor for FFT base module mapping executes the operation comprises the following steps:
first, the main controller moves the original data from the on-chip memory to the shared memory; after the data is ready, the main controller parses the configuration words and writes the configuration information of each reconfigurable unit into the corresponding local register; after all data and configuration information are ready, a timer is initialized and the reconfigurable processing unit array is started;
second, the reconfigurable processing unit array reads the configuration information and determines the number of iterations; some of the reconfigurable processing units read the original data from the shared memory, each reconfigurable processing unit reads its configuration information and executes the specified operation, and one iteration of the array is finished once all reconfigurable processing units have completed their operations; execution continues until all iterations are completed; the timer is then stopped and the number of clock cycles is recorded; during the FFT operation, the intermediate data generated by each stage is stored in the shared memory;
third, some of the reconfigurable processing units write the FFT result into the shared memory, from which it is then written into the on-chip memory.
Beneficial effects:
the invention provides a systolic array reconfigurable processor with a multi-level storage structure for FFT (fast Fourier transform) base module mapping, which effectively improves the computational efficiency of FFT operations in floating-point format and, in particular, meets the requirements of applications that need high precision and strong real-time performance; simply by increasing the capacity of the shared memory, FFT operations with a larger number of points can be processed without changing the other hardware modules, so the design is highly scalable.
Drawings
FIG. 1 is a block diagram of a reconfigurable processor architecture according to the present invention;
FIG. 2 shows the mapping module of the radix-4 computation kernel;
FIG. 3 is a diagram of the sub-array-based multi-point FFT mapping.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by a person skilled in the art based on the embodiments of the present invention belong to the protection scope of the present invention without creative efforts.
According to an embodiment of the invention, a systolic array reconfigurable processor with a multi-level storage structure for FFT base module mapping is provided. The invention is optimized for the characteristics of the FFT algorithm and for memory access bandwidth, and as a whole provides an FFT hardware acceleration scheme with high computational efficiency and strong scalability. The reconfigurable processor comprises main modules such as the reconfigurable processing unit array, the shared memory, the main controller and the on-chip memory, as shown in FIG. 1.
The reconfigurable processing unit array is responsible for the FFT operation and comprises m × n reconfigurable processing units.
The main controller parses the configuration packet and writes the configuration information into a configuration memory in each reconfigurable processing unit, and the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow. Each reconfigurable processing unit and the interconnection among the reconfigurable processing units can be configured independently, so the reconfigurable processing unit array can be dynamically divided into sub-arrays that perform algorithm-level parallel processing to achieve acceleration.
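The dynamic division of the array into sub-arrays can be pictured with a minimal Python sketch. The function name `partition_array` and the rectangular tiling are assumptions made for illustration; the text does not specify the shape of the sub-arrays or how they are chosen.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of one reconfigurable processing unit


def partition_array(m: int, n: int, sub_m: int, sub_n: int) -> List[List[Coord]]:
    """Tile an m x n unit array with sub_m x sub_n sub-arrays (hypothetical scheme)."""
    if m % sub_m or n % sub_n:
        raise ValueError("sub-array shape must divide the array shape")
    subarrays = []
    for r0 in range(0, m, sub_m):
        for c0 in range(0, n, sub_n):
            # Collect the coordinates of every unit inside this tile.
            subarrays.append([(r, c)
                              for r in range(r0, r0 + sub_m)
                              for c in range(c0, c0 + sub_n)])
    return subarrays


# An 8 x 8 array split into four 4 x 4 sub-arrays; each sub-array could
# then be configured independently and run in parallel with the others.
subs = partition_array(8, 8, 4, 4)
```

Each returned list names the units that would receive one common block of configuration information.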
According to an embodiment of the invention, during the design of the system architecture, the processing architecture of the reconfigurable processor is dynamically reorganized in real time according to the requirements of large-point FFT operations: independent processing units are configured into a systolic array through configuration information, and the ways in which the reconfigurable processing unit arrays are cut and spliced are defined in software, forming a variety of systolic array architectures suited to different algorithms. Here, a large number of points refers to, for example, FFTs of 128K and 256K points. For example:
1. When a 256K-point FFT is folded into two dimensions, a matrix transposition occurs during the FFT calculation; to maximize FFT computational efficiency, ping-pong buffering is required so that memory access bandwidth keeps pace with the one-dimensional FFT calculations.
2. Because the memory accesses of the FFT butterfly computation are non-contiguous, the shared memory serving the PE array is optimized specifically for this pattern, including the number of banks and the bank bit width.
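The two-dimensional folding mentioned in item 1 can be sketched as follows, assuming the standard decomposition of an N-point DFT with N = n1 × n2 into row transforms, a twiddle multiplication, a transposition and column transforms. The function names are hypothetical, and a naive DFT stands in for the one-dimensional FFT kernel; this is not taken from the patent, which does not disclose its folding in detail.

```python
import cmath
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(N^2) DFT, used here as the 1-D kernel and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_2d_folded(x: List[complex], n1: int, n2: int) -> List[complex]:
    """Compute an (n1*n2)-point DFT by folding the input into an n1 x n2 matrix."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Row r holds x[r], x[r + n1], x[r + 2*n1], ...; transform each row.
    rows = [dft([x[r + n1 * c] for c in range(n2)]) for r in range(n1)]
    # Multiply element (r, k2) by the twiddle factor W_N^(r * k2).
    for r in range(n1):
        for k2 in range(n2):
            rows[r][k2] *= cmath.exp(-2j * cmath.pi * r * k2 / n)
    # Matrix transposition, then a second set of 1-D transforms.
    cols = [dft([rows[r][k2] for r in range(n1)]) for k2 in range(n2)]
    # Output index k = n2 * k1 + k2.
    out = [0j] * n
    for k2 in range(n2):
        for k1 in range(n1):
            out[n2 * k1 + k2] = cols[k2][k1]
    return out
```

Under this scheme a 256K-point transform could, for example, be folded as 512 × 512, so that only 512-point one-dimensional transforms are ever mapped onto the array; the transposition step is where the non-contiguous memory traffic arises.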
The shared memory consists of a plurality of groups of memories and has two main functions: it exchanges data with the on-chip memory, and it stores the intermediate data generated by each stage of the FFT operation. By increasing the capacity of the shared memory, the accelerator can process FFT operations with a larger number of points, which facilitates later expansion.
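Why the number of banks and the address-to-bank mapping matter can be shown with a small sketch. It assumes four banks and in-place radix-4 butterflies, and compares plain low-order interleaving with a digit-sum mapping, a well-known conflict-free scheme that is swapped in here for illustration; the patent does not state which mapping it uses, and all function names are hypothetical.

```python
def bank_interleaved(addr: int) -> int:
    """Plain low-order interleaving across 4 banks."""
    return addr % 4


def bank_digit_sum(addr: int) -> int:
    """Bank = sum of the base-4 digits of the address, modulo 4."""
    total = 0
    while addr:
        total += addr % 4
        addr //= 4
    return total % 4


def butterfly_addresses(n: int, stage: int):
    """Yield the 4 operand addresses of each radix-4 butterfly (stage 0 has stride 1)."""
    stride = 4 ** stage
    for base in range(n):
        if (base // stride) % 4 == 0:
            yield tuple(base + j * stride for j in range(4))


def count_conflicts(n: int, stage: int, bank_fn) -> int:
    """Count butterflies whose four operands do not fall into four distinct banks."""
    return sum(1 for group in butterfly_addresses(n, stage)
               if len({bank_fn(a) for a in group}) < 4)
```

With 64 points, plain interleaving is conflict-free only in the stride-1 stage, whereas the digit-sum mapping keeps the four operands of every butterfly in distinct banks in all three stages, because the operands differ in exactly one base-4 digit.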
According to an embodiment of the invention, a hierarchical data storage system is designed to improve data access efficiency. The architecture involves three levels, namely the system, the reconfigurable processing unit array and the reconfigurable processing unit, and the physical units that provide data access at these levels are the shared memory, the global registers and the local registers, respectively. The global registers mainly store data and parameters intended for multiple reconfigurable processing units. The local registers mainly store intermediate data inside a reconfigurable processing unit and can be accessed only by that unit.
Thus, by analyzing the characteristics of the FFT, the invention designs a three-level storage structure that stores input and output data, intermediate data and so on at separate levels and, together with the hardware architecture, allows the FFT algorithm to be completed quickly.
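A rough software model of the three storage levels is sketched below. The class and field names are hypothetical, and a radix-2 butterfly is used only because it is the smallest operation that touches all three levels; the actual register organization is not disclosed in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessingUnit:
    # Local registers: intermediate data private to this unit.
    local_regs: Dict[str, complex] = field(default_factory=dict)


@dataclass
class UnitArray:
    units: List[ProcessingUnit]
    # Global registers: data and parameters visible to every unit in the array.
    global_regs: Dict[str, complex] = field(default_factory=dict)


@dataclass
class Processor:
    array: UnitArray
    # Shared memory: input/output data and per-stage intermediate results.
    shared_memory: List[complex] = field(default_factory=list)


def radix2_butterfly(proc: Processor, unit_index: int, i: int, j: int) -> None:
    """One butterfly on a single unit, touching all three storage levels."""
    unit = proc.array.units[unit_index]
    w = proc.array.global_regs["twiddle"]                 # array level
    a, b = proc.shared_memory[i], proc.shared_memory[j]   # system level
    unit.local_regs["product"] = w * b                    # unit level
    proc.shared_memory[i] = a + unit.local_regs["product"]
    proc.shared_memory[j] = a - unit.local_regs["product"]


proc = Processor(UnitArray([ProcessingUnit()], {"twiddle": 1 + 0j}), [1 + 0j, 2 + 0j])
radix2_butterfly(proc, 0, 0, 1)
```

The twiddle factor sits in a global register because several units may need it, while the product never leaves the unit that computed it.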
The main controller is responsible for controlling the operation of the whole system, including controlling the configuration and data of the reconfigurable processing unit array, data movement between the shared memory and the on-chip memory, and the like. The on-chip memory is used for storing programs, configuration information and data.
The invention provides an FFT mapping mechanism. Given the characteristics of the FFT algorithm, the FFT is usually computed with radix 2 or radix 4. According to the characteristics of the architecture, the invention designs a modular FFT mapping in which the processing unit array is divided into a plurality of sub-arrays to achieve algorithm-level parallel processing. A number of radix-2 or radix-4 sub-modules are allocated according to the number of FFT points; the implementation of the radix-4 sub-module is shown in FIG. 2. FFT mappings for different numbers of points are then obtained by splicing these base modules together. FIG. 3 illustrates the FFT mapping for multiple point counts. The mapping results are as follows:
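One way to picture how base modules are chosen for a given point count is the sketch below. It assumes a power-of-two length and a policy of using as many radix-4 stages as possible with at most one radix-2 stage; this policy and the name `plan_stages` are illustrative assumptions, not the allocation rule of the patent.

```python
from typing import List


def plan_stages(num_points: int) -> List[int]:
    """Return the radix of each stage for a power-of-two FFT length."""
    if num_points < 2 or num_points & (num_points - 1):
        raise ValueError("num_points must be a power of two")
    log2n = num_points.bit_length() - 1
    stages = [4] * (log2n // 2)   # each radix-4 stage consumes two bits
    if log2n % 2:
        stages.append(2)          # one radix-2 stage absorbs an odd bit
    return stages


plan_1k = plan_stages(1024)          # five radix-4 stages
plan_128k = plan_stages(128 * 1024)  # eight radix-4 stages plus one radix-2 stage
plan_256k = plan_stages(256 * 1024)  # nine radix-4 stages
```

Under this policy, a 128K-point transform needs a mixed plan, while 1K and 256K points are pure radix-4.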
TABLE 1 Performance of the proposed FFT architecture for different numbers of points
[Table 1 appears as an image in the original publication; its contents are not recoverable here.]
The simulation data in Table 1 show that the proposed architecture flexibly supports FFT operations with a larger number of points and is highly scalable; FFTs of 1K to 256K points can be realized. According to one embodiment of the invention, the typical application requirements of the target field are first analyzed to determine the range of FFT point counts, and the capacity of the shared memory is then determined by jointly considering factors such as area and power consumption. With the hardware architecture fixed, an N-point FFT is taken as an example to illustrate the processing steps, where N is an integer power of 4 and a radix-4 FFT algorithm is used, so that $\log_4 N$ stages of FFT operation are performed in total.
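For reference, a textbook radix-4 decimation-in-time FFT for N equal to a power of 4 can be sketched as below. It is a plain recursive software formulation, not the hardware mapping of FIG. 2, and the function names are hypothetical.

```python
import cmath
from typing import List, Sequence


def fft_radix4(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of 4")
    quarter = n // 4
    # Four sub-transforms over the samples with index 4m + r.
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for q in range(quarter):
        # Twiddle multiplication W_N^(r*q), then a 4-point butterfly.
        t = [subs[r][q] * cmath.exp(-2j * cmath.pi * r * q / n) for r in range(4)]
        for s in range(4):
            out[q + s * quarter] = sum(t[r] * (-1j) ** (r * s) for r in range(4))
    return out


def num_stages(n: int) -> int:
    """Number of radix-4 stages, i.e. log base 4 of n."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages
```

The recursion depth equals the stage count, so a 256K-point transform (4 to the 9th power) passes through `num_stages(256 * 1024)` = 9 stages.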
First, the main controller moves the original data from the on-chip memory to the shared memory; after the data is ready, the main controller parses the configuration words and writes the configuration information of each reconfigurable unit into the corresponding configuration memory; after all data and configuration information are ready, a timer is initialized and the reconfigurable processing unit array is started.
Second, the reconfigurable processing unit array reads the configuration information and determines the number of iterations; some of the reconfigurable processing units read the original data from the shared memory, each reconfigurable processing unit reads its configuration information and executes the specified operation, and one iteration of the array is finished once all reconfigurable processing units have completed their operations; execution continues until all iterations are completed; the timer is then stopped and the number of clock cycles is recorded; during the FFT operation, the intermediate data generated by each stage is stored in the shared memory.
The second step of this embodiment provides the following advantages:
1. The reconfigurable processing units are mapped onto, and fused with, the basic butterfly units of the FFT. The configuration information of the reconfigurable processing units is organized as multiple levels of iteration, which compresses the large amount of similar configuration information and reduces the storage it requires. Configuration information is executed iteratively from the top down: first the iteration over the configuration of the whole architecture, then the iteration over the configuration of the individual arrays, and finally the iteration over the configuration of each processing unit.
2. During execution, the configuration information for each iteration is preloaded according to how frequently each operator of the FFT algorithm occurs, which speeds up the operation of the whole hardware structure.
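The idea of folding repeated configuration into iterations can be illustrated with a simple run-length sketch. The encoding, the operation names (`LOAD`, `BFLY4`, `STORE`) and the function names are all hypothetical; the patent does not disclose its configuration format.

```python
from itertools import groupby
from typing import List, Tuple


def compress_config(words: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive identical configuration words into (word, repeat) pairs."""
    return [(word, sum(1 for _ in run)) for word, run in groupby(words)]


def expand_config(runs: List[Tuple[str, int]]) -> List[str]:
    """Replay the compressed form as the flat sequence the units would execute."""
    return [word for word, count in runs for _ in range(count)]


# A unit that loads once, runs five identical butterfly iterations, then stores.
flat = ["LOAD"] + ["BFLY4"] * 5 + ["STORE"]
packed = compress_config(flat)
```

Seven configuration words shrink to three entries here; the same idea could be nested so that one entry repeats a whole block of per-unit configurations at the array or architecture level.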
Third, some of the reconfigurable processing units write the FFT result into the shared memory, from which it is then written into the on-chip memory.
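The three steps can be summarized as a small control-flow sketch. It is a purely behavioral model under stated assumptions: `run_job` is a hypothetical name, each configuration entry is modeled as a Python callable applied to every element, and the iteration counter is only a stand-in for the hardware clock-cycle timer.

```python
from typing import Callable, List, Sequence, Tuple


def run_job(on_chip_data: Sequence[complex],
            configs: Sequence[Callable[[complex], complex]]
            ) -> Tuple[List[complex], int]:
    """Behavioral model of the three processing steps."""
    # Step 1: move data from on-chip memory to shared memory and load the
    # configuration, then initialize the timer.
    shared = list(on_chip_data)
    config_memory = list(configs)
    elapsed = 0
    # Step 2: one array iteration per configuration entry; intermediate data
    # stays in shared memory between iterations.
    for operation in config_memory:
        shared = [operation(value) for value in shared]
        elapsed += 1
    # Step 3: write the result back from shared memory to on-chip memory.
    on_chip_result = list(shared)
    return on_chip_result, elapsed


result, iterations = run_job([1, 2, 3], [lambda v: v * 2, lambda v: v + 1])
```

In the real device each iteration would be a full FFT stage executed by the configured array rather than an element-wise function.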
The dynamically reconfigurable processing platform was simulated and compared with conventional processor structures. The results show that the number of cycles required by the platform designed in the invention is significantly lower than that of the DSP and the FPGA.
TABLE 2 Performance comparison
[Table 2 appears as an image in the original publication; its contents are not recoverable here.]
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these specific embodiments. Various changes that do not depart from the spirit and scope of the invention as defined in the appended claims will be apparent to those skilled in the art, and all inventions and creations that make use of the inventive concept are protected.

Claims (4)

1. A systolic array reconfigurable processor for FFT base module mapping, comprising:
a reconfigurable processing unit array, a main controller, a shared memory and an on-chip memory;
the reconfigurable processing unit array is responsible for the FFT operation and comprises m × n reconfigurable processing units, where m is the number of rows and n is the number of columns;
the main controller parses the configuration packet and writes configuration information into a configuration memory in each reconfigurable processing unit; the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow; each reconfigurable processing unit and the interconnection among the reconfigurable processing units can be configured independently; and the reconfigurable processing unit array can be dynamically divided into sub-arrays for algorithm-level parallel processing to achieve acceleration;
the shared memory comprises a plurality of groups of memories and is used for exchanging data with the on-chip memory and for storing the intermediate data generated by each stage of the FFT operation;
the on-chip memory includes global and local registers and is used for storing programs, configuration information and data.
2. The systolic array reconfigurable processor for FFT base module mapping according to claim 1, further comprising:
a hierarchical data storage scheme is adopted; the architecture hierarchy involves three levels, namely the reconfigurable processor, the reconfigurable processing unit array and the reconfigurable processing unit, and the physical units that provide data access at these levels are the shared memory, the global registers and the local registers, respectively;
the global registers are mainly used for storing data and parameters intended for multiple reconfigurable processing units, and can be accessed by all reconfigurable processing units in the reconfigurable processing unit array;
the local registers are mainly used for storing intermediate data inside a reconfigurable processing unit and are accessed only by that reconfigurable processing unit;
the main controller is responsible for controlling the configuration and data of the reconfigurable processing unit array and the data movement between the shared memory and the on-chip memory.
3. The systolic array reconfigurable processor for FFT base module mapping according to claim 1, further comprising:
in response to the requirements of the FFT operation, the processing architecture of the reconfigurable processor is dynamically reorganized in real time; the independent reconfigurable processing units are configured into a systolic array through configuration information; and the ways in which the reconfigurable processing unit arrays are cut and spliced are defined in software, forming a variety of systolic array architectures suited to different algorithms.
4. The systolic array reconfigurable processor for FFT base module mapping according to claim 1, further comprising:
a modular FFT mapping is adopted, in which the reconfigurable processing unit array is divided into a plurality of sub-arrays to achieve algorithm-level parallel processing; a number of radix-2 or radix-4 sub-modules are allocated according to the number of FFT points; and FFT mappings for different numbers of points are obtained by splicing the plurality of base modules together during mapping.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210894357.3A CN115080503A (en) 2022-07-28 2022-07-28 Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210894357.3A CN115080503A (en) 2022-07-28 2022-07-28 Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping

Publications (1)

Publication Number Publication Date
CN115080503A (en) 2022-09-20

Family

ID=83241965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210894357.3A Pending CN115080503A (en) 2022-07-28 2022-07-28 Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping

Country Status (1)

Country Link
CN (1) CN115080503A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19722365A1 (en) * 1996-05-28 1997-12-04 Nat Semiconductor Corp Reconfigurable computer component with adaptive logic processor
US20080155003A1 (en) * 2006-12-21 2008-06-26 National Chiao Tung University Pipeline-based reconfigurable mixed-radix FFT processor
CN101694648A (en) * 2009-08-28 2010-04-14 曙光信息产业(北京)有限公司 Fourier transform processing method and device
CN102043761A (en) * 2011-01-04 2011-05-04 东南大学 Fourier transform implementation method based on reconfigurable technology
CN202217276U (en) * 2011-06-17 2012-05-09 江苏中科芯核电子科技有限公司 FFT device based on parallel processing
CN102831099A (en) * 2012-07-27 2012-12-19 西安空间无线电技术研究所 Implementation method of 3072-point FFT (Fast Fourier Transform) operation
CN103678255A (en) * 2013-12-16 2014-03-26 合肥优软信息技术有限公司 FFT efficient parallel achieving optimizing method based on Loongson number three processor
CN104679670A (en) * 2015-03-10 2015-06-03 东南大学 Shared data caching structure and management method for FFT (fast Fourier transform) and FIR (finite impulse response) algorithms
WO2017125023A1 (en) * 2016-01-19 2017-07-27 清华大学 Pipeline reconfigurable single-precision floating-point fft/ifft coprocessor
CN109977347A (en) * 2019-03-29 2019-07-05 南京大学 A kind of restructural fft processor for supporting multi-mode to configure
CN110765709A (en) * 2019-10-15 2020-02-07 天津大学 FPGA-based 2-2 fast Fourier transform hardware design method
CN114201725A (en) * 2021-12-14 2022-03-18 电子科技大学 Narrowband communication signal processing method based on multimode reconfigurable FFT


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冷金麟等: "《Visual FoxPro程序设计》", 31 January 2012, 上海交通大学出版社 *

Similar Documents

Publication Publication Date Title
US7640284B1 (en) Bit reversal methods for a parallel processor
Mittal A survey of accelerator architectures for 3D convolution neural networks
CN111915001B (en) Convolution calculation engine, artificial intelligent chip and data processing method
US20210150362A1 (en) Neural network compression based on bank-balanced sparsity
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
KR20220051006A (en) Method of performing PIM (PROCESSING-IN-MEMORY) operation, and related memory device and system
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
Zhou et al. Addressing sparsity in deep neural networks
Nguyen et al. ShortcutFusion: From tensorflow to FPGA-based accelerator with a reuse-aware memory allocation for shortcut data
Lou et al. RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
JP2023534068A (en) Systems and methods for accelerating deep learning networks using sparsity
US11614945B2 (en) Apparatus and method of a scalable and reconfigurable fast fourier transform
Akin et al. FFTs with near-optimal memory access through block data layouts: Algorithm, architecture and design automation
Akkad et al. Embedded deep learning accelerators: A survey on recent advances
Asadikouhanjani et al. Enhancing the utilization of processing elements in spatial deep neural network accelerators
Arora et al. CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration
US20230117042A1 (en) Implementation of discrete fourier-related transforms in hardware
Mahale et al. Windconv: A fused datapath cnn accelerator for power-efficient edge devices
CN115080503A (en) Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping
US20220188613A1 (en) Sgcnax: a scalable graph convolutional neural network accelerator with workload balancing
Srinivasa et al. Trends and opportunities for SRAM based in-memory and near-memory computation
US20210241806A1 (en) Streaming access memory device, system and method
Choi et al. Energy-efficient and parameterized designs for fast Fourier transform on FPGAs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220920
