CN106095730A - An FFT floating-point optimization method based on ILP and DLP - Google Patents

An FFT floating-point optimization method based on ILP and DLP Download PDF

Info

Publication number
CN106095730A
Authority
CN
China
Prior art keywords
layer
macro
calculation
fft
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610473373.XA
Other languages
Chinese (zh)
Other versions
CN106095730B (en
Inventor
顾乃杰
任开新
叶鸿
周文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610473373.XA priority Critical patent/CN106095730B/en
Publication of CN106095730A publication Critical patent/CN106095730A/en
Application granted granted Critical
Publication of CN106095730B publication Critical patent/CN106095730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Abstract

The invention discloses an FFT floating-point optimization method based on ILP and DLP, characterized by the following steps: 1. determine the number of iteration layers and divide them into a three-layer structure; 2. complete the in-degree-layer computation using operations such as the bit-reverse load instruction; 3. after the in-degree-layer computation, classify the remaining intermediate-layer computation into odd-layer and even-layer cases, compute each case separately, and obtain the intermediate-layer results; 4. use the simulated inter-macro transfer operation to rearrange the intermediate-layer results and complete the out-degree-layer computation. The invention resolves the instruction-dependence and structural-hazard limitations present in the original algorithm and fully exploits the load capacity of the arithmetic units, thereby substantially raising the average utilization of the bottleneck functional units.

Description

An FFT floating-point optimization method based on ILP and DLP
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and relates specifically to methods for efficient floating-point FFT computation on hardware platforms based on ILP and DLP.
Background technology
The Discrete Fourier Transform (DFT) is widely used in modern signal processing systems, for example in radar signal processing, SAR image processing, sonar computation, video and image algorithms, spectrum analysis and speech recognition. The Fourier transform is a typical computation-intensive and memory-access-intensive application; the computational complexity of an N-point DFT, for instance, is O(N^2). In 1965 Cooley and Tukey proposed the Fast Fourier Transform (FFT), a computational method that greatly reduces the amount of work, lowering the complexity from O(N^2) to O(N log2 N). Signal processing applications generally have stringent real-time requirements, so the higher the FFT computational efficiency, the better the real-time performance of the signal processing.
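As a point of reference, the O(N log2 N) structure mentioned above can be sketched in Python. This is a minimal recursive radix-2 Cooley-Tukey FFT for illustration only; the patent itself targets an iterative, register-level DSP implementation, and the function name here is our own:

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    Does O(N log2 N) work versus O(N^2) for the direct DFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]           # butterfly, upper output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly, lower output
    return out
```

Comparing the result against the direct O(N^2) DFT on a small input confirms the two agree while the recursion does far less work at large N.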
Instruction-Level Parallelism (ILP) means that the processor issues multiple instructions for parallel execution within the same instruction cycle. Data-Level Parallelism (DLP) refers to an architecture that performs parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP typically employ VLIW and SIMD technology so that large-scale, efficient computation can be carried out.
Because hardware platforms that combine ILP and DLP technology are relatively complex, research on the Fast Fourier Transform targeting them has not yet been carried out in depth.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on ILP and DLP that resolves the instruction-dependence and structural-hazard limitations of the original algorithm and fully exploits the load capacity of the arithmetic units, thereby substantially raising the average utilization of the bottleneck functional units.
To solve the above technical problem, the present invention adopts the following technical solution:
An FFT floating-point optimization method based on ILP and DLP according to the present invention is characterized by the following steps:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M, where M = 2^N, M and N are positive integers, and N >= 6. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers.
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results from the temporary buffer together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the temporary buffer.
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
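The butterfly operation that the steps above repeatedly apply to register contents can be sketched as a scalar Python model. The function names and the flat-array layout are our own illustration, not the patent's register-level code:

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: combines two partial
    results a, b with twiddle factor w into two outputs."""
    t = w * b
    return a + t, a - t

def layer_pass(data, layer):
    """One iteration layer of an in-place radix-2 FFT over a
    bit-reversed input. layer is 1-based; a length-2**N input
    needs layers 1..N."""
    n = len(data)
    span = 1 << layer          # butterfly group size at this layer
    half = span >> 1
    for start in range(0, n, span):
        for k in range(half):
            w = cmath.exp(-2j * cmath.pi * k / span)
            a, b = data[start + k], data[start + k + half]
            data[start + k], data[start + k + half] = butterfly(a, b, w)
```

Running layers 1 through N over a bit-reversed input yields the DFT; the patent's contribution is how these layer passes are grouped (four in-degree layers, fused intermediate layers, two out-degree layers) and scheduled on the DSP.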
The FFT floating-point optimization method based on ILP and DLP of the present invention is further characterized in that
the simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer. Treat K consecutive instruction rows as one K x K simulated inter-macro transfer group.
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K.
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a new floating-point FFT optimization method adapted to the characteristics of ILP and DLP hardware platforms. By restructuring the radix-2 Cooley-Tukey algorithm, compressing the number of computation layers, and employing techniques such as the simulated inter-macro transfer operation, memory ping-pong operation and cache operations, it deploys the Fast Fourier Transform efficiently on hardware platforms based on ILP and DLP. This effectively reduces clock-cycle overhead and thus improves the efficiency of the hardware platform for FFT computation.
2. Because the present invention uses a three-layer computation structure, the original multi-layer computation is reduced to three layers. This reduces the register refreshes caused by scheduling between inner and outer loops, and the clock overhead caused by pipeline flushes.
3. Because the present invention employs the memory ping-pong operation, reads and writes that were originally served by a single memory block are split across two ping-pong memory blocks. This avoids the clock overhead of reading and writing the same memory at the same time, improving computational efficiency.
4. The simulated inter-macro transfer operation of the present invention uses instruction-level parallelism to move data among the execution clusters created by data-level parallelism, placing each datum in the cluster that needs it for the subsequent computation. This operation effectively avoids memory bank conflicts and improves the efficiency of data adjustment in each execution macro.
5. The present invention further exploits the symmetry of the butterfly coefficients, reducing the number of butterfly-coefficient prefetches during computation and thus the register usage. This eliminates nearly half of the twiddle factors, reducing both the memory footprint and the number of registers occupied by twiddle factors.
6. Experimental verification shows that for a 32-bit floating-point complex Fourier transform with 1024 inputs, the computation of the present method is compressed to 980 clock cycles, and the utilization of the bottleneck functional units in the three layers of the computation structure reaches 96.68%, 98.25% and 100% respectively.
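The reported 980-cycle result can be sanity-checked with a rough operation count. The cost assumptions here (a radix-2 butterfly takes 4 real multiplies and 6 real adds; all 4 x 8 multipliers and 4 x 8 ALUs of a BWDSP-like machine can issue every cycle) are ours, not the patent's:

```python
# Lower bounds on cycles for a 1024-point radix-2 complex FFT on a machine
# with 32 multipliers and 32 ALUs (4 execution macros x 8 units each).
M, N = 1024, 10                   # M = 2**N points, N iteration layers
butterflies = (M // 2) * N        # 512 butterflies per layer, 10 layers
muls = 4 * butterflies            # complex multiply: 4 real multiplies
adds = 6 * butterflies            # complex multiply + 2 complex adds: 6 real adds
mul_bound = muls // 32            # cycles if multipliers were the bottleneck
add_bound = adds // 32            # cycles if ALUs were the bottleneck
print(mul_bound, add_bound)       # 640 960: the ALUs dominate
print(round(add_bound / 980, 4))  # 0.9796: consistent with ~98% utilization
```

Under these assumptions the ALUs are the bottleneck at 960 ideal cycles, so 980 measured cycles would correspond to roughly 98% bottleneck utilization, in line with the figures quoted above.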
Accompanying drawing explanation
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of the simulated inter-macro transfer operation;
Fig. 3 shows the four-layer computation model used by the intermediate layers.
Detailed description of the invention
The purpose of the present invention is to propose an optimization method for floating-point FFT suited to ILP and DLP hardware platforms, so that high-performance optimization can be carried out on the hardware infrastructure they provide. The following embodiment discusses the optimization method using the BWDSP104x platform as an example, but the optimization techniques and methods of the present invention are not limited to the BWDSP104x platform; the optimization scheme of the present invention is applicable to any ILP and DLP hardware platform.
The BWDSP104x platform has 4 execution macros (x, y, z, t). Each macro contains 8 arithmetic logic units (ALU), 8 multipliers (MUL), 4 shifters (SHIFT), 1 special function unit, and a general-purpose register file of 128 registers. It has an 11-stage pipeline, and each instruction row can issue up to 16 word instructions in parallel.
In this embodiment, an FFT floating-point optimization method based on ILP and DLP is carried out as follows:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M. This embodiment is illustrated with an input vector of length 1024; other lengths can be implemented by a similar scheme. Here M = 2^N with M, N positive integers and N >= 6; in this case M = 1024 and N = 10. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers. Fig. 1 is the flowchart of this FFT computation: steps 1-4 in the figure depict the in-degree-layer computation, steps 5-7 the intermediate-layer computation, and steps 8-10 the out-degree-layer computation.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers. Table 1 shows, for a digital signal processor with 4 execution macros, the data stored in each register after the bit-reverse load. Owing to the instruction's addressing pattern, the data read into the different registers of the same execution macro are in bit-reversed order, while the data read into the same register across different execution macros are in sequential order. Table 2 lists the twiddle factors needed by the in-degree layers; as the table shows, they can all be replaced by just three values: cos(π/4), sin(π/8) and cos(π/8).
Table 1: Data read by the bit-reverse load (each number denotes the element's position in the input array)
x y z t
r7:6 0 1 2 3
r9:8 512 513 514 515
r11:10 256 257 258 259
r13:12 768 769 770 771
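The bit-reversed read pattern in Table 1 can be modelled in software as follows (an illustrative sketch; the hardware bit-reverse instruction computes this addressing directly, and the helper names are ours). For a 1024-point input the first reversed positions are 0, 512, 256, 768, matching the register column of the x macro above:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i, which is what
    bit-reversed addressing computes per element."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

def bit_reverse_load(x, bits):
    """Software model of the bit-reversed load of step 2:
    element k of the result comes from position bit_reverse(k)."""
    return [x[bit_reverse(i, bits)] for i in range(len(x))]
```

For an 8-point input this produces the familiar permutation 0, 4, 2, 6, 1, 5, 3, 7.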
Table 2: Twiddle factors for the first four layers
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer. The temporary buffer is opened so that it can perform memory ping-pong with the input vector space, allowing the computation to complete its reads and writes while keeping the pipeline full.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the other buffer. The read and store spaces in this step depend on the processing path: if this FFT computation has just passed through an odd-layer computation, the read space is the input vector space and the store space is the temporary buffer; otherwise the read space is the temporary buffer and the store space is the input vector space. This is the memory ping-pong process.
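The ping-pong exchange between the input vector space and the temporary buffer described above can be sketched as follows. This is an illustrative model only; on the real DSP the buffers are fixed memory blocks addressed directly, and the helper names are our assumptions:

```python
def run_layers_pingpong(data, passes):
    """Memory ping-pong: each pass reads from one buffer and writes to
    the other, so a read and a write never hit the same memory block.
    `passes` is a list of functions mapping an input list to an output
    list (standing in for the butterfly passes of the layers)."""
    src = list(data)             # buffer A (initially the input vector space)
    dst = [None] * len(data)     # buffer B (the temporary buffer)
    for p in passes:
        for i, v in enumerate(p(src)):
            dst[i] = v           # write only to dst while reading only src
        src, dst = dst, src      # swap roles instead of copying back
    return src
```

After an even number of passes the final results sit in the original buffer, after an odd number in the other, which is exactly the odd-layer/even-layer case distinction made in step 8.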
The computation model of the middle four layers is similar to that of the first four layers, except that 16 numbers are grouped into one data block and the computation then proceeds between the elements of each block, with the blocks as the unit. Fig. 3 sketches the computation model of the middle four layers: with data blocks as the unit, the data taking part in each butterfly computation still follow the usual order within a block. The dashed box in the upper-left corner of the figure gives a brief description of a data block.
The present invention further exploits the quarter symmetry of the intermediate-layer twiddle factors. Because the intermediate-layer computation of the present invention merges four layers of the original computation structure, the data dependences that existed between the original layers must be resolved by holding data in a large number of registers. This puts great pressure on register usage, especially when performing large-scale computation on a DSP based on ILP and DLP. The quarter symmetry of the twiddle factors can then be used: the twiddle factors of the second half of the current layer are obtained from those of the first half by a single complex multiplication. Formula (3), derived from formula (1), characterizes this quarter symmetry of the twiddle factors.
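The quarter symmetry can be written out concretely: with W_n^k = e^(-2*pi*j*k/n), one has W_n^(k+n/4) = -j * W_n^k, so a table holding only the first quarter of the twiddle factors suffices and each remaining factor costs one extra complex multiply. The patent's formulas (1) and (3) are not reproduced on this page, so the derivation below is our own sketch of the idea:

```python
import cmath

def twiddle(k, n):
    """Twiddle factor W_n^k = exp(-2*pi*j*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def twiddle_from_quarter(k, n, quarter):
    """Derive W_n^k for 0 <= k < n/2 from a table holding only the
    first quarter, using the symmetry W_n^(k + n/4) = -1j * W_n^k."""
    q = n // 4
    if k < q:
        return quarter[k]          # stored directly
    return -1j * quarter[k - q]    # one extra complex multiply
```

Storing n/4 instead of n/2 factors roughly halves both the twiddle-factor memory footprint and the registers they occupy, which is the effect claimed in beneficial effect 5.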
For the two twiddle factors required by the intermediate-layer computation in this embodiment, the core code realizing the butterfly computation is given as program segment 1 and program segment 2 respectively. In the program segments below, r11:10 and r13:12 hold the two groups of numbers needed by the butterfly computation; r53:52 holds the real and imaginary parts of the butterfly coefficient; and r15:14 serves as a temporary register holding the intermediate results of the complex multiplication.
Program segment 1:
Program segment 2:
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
The simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer. Treat K consecutive instruction rows as one K x K simulated inter-macro transfer group. This embodiment describes the process using a processor with 4 execution macros; Table 3 lists the 4 x 4 simulated inter-macro transfer group. As the flow in Fig. 2 shows, the simulated inter-macro transfer first builds such a transfer group.
Table 3: The simulated inter-macro transfer group
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 1 2 3
r7 4 5 6 7
r8 8 9 10 11
r9 12 13 14 15
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K. This step is the core of the simulated inter-macro transfer operation; as Fig. 2 shows, it is the key operation inside the two nested loops. Its kernel code is as follows, where the four execution macros are identified by x, y, z and t respectively:
1. xr11:10=zr7:6 | | zr7:6=xr11:10 | | yr13:12=tr9:8 | | tr9:8=yr13:12
2. xr9:8=yr7:6 | | yr7:6=xr9:8 | | zr13:12=tr11:10 | | tr11:10=zr13:12
3. xr13:12=tr7:6 | | tr7:6=xr13:12 | | yr11:10=zr9:8 | | zr9:8=yr11:10
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3. Table 4 gives the final rearranged result in this embodiment.
Table 4: The result after the simulated inter-macro transfer
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 4 8 12
r7 1 5 9 13
r8 2 6 10 14
r9 3 7 11 15
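The rearrangement from Table 3 to Table 4 is a 4 x 4 transpose. One schedule that realizes it in K-1 parallel exchange rounds, mirroring the three instruction rows of the kernel code above (the XOR pairing here is our choice for illustration and not necessarily the patent's exact register assignment), is:

```python
def macro_transpose(mat):
    """Transpose a K x K matrix in place (rows = register lines,
    columns = execution macros), K a power of two, using K-1 rounds
    of pairwise exchanges. In round s, element (r, r^s) is swapped
    with (r^s, r); within a round all swaps touch disjoint macro
    pairs, so each round could issue as one parallel instruction row."""
    k = len(mat)
    for s in range(1, k):            # rounds = instruction rows
        for r in range(k):
            c = r ^ s
            if r < c:                # visit each unordered pair once
                mat[r][c], mat[c][r] = mat[c][r], mat[r][c]
    return mat
```

Applied to the Table 3 layout (row r holds 4r..4r+3 across the macros), this yields exactly the Table 4 layout.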

Claims (2)

1. An FFT floating-point optimization method based on ILP and DLP, characterized by the following steps:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M, where M = 2^N, M and N are positive integers, and N >= 6. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers.
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results from the temporary buffer together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the temporary buffer.
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
2. The FFT floating-point optimization method based on ILP and DLP according to claim 1, characterized in that the simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer; treat K consecutive instruction rows as one K x K simulated inter-macro transfer group.
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K.
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3.
CN201610473373.XA 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) Active CN106095730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Publications (2)

Publication Number Publication Date
CN106095730A true CN106095730A (en) 2016-11-09
CN106095730B CN106095730B (en) 2018-10-23

Family

ID=57253425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610473373.XA Active CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Country Status (1)

Country Link
CN (1) CN106095730B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164752A1 (en) * 2004-08-13 2009-06-25 Clearspeed Technology Plc Processor memory system
CN103902506A * 2014-04-16 2014-07-02 Institute of Advanced Technology, University of Science and Technology of China FFTW3 optimization method based on Loongson 3B processor
CN105630737A * 2016-01-05 2016-06-01 Hefei Kangjie Information Technology Co., Ltd. Optimizing method of split-radix FFT (fast Fourier transform) algorithm based on ternary tree

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101347A * 2018-07-16 2018-12-28 Beijing Institute of Technology A pulse compression processing method for an FPGA heterogeneous computing platform based on OpenCL
CN109101347B * 2018-07-16 2021-07-20 Beijing Institute of Technology Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL
CN109783054A * 2018-12-20 2019-05-21 Institute of Computing Technology, Chinese Academy of Sciences A butterfly computation processing method and system for an RSFQ FFT processor

Also Published As

Publication number Publication date
CN106095730B (en) 2018-10-23

Similar Documents

Publication Publication Date Title
Cao et al. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity
CN108268423B (en) Microarchitecture implementing enhanced parallelism for sparse linear algebraic operations with write-to-read dependencies
CN107844322B (en) Apparatus and method for performing artificial neural network forward operations
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN112559051A (en) Deep learning implementation using systolic arrays and fusion operations
CN104025067B (en) With the processor for being instructed by vector conflict and being replaced the shared full connection interconnection of instruction
US20160283240A1 (en) Apparatuses and methods to accelerate vector multiplication
CN103955447B (en) FFT accelerator based on DSP chip
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
EP3451239A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN106933777B High-performance implementation method of the radix-2 one-dimensional FFT on the domestic Shenwei 26010 processor
CN108431770A (en) Hardware aspects associated data structures for accelerating set operation
CN103955446A (en) DSP-chip-based FFT computing method with variable length
CN110163333A (en) The parallel optimization method of convolutional neural networks
CN106095730A (en) A kind of FFT floating-point optimization method based on ILP and DLP
CN111401537A (en) Data processing method and device, computer equipment and storage medium
Li et al. Automatic FFT performance tuning on OpenCL GPUs
CN113741977B (en) Data operation method, data operation device and data processor
Cantó-Navarro et al. Floating-point accelerator for biometric recognition on FPGA embedded systems
Mermer et al. Efficient 2D FFT implementation on mediaprocessors
CN103902506A (en) FFTW3 optimization method based on loongson 3B processor
Lee et al. Large‐scale 3D fast Fourier transform computation on a GPU
JP3709291B2 (en) Fast complex Fourier transform method and apparatus
Saybasili et al. Highly parallel multi-dimensional fast Fourier transform on fine- and coarse-grained many-core approaches

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant