CN106095730B - An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) - Google Patents

An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Info

Publication number
CN106095730B
Authority
CN
China
Prior art keywords
macro
calculation
fft
layer
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610473373.XA
Other languages
Chinese (zh)
Other versions
CN106095730A (en)
Inventor
顾乃杰
任开新
叶鸿
周文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610473373.XA priority Critical patent/CN106095730B/en
Publication of CN106095730A publication Critical patent/CN106095730A/en
Application granted granted Critical
Publication of CN106095730B publication Critical patent/CN106095730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Abstract

The invention discloses an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by the following steps: 1. determine the number of iteration layers and divide them into a three-level structure; 2. complete the in-degree-layer computation using operations such as the bit-reversal instruction; 3. after completing the in-degree-layer computation, classify the upcoming middle-layer computation, handle the odd-layer and even-layer cases separately, and obtain the middle-layer results; 4. using a simulated macro-transmission operation, rearrange the middle-layer results and complete the out-degree-layer computation. The invention resolves the instruction-dependence and structural-hazard limitations present in existing algorithms and fully exploits the load efficiency of the functional units, thereby substantially increasing the average utilization of the bottleneck units.

Description

An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and specifically relates to a method for efficiently computing a floating-point FFT on hardware platforms based on ILP and DLP.
Background technology
The discrete Fourier transform (DFT) is widely used in modern signal-processing systems, for example in radar signal processing, SAR image processing, sonar computation, video and image algorithms, spectrum analysis, and speech recognition. Fourier-transform computation is a typical compute-intensive and memory-access-intensive application; for example, the computational complexity of an N-point DFT is O(N²). The fast Fourier transform (FFT) proposed by Cooley and Tukey in 1965 drastically reduces the amount of computation, lowering the complexity from O(N²) to O(N log₂ N). Signal-processing applications usually impose strict real-time requirements: the more efficient the FFT computation, the better the real-time behaviour of the signal processing.
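The complexity reduction can be seen in a minimal radix-2 decimation-in-time Cooley-Tukey recursion. This is a textbook sketch for illustration only, not the optimized method of the invention:

```python
import cmath

def fft(x):
    """Minimal radix-2 decimation-in-time Cooley-Tukey FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Each of the log₂ N recursion levels does O(N) butterfly work, giving the O(N log₂ N) total.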
Instruction-level parallelism (ILP) refers to a processor issuing multiple instructions for parallel execution within the same instruction cycle. Data-level parallelism (DLP) refers to architectures that perform parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP typically employ VLIW and SIMD techniques and can carry out large-scale, highly efficient computation.
Because hardware platforms combining ILP and DLP techniques are relatively complex, research on fast Fourier transforms targeting them has not been well developed.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), so as to resolve the instruction-dependence and structural-hazard limitations of existing algorithms and fully exploit the load efficiency of the functional units, thereby substantially increasing the average utilization of the bottleneck units.
To solve the above technical problem, the present invention adopts the following technical scheme:
The FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) of the present invention is characterized by the following steps:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers with N ≥ 6. The first four of the N iteration layers are defined as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers.
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers.
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space.
Step 4: assign N - 4 to n.
Step 5: if n is odd, go to step 6; otherwise, go to step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space.
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4 from the temporary space, perform the butterfly computation, and store the results, overwriting, into the temporary space.
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
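The control flow of steps 4 through 9 can be sketched as a layer schedule. This is an illustrative model, not part of the patent: it assumes each pass of step 8 covers the four layers N-n+1 through N-n+4, consistent with the decrement n := n - 4 and the four-layer middle model described later; `layer_schedule` is a hypothetical name.

```python
def layer_schedule(N):
    """Return the layer groups computed by the three-level structure:
    in-degree layers 1-4, middle layers in blocks (one single odd
    layer first when N-4 is odd, then blocks of four), and the two
    out-degree layers N-1 and N. Requires N >= 6."""
    groups = [("in-degree", [1, 2, 3, 4])]
    n = N - 4                                     # step 4
    if n % 2 == 1:                                # step 5
        groups.append(("middle", [N - n + 1]))    # step 6: one layer
        n -= 1                                    # step 7
    while n != 2:                                 # steps 8-9
        first = N - n + 1
        groups.append(("middle", list(range(first, first + 4))))
        n -= 4
    groups.append(("out-degree", [N - 1, N]))     # steps 10-11
    return groups
```

For N = 10 this yields in-degree layers 1-4, one middle block of layers 5-8, and out-degree layers 9-10, covering all layers exactly once.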
The FFT floating-point optimization method based on ILP and DLP of the present invention is further characterized in that the simulated macro-transmission operation in step 10 proceeds as follows:
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group.
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j; in this way the data of different instruction rows within the same execution macro are adjusted into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K.
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4.
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The invention proposes a new floating-point FFT optimization method adapted to the characteristics of ILP/DLP hardware platforms. By adjusting the radix-2 Cooley-Tukey algorithm structure and compressing its number of computation layers, and by using techniques such as the simulated macro-transmission operation, memory ping-pong, and cache operations, the fast Fourier transform is deployed efficiently on hardware platforms based on ILP and DLP. Clock overhead is effectively reduced, improving the platform's efficiency for FFT computation.
2. Because the invention uses a three-level computation structure, the originally multi-layer structure collapses to three levels, reducing the register refreshes caused by scheduling between inner and outer loops and the clock overhead of pipeline bubbles.
3. Because the invention employs memory ping-pong, reads and writes that originally targeted a single memory block are split across two ping-pong memory blocks, avoiding the clock overhead of simultaneous reads and writes to one memory and improving computational efficiency.
4. The simulated macro-transmission operation of the invention uses instruction-level parallelism to move data left in different execution clusters by data-level parallelism into the same execution cluster, guaranteeing the subsequent computation. The operation effectively avoids memory bank conflicts and improves the efficiency of data adjustment across the execution macros.
5. The invention further exploits the symmetry of the butterfly coefficients, reducing the number of prefetches of butterfly coefficients during computation and hence register usage. The operation removes nearly half of the twiddle factors, reducing both the memory space used and the number of registers occupied by twiddle factors.
6. Experiments verify that for its 1024-point, 32-bit floating-point complex Fourier transform, the method compresses the computation to 980 clock cycles; the utilization of the bottleneck functional units in the three levels of the structure reaches 96.68%, 98.25%, and 100%, respectively.
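The memory ping-pong of benefit 3 can be sketched as two buffers whose read and write roles swap every pass, so no buffer is read and written in the same pass. A minimal illustrative model; `iterate_with_pingpong` is a hypothetical name:

```python
def iterate_with_pingpong(data, passes):
    """Run a sequence of passes over `data` using two buffers in
    ping-pong fashion: each pass reads one buffer and writes the
    other, then the roles swap. `passes` is a list of functions
    mapping a list to a new list of the same length."""
    buf_a, buf_b = list(data), [None] * len(data)
    src, dst = buf_a, buf_b
    for p in passes:
        for i, v in enumerate(p(src)):
            dst[i] = v            # write only into dst
        src, dst = dst, src       # ping-pong: swap roles
    return src                    # src now holds the latest results
```

Because each pass has a dedicated read buffer and a dedicated write buffer, a pipelined implementation can overlap its loads and stores without a same-block read/write hazard.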
Description of the drawings
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the flowchart of the simulated macro-transmission operation;
Fig. 3 is the four-layer model used by the middle-layer computation of the invention.
Detailed description of the embodiments
The purpose of the present invention is to propose an optimization method for floating-point FFT suited to hardware platforms with instruction-level parallelism (ILP) and data-level parallelism (DLP), so that high-performance optimization can be carried out on the hardware infrastructure they provide. The following embodiment discusses the optimization method using the BWDSP104x platform as an example; however, the optimization techniques and methods of the present invention are not limited to the BWDSP104x platform. Any ILP/DLP hardware platform is suitable for the optimization scheme of the invention.
The BWDSP104x platform has 4 execution macros (x, y, z, t); each macro contains 8 arithmetic logic units (ALU), 8 multipliers (MUL), 4 shifters (SHIFT), 1 super arithmetic unit, and a general register file of 128 registers. It has an 11-stage pipeline, and each instruction row can issue up to 16 instruction words in parallel.
In this embodiment, the FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) proceeds as follows:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M. This embodiment is illustrated with an input vector of length 1024; other lengths can be handled by a similar scheme. Here M = 2^N; M and N are positive integers with N ≥ 6; in this case N = 10 and M = 1024. The first four of the N iteration layers are the in-degree layers, layers 5 through N-2 are the middle layers, and layers N-1 and N are the out-degree layers. Fig. 1 is the flowchart of this FFT computation: steps 1-4 in the figure depict the in-degree-layer computation, steps 5-7 the middle-layer computation, and steps 8-10 the out-degree-layer computation.
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers. Table 1 shows, for a digital signal processor with 4 execution macros, the data held in each register after the bit-reversal read. The instruction's behaviour makes the data read by different registers of the same execution macro bit-reversed, while the data read by the same register across different execution macros are sequential. Table 2 lists the twiddle factors needed by the in-degree layers; as the table shows, they can all be replaced by just three numbers: cos(π/4), sin(π/8), and cos(π/8).
Table 1: data read in bit-reversed order (each number is the element's index in the array)
x y z t
r7:6 0 1 2 3
r9:8 512 513 514 515
r11:10 256 257 258 259
r13:12 768 769 770 771
Table 2: twiddle factors for the first four layers
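The bit-reversed read pattern of Table 1 can be reproduced with an ordinary bit-reversal function. This sketch assumes, matching the four rows shown, that register pair p of macro m receives element index bit_reverse(p) + m; the function names are illustrative:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i (models the addressing
    performed by a bit-reversal instruction)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bitrev_read(n_bits, macros=4, reg_pairs=4):
    """Reproduce the read pattern of Table 1: row p of the result
    corresponds to one register pair, column m to one execution
    macro, holding element index bit_reverse(p) + m."""
    return [[bit_reverse(p, n_bits) + m for m in range(macros)]
            for p in range(reg_pairs)]
```

For a 1024-point input (n_bits = 10) this gives rows 0-3, 512-515, 256-259, and 768-771, exactly the r7:6, r9:8, r11:10, and r13:12 rows of Table 1.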
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space. The temporary space is opened up so that it can perform memory ping-pong with the input-vector space, allowing the computation to complete reads and writes simultaneously when the pipeline is full.
Step 4: assign N - 4 to n.
Step 5: if n is odd, go to step 6; otherwise, go to step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space.
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4, perform the butterfly computation, and store the results, overwriting, into the other space. The read and write spaces in this step depend on the processing path: if this FFT computation has passed through an odd-layer computation (step 6), the read space is the input-vector space and the write space is the allocated temporary space; if it has not, the read space is the allocated temporary space and the write space is the input-vector space. This is the memory ping-pong process.
The middle four-layer computation model is similar to that of the first four layers, except that 16 numbers are combined into one data block, and the computation then proceeds between the data of the units, with the data block as the unit. Fig. 3 is a schematic of the middle four-layer model; with data blocks as units, each butterfly is carried out between the data at corresponding positions within the blocks. The dashed box in the upper-left corner of the figure gives a brief description of a data block.
The present invention further exploits the quarter symmetry of the middle-layer twiddle factors. Because the middle-layer computation of the invention merges what were originally four layers of the computation structure, the data dependences (DP) that existed between the original layers must be resolved by keeping data in a large number of registers. This puts great pressure on register usage, especially for large-scale computation on a DSP based on ILP and DLP. Using the quarter symmetry of the twiddle factors, the second half of the current layer's twiddle factors can be expressed through the first half at the cost of one extra multiplication. Formula (3), derived from formula (1), characterizes the quarter symmetry possessed by the twiddle factors.
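The quarter symmetry in question is the identity W_N^(k+N/4) = -j · W_N^k for the twiddle factor W_N^k = exp(-2πjk/N), so any factor in the second quarter costs only one extra multiplication by -j applied to a first-quarter factor. A sketch with illustrative function names:

```python
import cmath

def twiddle(k, N):
    """Twiddle factor W_N^k = exp(-2*pi*j*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)

def twiddle_from_first_quarter(k, N):
    """Quarter symmetry: for N/4 <= k < N/2, recover W_N^k from a
    first-quarter factor by one extra multiplication by -j."""
    if k < N // 4:
        return twiddle(k, N)
    return -1j * twiddle(k - N // 4, N)
```

This halves the table of distinct factors that must be prefetched for the first half of the spectrum, which is the register and memory saving described above.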
In this embodiment, for the two forms of twiddle factor required by the middle-layer computation, the core butterfly code is program segment 1 and program segment 2, respectively. In the program segments below, r11:10 and r13:12 hold the two operands of the butterfly to be performed; r53:52 holds the real and imaginary parts of the butterfly coefficient; r15:14 serves as a temporary register holding the intermediate result of the complex multiplication.
Program segment 1:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data and twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12_11:10 = cfr13:12 +/- cfr11:10    // butterfly computation
Program segment 2:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data and twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12 = cfr13:12 - jcfr11:10
⑥ || cfr11:10 = cfr13:12 + jcfr11:10    // butterfly computation
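Functionally, the program segments implement a radix-2 butterfly whose complex multiplication by the twiddle factor is carried out with separate real operations. The following is a behavioural model only, not an instruction-accurate transcription of the BWDSP register usage:

```python
def butterfly(a, b, w_re, w_im):
    """Radix-2 butterfly: multiply b by the twiddle factor
    w = w_re + j*w_im using real operations, then form a + b*w
    and a - b*w. a and b are Python complex numbers."""
    # complex multiply b * w via four real multiplies (cf. lines 1-4)
    t_re = b.real * w_re - b.imag * w_im
    t_im = b.imag * w_re + b.real * w_im
    bw = complex(t_re, t_im)
    # butterfly add/subtract (cf. line 5)
    return a + bw, a - bw
```

Segment 2's ±j variant corresponds to a twiddle factor of -j, for which the multiplication reduces to a real/imaginary swap with a sign change.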
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
The simulated macro-transmission operation in step 10 proceeds as follows:
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group. This embodiment describes the process for a processor with 4 execution macros; Table 3 lists the 4 × 4 simulated macro-transmission operation group. In the flow shown in Fig. 2, the operation group constructed at the very start is the simulated macro group used for the simulated macro-transmission operation.
Table 3: the simulated macro-transmission operation group
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 1 2 3
r7 4 5 6 7
r8 8 9 10 11
r9 12 13 14 15
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j; in this way the data of different instruction rows within the same execution macro are adjusted into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K. This step is the core of the simulated macro-transmission operation; as shown in Fig. 2, it is the key inner operation of the double loop. Its kernel program segment is as follows, with the four execution macros identified by x, y, z, and t:
①xr11:10=zr7:6||zr7:6=xr11:10||yr13:12=tr9:8||tr9:8=yr13:12
②xr9:8=yr7:6||yr7:6=xr9:8||zr13:12=tr11:10||tr11:10=zr13:12
③xr13:12=tr7:6||tr7:6=xr13:12||yr11:10=zr9:8||zr9:8=yr11:10
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4.
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3. Table 4 gives the final rearranged result in this embodiment.
Table 4: result after the simulated macro transmission
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 4 8 12
r7 1 5 9 13
r8 2 6 10 14
r9 3 7 11 15
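Comparing Tables 3 and 4, the net effect of the kernel's pairwise inter-macro exchanges is a K × K transpose across the execution macros. A sketch modelling the exchanges as pairwise swaps on a register-row × macro grid; the function name is illustrative:

```python
def simulated_macro_transmission(group):
    """Model the simulated macro-transmission rearrangement:
    group[r][m] is the value in register row r of execution macro m.
    The pairwise inter-macro exchanges of the kernel amount to a
    K x K transpose, turning Table 3 into Table 4. Modifies and
    returns `group`."""
    K = len(group)
    for r in range(K):
        for m in range(r + 1, K):
            # exchange: row r of macro m <-> row m of macro r
            group[r][m], group[m][r] = group[m][r], group[r][m]
    return group
```

Starting from Table 3's row-major layout (macro m, row r holding value 4r + m), the result is Table 4's column-major layout, with each macro now holding one contiguous block of four values.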

Claims (1)

1. An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by the following steps:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers with N ≥ 6; the first four of the N iteration layers are defined as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers;
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers;
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space;
Step 4: assign N - 4 to n;
Step 5: if n is odd, go to step 6; otherwise, go to step 8;
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space;
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4 from the temporary space, perform the butterfly computation, and store the results, overwriting, into the temporary space;
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group;
Step 10.2: initialize j = 1;
Step 10.3: initialize i = 1;
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j, thereby adjusting the data of different instruction rows within the same execution macro into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K;
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4;
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3;
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
CN201610473373.XA 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) Active CN106095730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Publications (2)

Publication Number Publication Date
CN106095730A CN106095730A (en) 2016-11-09
CN106095730B true CN106095730B (en) 2018-10-23

Family

ID=57253425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610473373.XA Active CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Country Status (1)

Country Link
CN (1) CN106095730B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101347B (en) * 2018-07-16 2021-07-20 北京理工大学 Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL
CN109783054B (en) * 2018-12-20 2021-03-09 中国科学院计算技术研究所 Butterfly operation processing method and system of RSFQ FFT processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902506A (en) * 2014-04-16 2014-07-02 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN105630737A (en) * 2016-01-05 2016-06-01 合肥康捷信息科技有限公司 Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2417105B (en) * 2004-08-13 2008-04-09 Clearspeed Technology Plc Processor memory system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902506A (en) * 2014-04-16 2014-07-02 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN105630737A (en) * 2016-01-05 2016-06-01 合肥康捷信息科技有限公司 Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree

Also Published As

Publication number Publication date
CN106095730A (en) 2016-11-09

Similar Documents

Publication Publication Date Title
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
Tanomoto et al. A cgra-based approach for accelerating convolutional neural networks
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN108694690A (en) Subgraph in frequency domain and the dynamic select to the convolution realization on GPU
CN104025067B (en) With the processor for being instructed by vector conflict and being replaced the shared full connection interconnection of instruction
CN103955447B (en) FFT accelerator based on DSP chip
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
Fan et al. Stream processing dual-track CGRA for object inference
CN105468439A (en) Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN105808309A (en) High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
JP2023506343A (en) Vector reduction using shared scratchpad memory
CN110163333A (en) The parallel optimization method of convolutional neural networks
CN106095730B (en) An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)
CN108431770A (en) Hardware aspects associated data structures for accelerating set operation
CN106990995B (en) Circular block size selection method based on machine learning
CN109739556A (en) A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated
CN106933777A (en) The high-performance implementation method of the one-dimensional FFT of base 2 based on the domestic processor of Shen prestige 26010
CN110837483B (en) Tensor dimension transformation method and device
Li et al. Automatic FFT performance tuning on OpenCL GPUs
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
Tan et al. A pipelining loop optimization method for dataflow architecture
Wu et al. Parallel artificial neural network using CUDA-enabled GPU for extracting hydraulic domain knowledge of large water distribution systems
CN106204669A (en) A kind of parallel image compression sensing method based on GPU platform
CN110008436A (en) Fast Fourier Transform (FFT) method, system and storage medium based on data stream architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant