CN106095730B - FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) - Google Patents
FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) Download PDF Info
- Publication number
- CN106095730B CN106095730B CN201610473373.XA CN201610473373A CN106095730B CN 106095730 B CN106095730 B CN 106095730B CN 201610473373 A CN201610473373 A CN 201610473373A CN 106095730 B CN106095730 B CN 106095730B
- Authority
- CN
- China
- Prior art keywords
- macro
- calculation
- fft
- layer
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 238000005457 optimization Methods 0.000 title claims abstract description 14
- 238000004364 calculation method Methods 0.000 claims abstract description 32
- 230000005540 biological transmission Effects 0.000 claims abstract description 17
- 238000004088 simulation Methods 0.000 claims abstract description 10
- 239000013598 vector Substances 0.000 claims description 17
- 230000015654 memory Effects 0.000 claims description 12
- 238000003860 storage Methods 0.000 claims description 12
- 230000017105 transposition Effects 0.000 claims description 9
- 230000008707 rearrangement Effects 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 description 6
- 238000004422 calculation algorithm Methods 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000000205 computational method Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012913 prioritisation Methods 0.000 description 1
- 238000010183 spectrum analysis Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
Abstract
The invention discloses an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by being carried out as follows: 1. determine the number of iteration layers and divide them into a three-layer structure; 2. complete the in-degree-layer calculation using operations such as the bit-reversal instruction; 3. after the in-degree-layer calculation is finished, classify the upcoming middle-layer calculations, handle the odd-layer and even-layer cases separately, and obtain the middle-layer results; 4. use the simulated macro-transmission operation to adjust the middle-layer results and complete the out-degree-layer calculation. The present invention resolves the instruction-dependency and structural-limitation problems present in conventional algorithms and fully exploits the load efficiency of the arithmetic units, thereby substantially increasing the average utilization of the bottleneck functional units.
Description
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and in particular relates to a method for efficiently computing floating-point FFT on hardware platforms based on ILP and DLP.
Background technology
The Discrete Fourier Transform (DFT) is widely used in modern signal processing systems, for example in radar signal processing, SAR image processing, sonar computation, video image algorithms, spectrum analysis and speech recognition. Fourier transform computation is a typical compute-intensive and memory-access-intensive application; for instance, the computational complexity of an N-point DFT is O(N^2). The Fast Fourier Transform (FFT) method proposed by Cooley and Tukey in 1965 significantly reduces the number of operations, lowering the complexity from the original O(N^2) to O(N log2 N). Signal processing applications usually have demanding real-time requirements: the more efficient the FFT computation, the better the real-time performance of the signal processing.
Instruction-level parallelism (ILP) refers to a processor issuing multiple instructions for parallel execution within the same instruction cycle. Data-level parallelism (DLP) refers to an architecture that performs parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP usually adopt VLIW and SIMD techniques and can carry out large-scale, efficient computation.
Because hardware platforms combining ILP and DLP techniques are relatively complex, research on the Fast Fourier Transform for them has not been widely carried out.
Invention content
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), so as to resolve the instruction-dependency and structural limitations of conventional algorithms and to fully exploit the load efficiency of the arithmetic units, thereby substantially increasing the average utilization of the bottleneck functional units.
In order to solve the above-mentioned technical problem, the present invention uses following technical scheme:
The FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) of the present invention is characterized by being carried out in the following steps:
Step 1: assume the length of the FFT input vector to be calculated is M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers and N ≥ 6. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers.
Step 2: use the bit-reversal instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers.
Step 3: perform the in-degree-layer butterfly calculations on the FFT input vector and FFT twiddle factors stored in the registers, and store the obtained in-degree-layer results in a temporary storage space.
Step 4: assign N-4 to n.
Step 5: judge whether n is odd; if so, execute step 6; otherwise execute step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary storage space and perform butterfly calculations; store the obtained layer-(N-n+1) results over the input vector space.
Step 7: assign n-1 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8.
Step 8: read the results and the twiddle factors corresponding to layers N-n+1 through N-n+5 from the temporary storage space and perform butterfly calculations; store the obtained results back over the temporary storage space.
Step 9: assign n-4 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8.
Step 10: use the simulated macro-transmission operation to transpose and rearrange the results in the temporary storage space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly calculations on the transposed and rearranged results and the twiddle factors corresponding to the out-degree layers, and store the obtained out-degree-layer results into the output memory space, thereby completing the FFT floating-point optimization method.
The FFT floating-point optimization method based on ILP and DLP of the present invention is also characterized in that the simulated macro-transmission operation in step 10 is carried out as follows:
Step 10.1: let the processor have K execution macros, where the i-th execution macro is denoted Pi; 1 ≤ i ≤ K; K is a positive integer. Then treat K consecutive instruction rows as one K × K simulated macro-transmission operation group.
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: store the data inside the i-th execution macro Pi of the j-th instruction row into the ((i+j-1) mod K)-th execution macro P(i+j-1) mod K of the j-th instruction row; in this way the data in different instruction rows of the same execution macro are adjusted into different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K.
Step 10.5: assign i+1 to i, and judge whether i > K holds; if so, execute step 10.6; otherwise return to step 10.4.
Step 10.6: assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a new floating-point FFT optimization method adapted to the characteristics of ILP and DLP hardware platforms. By adjusting the radix-2 Cooley-Tukey algorithm structure and compressing the number of calculation layers, and by using techniques such as the simulated macro-transmission operation, memory ping-pong and cache operations, it efficiently deploys the Fast Fourier Transform on hardware platforms based on ILP and DLP techniques; this effectively reduces the clock overhead of the computation and thus improves the platform's efficiency for Fast Fourier Transform calculation.
2. Because the present invention uses a three-layer computing structure model, the original multi-layer structure is reduced to three layers; this reduces both the register-content refreshing caused by scheduling between outer and inner loops and the clock overhead caused by pipeline draining.
3. Because the present invention employs memory ping-pong, reads and writes that would originally go to a single memory block are split across two ping-pong memory blocks; this avoids the clock overhead caused by simultaneously reading and writing one memory, improving computational efficiency.
4. The simulated macro-transmission operation of the present invention uses instruction-level parallelism to adjust data located in different execution clusters (a consequence of data-level parallelism) into the same execution cluster, so that subsequent calculations can proceed; this operation effectively avoids memory bank conflicts and improves the efficiency of data adjustment in each execution macro.
5. The present invention further exploits the symmetry of the butterfly coefficients, reducing the number of times butterfly coefficients are prefetched during computation and thereby reducing register usage; this operation removes nearly half of the twiddle factors, reducing memory usage while also reducing the number of registers occupied by twiddle factors.
6. Experimental verification shows that, for a 1024-point 32-bit floating-point complex Fourier transform, the method of the present invention compresses the computation to 980 clock cycles; the bottleneck functional-unit utilizations of the layers of the structure reach 96.68%, 98.25% and 100% respectively.
Description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of the simulated macro-transmission operation of the present invention;
Fig. 3 shows the four-layer model used by the middle-layer calculation of the present invention.
Specific implementation mode
The purpose of the present invention is to propose an optimization method for floating-point FFT suited to hardware platforms with instruction-level parallelism (ILP) and data-level parallelism (DLP), so that high-performance optimization can be carried out on the hardware infrastructure they provide. The following embodiment discusses the optimization method using the BWDSP104x platform as an example; however, the optimization techniques and methods of the present invention are not limited to the BWDSP104x platform, and any ILP and DLP hardware platform is suitable for the optimization scheme of the present invention.
The BWDSP104x platform has 4 execution macros (x, y, z, t); each macro contains 8 arithmetic logic units (ALU), 8 multipliers (MUL), 4 shifters (SHIFT), 1 super arithmetic unit and one general register file of 128 registers. It has an 11-stage pipeline, and each instruction line can issue up to 16 instructions in parallel.
In the present embodiment, the FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) is carried out in the following steps:
Step 1: assume the length of the FFT input vector to be calculated is M, and determine the number of iteration layers N from M. This embodiment is illustrated with an input vector of length 1024; other lengths can be implemented by a similar scheme. Here M = 2^N, M and N are positive integers, and N ≥ 6; in this case M = 1024 and N = 10. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers. Fig. 1 is the flowchart of this FFT calculation process: steps 1-4 in the figure depict the in-degree-layer calculation, steps 5-7 the middle-layer calculation, and steps 8-10 the out-degree-layer calculation.
Step 2: use the bit-reversal instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers. Table one shows, for a digital signal processor with 4 execution macros, the data stored in each register after the bit-reversal read instruction. Owing to the instruction's characteristics, the data read into different registers of the same execution macro are in bit-reversed order, while the data read into the same register of different execution macros are sequential. Table two lists the details of the twiddle factors needed by the in-degree layers; as can be seen there, these twiddle factors can be replaced by three numbers: cos(π/4), sin(π/8) and cos(π/8).
Table one: data read in bit-reversed order (each number is the element's position in the array)

| | x | y | z | t |
|---|---|---|---|---|
| r7:6 | 0 | 1 | 2 | 3 |
| r9:8 | 512 | 513 | 514 | 515 |
| r11:10 | 256 | 257 | 258 | 259 |
| r13:12 | 768 | 769 | 770 | 771 |
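The bit-reversed layout in Table one can be reproduced with a short sketch. The addressing model used here (element index = bit-reverse of the register-pair row, plus the macro index) is an assumption inferred from the table, not stated in the patent text:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i (the FFT bit-reversal permutation)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

N_BITS = 10  # 1024-point FFT, so indices have 10 bits

# Hypothetical addressing model: register-pair row j of execution macro m
# receives the input element at index bit_reverse(j) + m.
table_one = [[bit_reverse(j, N_BITS) + m for m in range(4)] for j in range(4)]
for row in table_one:
    print(row)
# rows: [0, 1, 2, 3], [512, 513, 514, 515], [256, 257, 258, 259], [768, 769, 770, 771]
```

The printed rows match Table one, which supports reading the table as a bit-reversed stride across register rows combined with a sequential stride across macros.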
Table two: twiddle factors of the first four layers
Step 3: perform the in-degree-layer butterfly calculations on the FFT input vector and FFT twiddle factors stored in the registers, and store the obtained in-degree-layer results in a temporary storage space. The purpose of opening up the temporary space is to allow memory ping-pong with the input vector space, so that the calculation can complete reads and writes simultaneously when the pipeline is full.
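For reference, the radix-2 butterfly that each layer applies can be sketched as follows; this is the generic decimation-in-time butterfly, not the platform's assembly:

```python
def butterfly(a, b, w):
    """Radix-2 DIT butterfly: combine operands a and b with twiddle factor w."""
    t = w * b
    return a + t, a - t

# Smallest example: a 2-point FFT of [1, 2] is a single butterfly with w = 1.
X0, X1 = butterfly(1 + 0j, 2 + 0j, 1 + 0j)
print(X0, X1)  # (3+0j) (-1+0j)
```

Each layer of the structure applies this operation across all pairs, with the pairing distance and twiddle factors varying by layer.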
Step 4: assign N-4 to n.
Step 5: judge whether n is odd; if so, execute step 6; otherwise execute step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary storage space and perform butterfly calculations; store the obtained layer-(N-n+1) results over the input vector space.
Step 7: assign n-1 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8.
Step 8: read the results and the twiddle factors corresponding to layers N-n+1 through N-n+5 from the temporary storage space and perform butterfly calculations; store the obtained results back over the temporary storage space. The temporary space in this step must be discussed according to the processing path: if this FFT calculation has passed through an odd-layer calculation, the read space is the input vector space and the store space is the opened temporary space; if it has not passed through an odd-layer calculation, the read space is the opened temporary space and the store space is the input vector space. The above is the memory ping-pong process.
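The memory ping-pong described above alternates the roles of the two spaces from stage to stage; a minimal sketch, with assumed names and independent of the BWDSP memory layout:

```python
def run_stages(data, n_stages, stage_fn):
    """Run n_stages passes, ping-ponging between two buffers so that no
    stage ever reads and writes the same memory block."""
    src = list(data)           # plays the role of the input vector space
    dst = [None] * len(data)   # plays the role of the temporary storage space
    for s in range(n_stages):
        for k in range(len(src)):
            dst[k] = stage_fn(s, k, src)
        src, dst = dst, src    # swap read/store roles: the ping-pong step
    return src

# Trivial stage that doubles every element; two stages quadruple the data.
result = run_stages([1, 2, 3], 2, lambda s, k, buf: 2 * buf[k])
print(result)  # [4, 8, 12]
```

Because each stage reads only from `src` and writes only to `dst`, a pipelined implementation never stalls on a simultaneous read and write of one memory block, which is the benefit the text describes.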
The computation model of the middle four layers is similar to that of the first four layers, except that 16 numbers are combined into one data block, and the calculations then proceed between units with the data block as the unit. Fig. 3 is a schematic diagram of the middle four-layer computation model; taking the data block as the unit, each butterfly operation is carried out between correspondingly ordered data within the blocks. The dotted box in the upper-left corner of the figure gives a brief description of a data block.
The present invention further exploits the quarter symmetry of the middle-layer twiddle factors. Because the middle-layer calculation of the present invention merges four layers of the original computing structure, the data dependences (Data Dependence, DP) that existed between the original layers must be resolved by keeping data alive in a large number of registers; this puts great pressure on register usage, especially for large-scale computation on a DSP based on ILP and DLP. According to the quarter symmetry of the twiddle factors, the second-half twiddle factors of the current layer can be disguised as the first-half twiddle factors followed by one extra multiplication. Formula (3), derived from formula (1), characterizes the quarter symmetry possessed by the twiddle factors.
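The quarter symmetry being exploited is the standard twiddle-factor identity W_N^(k+N/4) = -j · W_N^k, with W_N = e^(-2πj/N): shifting the index by a quarter period multiplies the factor by -j, so the second half of the factors follows from the first half by one multiplication. A quick numeric check:

```python
import cmath

def W(N, k):
    """Twiddle factor W_N^k = exp(-2*pi*j*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)

N = 1024
for k in range(N // 4):
    # Quarter symmetry: shifting k by N/4 multiplies the factor by -j.
    assert abs(W(N, k + N // 4) - (-1j) * W(N, k)) < 1e-12
print("quarter symmetry holds for all k < N/4")
```

This is consistent with program segment 2 below, where the extra multiplication by ±j is folded into the butterfly itself.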
For the two twiddle-factor cases required by the middle-layer calculation in this embodiment (the second being the quarter-symmetric counterpart of the first), the core code realizing the butterfly operation is program segment 1 and program segment 2 respectively. In the following program segments, r11:10 and r13:12 store the two operands of the butterfly operation; r53:52 stores the real and imaginary parts of the butterfly coefficient; r15:14 serves as a temporary register storing the intermediate result of the complex multiplication.
Program segment 1:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data times twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12_11:10 = cfr13:12 +/– cfr11:10    // butterfly operation
Program segment 2:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data times twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12 = cfr13:12 – jcfr11:10
⑥ || cfr11:10 = cfr13:12 + jcfr11:10    // butterfly operation
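Program segments 1 and 2 decompose the complex multiplication into two complex-by-scalar products (by the coefficient's real part in fr53 and imaginary part in fr52) followed by an add/subtract fix-up. The arithmetic they implement can be mirrored in a short sketch; the mapping of register pairs to variables is an annotation, not part of the patent's code:

```python
def cmul_by_parts(a_re, a_im, w_re, w_im):
    """Complex multiply (a_re + j*a_im) * (w_re + j*w_im), decomposed as in
    program segments 1 and 2: two scalar products, then a fix-up."""
    p_re, p_im = a_re * w_re, a_im * w_re   # step ①: operand times fr53 (real part)
    q_re, q_im = a_re * w_im, a_im * w_im   # step ②: operand times fr52 (imag part)
    return p_re - q_im, p_im + q_re         # steps ③/④: real and imaginary results

re_, im_ = cmul_by_parts(1.0, 2.0, 3.0, 4.0)
print(re_, im_)  # -5.0 10.0, matching (1+2j)*(3+4j) = -5+10j
```

Doing the two scalar products in parallel instruction slots (the `||` pairs) is what lets the platform hide the multiply latency behind the fix-up adds.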
Step 9: assign n-4 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8.
Step 10: use the simulated macro-transmission operation to transpose and rearrange the results in the temporary storage space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly calculations on the transposed and rearranged results and the twiddle factors corresponding to the out-degree layers, and store the obtained out-degree-layer results into the output memory space, thereby completing the FFT floating-point optimization method.
The simulated macro-transmission operation in step 10 is carried out as follows:
Step 10.1: let the processor have K execution macros, where the i-th execution macro is denoted Pi; 1 ≤ i ≤ K; K is a positive integer. Then treat K consecutive instruction rows as one K × K simulated macro-transmission operation group. This embodiment describes the process for a processor with 4 execution macros; Table three lists the 4 × 4 simulated macro-transmission operation group. In the flow shown in Fig. 2, the operation group built at the very start is the object on which the simulated macro transmission operates.
Table three: simulated macro-transmission operation group

| | Macro 1 | Macro 2 | Macro 3 | Macro 4 |
|---|---|---|---|---|
| r6 | 0 | 1 | 2 | 3 |
| r7 | 4 | 5 | 6 | 7 |
| r8 | 8 | 9 | 10 | 11 |
| r9 | 12 | 13 | 14 | 15 |
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: store the data inside the i-th execution macro Pi of the j-th instruction row into the ((i+j-1) mod K)-th execution macro P(i+j-1) mod K of the j-th instruction row; in this way the data in different instruction rows of the same execution macro are adjusted into different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K. This step is the core of the simulated macro-transmission operation; as shown in Fig. 2, it is the key operation inside the double loop. Its kernel program segment is as follows, where the four execution macros are denoted x, y, z and t respectively:
① xr11:10 = zr7:6 || zr7:6 = xr11:10 || yr13:12 = tr9:8 || tr9:8 = yr13:12
② xr9:8 = yr7:6 || yr7:6 = xr9:8 || zr13:12 = tr11:10 || tr11:10 = zr13:12
③ xr13:12 = tr7:6 || tr7:6 = xr13:12 || yr11:10 = zr9:8 || zr9:8 = yr11:10
Step 10.5: assign i+1 to i, and judge whether i > K holds; if so, execute step 10.6; otherwise return to step 10.4.
Step 10.6: assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3. Table four gives the final rearranged result in the present embodiment.
Table four: result after the simulated macro transmission

| | Macro 1 | Macro 2 | Macro 3 | Macro 4 |
|---|---|---|---|---|
| r6 | 0 | 4 | 8 | 12 |
| r7 | 1 | 5 | 9 | 13 |
| r8 | 2 | 6 | 10 | 14 |
| r9 | 3 | 7 | 11 | 15 |
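The net effect of the kernel in step 10.4 (lines ①-③) is a set of pairwise exchanges between macro a, row b and macro b, row a, i.e. an in-register transpose of the K × K block, taking the layout of Table three to that of Table four. A sketch of that effect, with the register file modeled as a 2-D list:

```python
K = 4
# Table three layout: register row j of execution macro m holds element 4*j + m,
# so macro m's column reads [m, 4+m, 8+m, 12+m].
regs = [[4 * j + m for j in range(K)] for m in range(K)]

# Pairwise exchanges (macro a, row b) <-> (macro b, row a): a block transpose,
# matching the swap pairs in kernel lines ①-③.
for a in range(K):
    for b in range(a + 1, K):
        regs[a][b], regs[b][a] = regs[b][a], regs[a][b]

for m in range(K):
    print(regs[m])  # Table four: macro m now holds 4*m, 4*m+1, 4*m+2, 4*m+3
```

Because every exchange moves data between two macros in one parallel instruction, the K × K transpose completes in K-1 instruction rows without touching memory, which is what avoids the bank conflicts mentioned in the beneficial effects.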
Claims (1)
1. An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by being carried out in the following steps:
Step 1: assume the length of the FFT input vector to be calculated is M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers and N ≥ 6; define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers;
Step 2: use the bit-reversal instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers;
Step 3: perform the in-degree-layer butterfly calculations on the FFT input vector and FFT twiddle factors stored in the registers, and store the obtained in-degree-layer results in a temporary storage space;
Step 4: assign N-4 to n;
Step 5: judge whether n is odd; if so, execute step 6; otherwise execute step 8;
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary storage space and perform butterfly calculations; store the obtained layer-(N-n+1) results over the input vector space;
Step 7: assign n-1 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8;
Step 8: read the results and the twiddle factors corresponding to layers N-n+1 through N-n+5 from the temporary storage space and perform butterfly calculations; store the obtained results back over the temporary storage space;
Step 9: assign n-4 to n; judge whether n = 2 holds; if so, execute step 10; otherwise execute step 8;
Step 10: use the simulated macro-transmission operation to transpose and rearrange the results in the temporary storage space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 10.1: let the processor have K execution macros, where the i-th execution macro is denoted Pi; 1 ≤ i ≤ K; K is a positive integer; then treat K consecutive instruction rows as one K × K simulated macro-transmission operation group;
Step 10.2: initialize j = 1;
Step 10.3: initialize i = 1;
Step 10.4: store the data inside the i-th execution macro Pi of the j-th instruction row into the ((i+j-1) mod K)-th execution macro P(i+j-1) mod K of the j-th instruction row; in this way the data in different instruction rows of the same execution macro are adjusted into different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K;
Step 10.5: assign i+1 to i, and judge whether i > K holds; if so, execute step 10.6; otherwise return to step 10.4;
Step 10.6: assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3;
Step 11: perform the out-degree-layer butterfly calculations on the transposed and rearranged results and the twiddle factors corresponding to the out-degree layers, and store the obtained out-degree-layer results into the output memory space, thereby completing the FFT floating-point optimization method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610473373.XA CN106095730B (en) | 2016-06-23 | 2016-06-23 | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610473373.XA CN106095730B (en) | 2016-06-23 | 2016-06-23 | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106095730A CN106095730A (en) | 2016-11-09 |
CN106095730B true CN106095730B (en) | 2018-10-23 |
Family
ID=57253425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610473373.XA Active CN106095730B (en) | 2016-06-23 | 2016-06-23 | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095730B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101347B (en) * | 2018-07-16 | 2021-07-20 | 北京理工大学 | Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL |
CN109783054B (en) * | 2018-12-20 | 2021-03-09 | 中国科学院计算技术研究所 | Butterfly operation processing method and system of RSFQ FFT processor |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902506A (en) * | 2014-04-16 | 2014-07-02 | 中国科学技术大学先进技术研究院 | FFTW3 optimization method based on loongson 3B processor |
CN105630737A (en) * | 2016-01-05 | 2016-06-01 | 合肥康捷信息科技有限公司 | Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2417105B (en) * | 2004-08-13 | 2008-04-09 | Clearspeed Technology Plc | Processor memory system |
-
2016
- 2016-06-23 CN CN201610473373.XA patent/CN106095730B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902506A (en) * | 2014-04-16 | 2014-07-02 | 中国科学技术大学先进技术研究院 | FFTW3 optimization method based on loongson 3B processor |
CN105630737A (en) * | 2016-01-05 | 2016-06-01 | 合肥康捷信息科技有限公司 | Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree |
Also Published As
Publication number | Publication date |
---|---|
CN106095730A (en) | 2016-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341542B (en) | Apparatus and method for performing recurrent neural networks and LSTM operations | |
Tanomoto et al. | A cgra-based approach for accelerating convolutional neural networks | |
WO2021057746A1 (en) | Neural network processing method and apparatus, computer device and storage medium | |
CN108694690A (en) | Subgraph in frequency domain and the dynamic select to the convolution realization on GPU | |
CN104025067B (en) | With the processor for being instructed by vector conflict and being replaced the shared full connection interconnection of instruction | |
CN103955447B (en) | FFT accelerator based on DSP chip | |
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor | |
CN107451097B (en) | High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor | |
Fan et al. | Stream processing dual-track CGRA for object inference | |
CN105468439A (en) | Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework | |
CN105808309A (en) | High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform | |
JP2023506343A (en) | Vector reduction using shared scratchpad memory | |
CN110163333A (en) | The parallel optimization method of convolutional neural networks | |
CN106095730B (en) | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) | |
CN108431770A (en) | Hardware aspects associated data structures for accelerating set operation | |
CN106990995B (en) | Circular block size selection method based on machine learning | |
CN109739556A (en) | A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated | |
CN106933777A (en) | High-performance implementation method of radix-2 one-dimensional FFT based on the domestic Sunway 26010 processor | |
CN110837483B (en) | Tensor dimension transformation method and device | |
Li et al. | Automatic FFT performance tuning on OpenCL GPUs | |
CN111178492B (en) | Computing device, related product and computing method for executing artificial neural network model | |
Tan et al. | A pipelining loop optimization method for dataflow architecture | |
Wu et al. | Parallel artificial neural network using CUDA-enabled GPU for extracting hydraulic domain knowledge of large water distribution systems | |
CN106204669A (en) | A kind of parallel image compression sensing method based on GPU platform | |
CN110008436A (en) | Fast Fourier Transform (FFT) method, system and storage medium based on data stream architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |