CN106095730A - An FFT floating-point optimization method based on ILP and DLP - Google Patents
- Publication number: CN106095730A
- Application number: CN201610473373.XA
- Authority: CN (China)
- Legal status: Granted (as listed by Google Patents; not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/141—Discrete Fourier transforms
- G06F17/142—Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
Abstract
The invention discloses an FFT floating-point optimization method based on ILP and DLP, characterized by the following steps: 1. determine the number of iteration layers and divide them into a three-tier structure; 2. complete the in-degree-layer calculation using operations such as the bit-reversal instruction; 3. after the in-degree layers are computed, classify the intermediate-layer calculation into an odd-layer case and an even-layer case, compute each, and obtain the intermediate-layer results; 4. rearrange the intermediate-layer results with the simulated macro transfer operation and complete the out-degree-layer calculation. The invention resolves the instruction-dependency and structural constraints present in the algorithm, makes full use of the arithmetic units, and substantially increases the average utilization of the bottleneck functional units.
Description
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and specifically relates to a method for efficiently computing a floating-point FFT on hardware platforms based on ILP and DLP.
Background art
The discrete Fourier transform (Discrete Fourier Transform, DFT) is widely used in modern signal processing systems, for example in radar signal processing, SAR image processing, sonar computation, video and image algorithms, spectrum analysis and speech recognition. Fourier transform computation is a typical compute-intensive and memory-access-intensive application; the computational complexity of an N-point DFT, for example, is O(N²). In 1965 Cooley and Tukey proposed the fast Fourier transform (Fast Fourier Transform, FFT), which significantly reduces the amount of computation, lowering the complexity from O(N²) to O(N log₂ N). Signal-processing applications generally impose strict real-time requirements, so the more efficient the FFT computation, the better the real-time performance of the signal processing.
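To make the O(N²) versus O(N log₂ N) contrast concrete, here is a minimal Python sketch comparing a direct DFT with a recursive radix-2 decimation-in-time Cooley-Tukey FFT. It is a textbook illustration, not the optimized method of this patent; the function names `dft` and `fft_radix2` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def dft(x):
    """Direct DFT: every output sums over every input, O(M^2) operations."""
    m = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / m) for n in range(m))
        for k in range(m)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time Cooley-Tukey FFT, O(M log2 M).

    Assumes len(x) is a power of two.
    """
    m = len(x)
    if m == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # FFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # FFT of the odd-indexed samples
    out = [0j] * m
    for k in range(m // 2):
        t = cmath.exp(-2j * cmath.pi * k / m) * odd[k]  # twiddle W_m^k times odd term
        out[k] = even[k] + t            # upper butterfly output
        out[k + m // 2] = even[k] - t   # lower butterfly output
    return out


if __name__ == "__main__":
    signal = [complex(n % 7, (3 * n) % 5) for n in range(64)]
    error = max(abs(a - b) for a, b in zip(dft(signal), fft_radix2(signal)))
    print(f"max |DFT - FFT| = {error:.2e}")
```

Both functions return the same spectrum up to rounding; the FFT reaches it through log₂ M layers of butterflies, which is the layered structure the method below reorganizes.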
Instruction-level parallelism (Instruction Level Parallelism, ILP) means that a processor issues multiple instructions for parallel execution in the same instruction cycle. Data-level parallelism (Data Level Parallelism, DLP) refers to an architecture that performs parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP generally use VLIW and SIMD techniques, which allow them to perform large-scale, efficient computation.
Because hardware platforms that combine ILP and DLP techniques are complex, research on fast Fourier transforms for such platforms has not yet been undertaken.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on ILP and DLP that resolves the instruction-dependency and structural constraints present in the algorithm and makes full use of the arithmetic units, thereby substantially improving the average utilization of the bottleneck functional units.
To solve the above technical problem, the present invention adopts the following technical solution:
The FFT floating-point optimization method based on ILP and DLP of the present invention is characterized by being carried out as follows:
Step 1: Let M be the length of the FFT input vector to be computed, and determine the number of iteration layers N from M, where M = 2^N, M and N are positive integers, and N ≥ 6. Layers 1 to 4 are defined as the in-degree layers, layers 5 to N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers;
Step 2: Using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers;
Step 3: Perform the in-degree-layer butterfly calculation on the FFT input vector and FFT twiddle factors stored in the registers, and store the resulting in-degree-layer results in the temporary space;
Step 4: Assign N-4 to n;
Step 5: Determine whether n is odd; if so, go to step 6; otherwise, go to step 8;
Step 6: Read the in-degree-layer results from the temporary space together with the twiddle factors corresponding to layer N-n+1, perform the butterfly calculation, and store the layer N-n+1 results in the input vector space, overwriting its contents;
Step 7: Assign n-1 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 8: Read the results from the temporary space together with the twiddle factors corresponding to layers N-n+1 to N-n+5, perform the butterfly calculations, and store the results in the temporary space, overwriting its contents;
Step 9: Assign n-4 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 10: Transpose and rearrange the results in the temporary space using the simulated macro transfer operation, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 11: Perform the out-degree-layer butterfly calculation on the rearranged results and the twiddle factors corresponding to the out-degree layers, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
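The control flow of steps 4 to 9 can be traced with a small Python sketch that lists which layers each pass computes and which buffer it reads and writes. Two readings here are assumptions rather than statements of the text: each step-8 pass is taken to cover the four layers N-n+1 to N-n+4 (step 9 subtracts 4 and the intermediate layers end at N-2), and n is compared with 2 before step 8 is entered, so that N = 6, which has no intermediate layers, goes straight to step 10. The buffers alternate on every pass, following the memory ping-pong operation described for step 8 in the detailed embodiment; the labels "input", "temp" and "output" and the function name `layer_schedule` are hypothetical.

```python
def layer_schedule(n_layers):
    """Sketch of steps 1-11: the layers each pass computes and its source and
    destination buffers under memory ping-pong operation.

    "input", "temp" and "output" are illustrative labels for the input vector
    space, the temporary space and the output memory space.
    """
    if n_layers < 6:
        raise ValueError("the method assumes N >= 6")
    # Steps 2-3: in-degree layers 1-4 read the input and write the temporary space.
    passes = [("in-degree", [1, 2, 3, 4], "input", "temp")]
    src, dst = "temp", "input"  # every later pass swaps the ping-pong buffers
    n = n_layers - 4            # step 4: layers still to be computed
    if n % 2 == 1:              # steps 5-6: one single intermediate layer
        passes.append(("single", [n_layers - n + 1], src, dst))
        src, dst = dst, src
        n -= 1                  # step 7
    while n != 2:               # steps 8-9: four intermediate layers per pass
        if n < 6:
            raise ValueError(f"N = {n_layers}: steps 8-9 never reach n = 2")
        first = n_layers - n + 1
        passes.append(("four-layer", list(range(first, first + 4)), src, dst))
        src, dst = dst, src
        n -= 4
    # Steps 10-11: transpose, then the two out-degree layers write the output.
    passes.append(("out-degree", [n_layers - 1, n_layers], src, "output"))
    return passes


for n_layers in (10, 11):
    for step in layer_schedule(n_layers):
        print(n_layers, step)
```

Traced this way, the loop reaches n = 2 only when the number of intermediate layers, N - 6, is congruent to 0 or 1 modulo 4 (for example N = 10 or N = 11); for other values such as N = 8 the sketch raises an error rather than guessing how the method would handle them.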
The FFT floating-point optimization method based on ILP and DLP of the present invention is further characterized in that
the simulated macro transfer operation in step 10 is carried out as follows:
Step 10.1: Suppose the processor has K execution macros, the i-th of which is denoted P_i, where 1 ≤ i ≤ K and K is a positive integer; treat K consecutive instruction lines as one K × K simulated macro transfer group;
Step 10.2: Initialize j = 1;
Step 10.3: Initialize i = 1;
Step 10.4: Store the data in the i-th execution macro P_i of the j-th instruction line into execution macro P_((i+j-1) mod K) of the j-th instruction line, where 1 ≤ j ≤ K; in this way, data in different instruction lines of the same execution macro are adjusted into different execution macros of the corresponding instruction lines;
Step 10.5: Assign i+1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4;
Step 10.6: Assign j+1 to j; if j > K, the transposition and rearrangement of the results is complete; otherwise, return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a new optimization method for floating-point FFT adapted to the characteristics of ILP and DLP hardware platforms. It restructures the radix-2 Cooley-Tukey algorithm to compress the number of computation layers, and applies techniques such as the simulated macro transfer operation, memory ping-pong operation and cache operations, so that the fast Fourier transform can be deployed efficiently on hardware platforms based on ILP and DLP. This effectively reduces the clock overhead of the computation and thereby improves the efficiency of FFT computation on such platforms;
2. Because the invention uses a three-tier computation structure, a calculation that originally had many layers is reduced to three tiers, which reduces the clock overhead caused by register refreshes from scheduling between inner and outer loops and by pipeline flushes;
3. Because the invention uses memory ping-pong operation, reads and writes that originally targeted a single memory block are divided between two ping-pong memory blocks, which avoids the clock overhead of reading and writing the same memory simultaneously and improves computational efficiency;
4. The simulated macro transfer operation of the invention uses instruction-level parallelism to move data that data-level parallelism has distributed across different execution clusters into the same execution cluster, as the subsequent calculation requires. This operation effectively avoids memory bank conflicts and improves the efficiency of data rearrangement across the execution macros;
5. The invention further exploits the symmetry of the butterfly coefficients, reducing the number of butterfly coefficients that must be prefetched during computation and hence register usage. This removes nearly half of the twiddle factors, reducing memory usage as well as the number of registers occupied by twiddle factors;
6. Experiments show that, for a 32-bit floating-point complex FFT, the method completes the computation on 1024 inputs in 980 clock cycles, and the utilization of the bottleneck functional units in the three computation tiers reaches 96.68%, 98.25% and 100%, respectively.
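The 980-cycle figure can be put in perspective with a back-of-the-envelope operation count. The sketch below assumes every one of the 5,120 butterflies of a 1024-point radix-2 FFT is computed generically (one complex multiply = 4 real multiplies + 2 real adds, plus a complex add and a complex subtract = 4 real adds), that each of the 4 × 8 ALUs and 4 × 8 multipliers of the BWDSP104x described in the embodiment completes one single-precision operation per cycle, and it ignores loads, stores and trivial twiddles. These throughput figures are assumptions for illustration, not platform specifications, and `alu_bound_cycles` is a hypothetical helper.

```python
def alu_bound_cycles(points, macros=4, alus_per_macro=8, muls_per_macro=8):
    """Rough cycle floors for a radix-2 FFT, split by functional-unit type.

    Assumes one real floating-point operation per unit per cycle and that
    every butterfly does a full complex multiply.
    """
    stages = points.bit_length() - 1        # log2(points) for a power of two
    butterflies = stages * points // 2      # points/2 butterflies per stage
    real_adds = 6 * butterflies             # 2 in the complex multiply + 4 in add/sub
    real_muls = 4 * butterflies             # 4 real multiplies per complex multiply
    alu_cycles = real_adds / (macros * alus_per_macro)
    mul_cycles = real_muls / (macros * muls_per_macro)
    return alu_cycles, mul_cycles


alu, mul = alu_bound_cycles(1024)
print(f"ALU floor: {alu:.0f} cycles, multiplier floor: {mul:.0f} cycles")
```

Under these assumptions the adders, not the multipliers, are the bottleneck units, with a floor of 960 cycles for 1024 points; this is only an order-of-magnitude check against the reported 980 cycles.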
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of the simulated macro transfer operation of the present invention;
Fig. 3 shows the four-layer model used for the intermediate-layer calculation of the present invention.
Detailed description of the embodiments
The purpose of the present invention is to provide an optimization method for floating-point FFT suited to ILP and DLP hardware platforms, so that high-performance optimization can be achieved on the hardware infrastructure they provide. The following detailed description discusses the method using only the BWDSP104x platform as an example, but the optimization techniques and methods of the invention are not limited to the BWDSP104x platform; the optimization scheme of the invention is applicable to any ILP and DLP hardware platform.
The BWDSP104x platform has 4 execution macros (x, y, z, t); each macro contains 8 arithmetic logic units (ALUs), 8 multipliers (MULs), 4 shifters (SHIFTs), 1 super-computation unit and a general-purpose register file of 128 registers. The platform has an 11-stage pipeline, and each instruction line can issue up to 16 word-instructions in parallel.
In this embodiment, the FFT floating-point optimization method based on ILP and DLP is carried out as follows:
Step 1: Let M be the length of the FFT input vector to be computed, and determine the number of iteration layers N from M. This embodiment uses an input vector of length 1024 as an example; other lengths can be handled by a similar scheme. Here M = 2^N, M and N are positive integers, and N ≥ 6; in this embodiment M = 1024 and N = 10. Layers 1 to 4 are defined as the in-degree layers, layers 5 to N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers. Fig. 1 is the flowchart of this FFT calculation: in the figure, steps 1-4 describe the in-degree-layer calculation, steps 5-7 the intermediate-layer calculation, and steps 8-10 the out-degree-layer calculation;
Step 2: Using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers. Table 1 shows, for a digital signal processor with 4 execution macros, the data held in each register after reading with the bit-reversal instruction. Owing to the behavior of this instruction, data read into different registers of the same execution macro are in bit-reversed order, while data read into the same register of different execution macros are in sequential order. Table 2 lists the twiddle factors required by the in-degree layers. As Table 2 shows, apart from the trivial values 0 and ±1, the twiddle factors required by the in-degree layers can be expressed with three numbers: cos(π/4), sin(π/8) and cos(π/8).
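The three-constant claim can be checked numerically. The sketch below assumes the usual radix-2 convention that layer L uses the twiddle factors W = exp(-2πjk/2^L) for k < 2^(L-1); it collects the distinct magnitudes of their real and imaginary parts over layers 1 to 4 and discards the trivial values 0 and 1. The function name `in_degree_twiddle_constants` is hypothetical.

```python
import math


def in_degree_twiddle_constants(decimals=12):
    """Distinct nontrivial |Re| and |Im| values of the layer 1-4 twiddle factors.

    Layer L of a radix-2 FFT uses W = exp(-2*pi*j*k / 2**L) for k < 2**(L-1);
    values are rounded so that equal magnitudes (e.g. cos(pi/4) and sin(pi/4))
    collapse to a single entry.
    """
    magnitudes = set()
    for layer in range(1, 5):
        span = 2 ** layer
        for k in range(span // 2):
            angle = 2 * math.pi * k / span
            magnitudes.add(round(abs(math.cos(angle)), decimals))
            magnitudes.add(round(abs(math.sin(angle)), decimals))
    return sorted(magnitudes - {0.0, 1.0})  # drop the trivial values 0 and 1


print(in_degree_twiddle_constants())
```

The three surviving values are sin(π/8), cos(π/4) and cos(π/8), the same three constants named above; every in-degree twiddle factor is built from them and their signs.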
Table 1: Data read in bit-reversed order (each number is the element's index in the input array)
Register | x | y | z | t |
---|---|---|---|---|
r7:6 | 0 | 1 | 2 | 3 |
r9:8 | 512 | 513 | 514 | 515 |
r11:10 | 256 | 257 | 258 | 259 |
r13:12 | 768 | 769 | 770 | 771 |
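Table 1 follows a simple rule: register row r of macro m holds input element bitrev(r) + m, where bitrev reverses the 10 index bits (1 becomes 512, 2 becomes 256, 3 becomes 768). The Python sketch below reproduces the table from that rule; the rule itself is inferred from the four rows shown, and `bit_reverse` and `bit_reversed_tile` are hypothetical names.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (e.g. 1 -> 512 for 10 bits)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_tile(n_layers, rows, macros=4):
    """Hypothetical model of Table 1: register row r of macro m holds
    input element bit_reverse(r, n_layers) + m."""
    return [[bit_reverse(r, n_layers) + m for m in range(macros)]
            for r in range(rows)]


for row in bit_reversed_tile(10, rows=4):
    print(row)
```

If the same rule is extended to all 256 register rows (an assumption), macro m holds exactly the samples whose index is congruent to m modulo 4, in bit-reversed order. On that reading, the first N-2 layers of a decimation-in-time FFT can run inside each macro, and only the last two layers need data from other macros, which would fit the transpose placed before the two out-degree layers; this is an interpretation, not a statement of the source.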
Table 2: Twiddle factors of the first four layers
Step 3: Perform the in-degree-layer butterfly calculation on the FFT input vector and FFT twiddle factors stored in the registers, and store the resulting in-degree-layer results in the temporary space. The temporary space is allocated so that it and the input vector space can be used for memory ping-pong operation, allowing the computation to read and write simultaneously while the pipeline stays full;
Step 4: Assign N-4 to n;
Step 5: Determine whether n is odd; if so, go to step 6; otherwise, go to step 8;
Step 6: Read the in-degree-layer results from the temporary space together with the twiddle factors corresponding to layer N-n+1, perform the butterfly calculation, and store the layer N-n+1 results in the input vector space, overwriting its contents;
Step 7: Assign n-1 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 8: Read the results from the temporary space together with the twiddle factors corresponding to layers N-n+1 to N-n+5, perform the butterfly calculations, and store the results in the temporary space, overwriting its contents. Which space plays each role in this step depends on the processing path: if the FFT calculation includes the single odd-layer calculation, the read space is the input vector space and the storage space is the allocated temporary space; if it does not, the read space is the allocated temporary space and the storage space is the input vector space. This is the memory ping-pong operation;
The computation model of the middle four layers is similar to that of the first four layers, except that 16 numbers are combined into one data block and the calculation is carried out block by block, with the data block as the unit. Fig. 3 is a sketch of the middle-four-layer computation model: in the figure, with the data block as the unit, each butterfly computation is carried out in turn between corresponding elements of the data blocks. The dotted box in the upper left corner of the figure gives a brief description of a data block.
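Why the middle four layers can reuse the first-four-layer model can be seen from the butterfly pairing alone. The sketch below assumes standard in-place decimation-in-time indexing (layer L pairs element i with element i XOR 2^(L-1)) and assumes a data block is 16 consecutive elements; neither detail is spelled out in the text, and `partner` and `block_partner` are hypothetical names. Twiddle factors are not modelled.

```python
def partner(index, layer):
    """In-place radix-2 DIT: layer L pairs element i with i XOR 2**(L-1)."""
    return index ^ (1 << (layer - 1))


def block_partner(index, layer, block_size=16):
    """Same pairing for layers 5-8, computed on 16-element data blocks:
    the block index follows the layer (L-4) rule, the offset is unchanged."""
    block, offset = divmod(index, block_size)
    return partner(block, layer - 4) * block_size + offset


same = all(partner(i, layer) == block_partner(i, layer)
           for i in range(1024) for layer in range(5, 9))
print("layers 5-8 on blocks match layers 1-4 on elements:", same)
```

Under these assumptions, layers 5 to 8 pair whole blocks exactly as layers 1 to 4 pair single elements, and each butterfly then runs between corresponding elements of the two blocks.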
The present invention further exploits the quarter-period symmetry of the intermediate-layer twiddle factors. Because the intermediate-layer calculation of the invention merges four layers of the original computation structure, the data dependences (Data Dependence, DP) that existed between the original layers must be resolved by holding data in a large number of registers. This puts heavy pressure on register usage, particularly when a DSP based on ILP and DLP performs large-scale calculations. Using the quarter-period symmetry of the twiddle factors, the twiddle factors in the second half of the current layer can be expressed through those in the first half, so that only one complex multiplication is needed. Formulas (1) to (3) characterize the quarter-period symmetry of these twiddle factors;
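A likely reading of this symmetry is the standard identity W_N^(k+N/4) = -j·W_N^k, with W_N^k = exp(-2πjk/N); the formulas themselves are not included here, so treat this as an assumption. The Python sketch stores only the first quarter of a twiddle table and derives the second quarter by swapping real and imaginary parts and negating one of them, which costs no multiplication. `twiddle` and `half_table_from_quarter` are hypothetical names.

```python
import cmath


def twiddle(k, n):
    """Twiddle factor W_n^k = exp(-2*pi*j*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def half_table_from_quarter(n):
    """Return W_n^k for k in [0, n/2) while computing only k in [0, n/4).

    Uses W_n^(k + n/4) = -j * W_n^k; multiplying a + jb by -j gives b - ja,
    so the second quarter needs no multiplications. Assumes n % 4 == 0.
    """
    first_quarter = [twiddle(k, n) for k in range(n // 4)]
    second_quarter = [complex(w.imag, -w.real) for w in first_quarter]
    return first_quarter + second_quarter


table = half_table_from_quarter(64)
print(len(table), "twiddles from", 64 // 4, "stored values")
```

Storing n/4 instead of n/2 values is consistent with the "nearly half" reduction in twiddle factors claimed among the beneficial effects.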
When the twiddle factors required for the intermediate-layer calculation in this embodiment are … and …, respectively, the core code implementing the butterfly computation is program segment 1 and program segment 2, respectively. In the program segments below, r11:10 and r13:12 hold the two groups of numbers on which the butterfly computation is performed; r53:52 holds the real and imaginary parts of the butterfly coefficient; r15:14 is used as a temporary register to hold the intermediate result of the complex multiplication;
Program segment 1:
Program segment 2:
Step 9: Assign n-4 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 10: Transpose and rearrange the results in the temporary space using the simulated macro transfer operation, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 11: Perform the out-degree-layer butterfly calculation on the rearranged results and the twiddle factors corresponding to the out-degree layers, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
The simulated macro transfer operation in step 10 is carried out as follows:
Step 10.1: Suppose the processor has K execution macros, the i-th of which is denoted P_i, where 1 ≤ i ≤ K and K is a positive integer; treat K consecutive instruction lines as one K × K simulated macro transfer group. This embodiment describes the process using a processor with 4 execution macros as an example; Table 3 lists the 4 × 4 simulated macro transfer group. Following the flow shown in Fig. 2, the simulated macro transfer operation begins by building a simulated macro transfer group;
Table 3: Simulated macro transfer group
Register | Macro 1 | Macro 2 | Macro 3 | Macro 4 |
---|---|---|---|---|
r6 | 0 | 1 | 2 | 3 |
r7 | 4 | 5 | 6 | 7 |
r8 | 8 | 9 | 10 | 11 |
r9 | 12 | 13 | 14 | 15 |
Step 10.2: Initialize j = 1;
Step 10.3: Initialize i = 1;
Step 10.4: Store the data in the i-th execution macro P_i of the j-th instruction line into execution macro P_((i+j-1) mod K) of the j-th instruction line, where 1 ≤ j ≤ K; in this way, data in different instruction lines of the same execution macro are adjusted into different execution macros of the corresponding instruction lines. This step is the core of the simulated macro transfer operation; as shown in Fig. 2, it is the key operation inside the double loop. Its kernel program segment is as follows, where the four execution macros are identified as x, y, z and t:
1. xr11:10=zr7:6 || zr7:6=xr11:10 || yr13:12=tr9:8 || tr9:8=yr13:12
2. xr9:8=yr7:6 || yr7:6=xr9:8 || zr13:12=tr11:10 || tr11:10=zr13:12
3. xr13:12=tr7:6 || tr7:6=xr13:12 || yr11:10=zr9:8 || zr9:8=yr11:10
Step 10.5: Assign i+1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4;
Step 10.6: Assign j+1 to j; if j > K, the transposition and rearrangement of the results is complete; otherwise, return to step 10.3. Table 4 shows the final rearranged result in this embodiment.
Table 4: Result after the simulated macro transfer
Register | Macro 1 | Macro 2 | Macro 3 | Macro 4 |
---|---|---|---|---|
r6 | 0 | 4 | 8 | 12 |
r7 | 1 | 5 | 9 | 13 |
r8 | 2 | 6 | 10 | 14 |
r9 | 3 | 7 | 11 | 15 |
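The three kernel instruction lines of step 10.4 can be read as a pairwise-swap transpose. Taking register pairs r7:6, r9:8, r11:10 and r13:12 as rows 0 to 3 and macros x, y, z, t as columns 0 to 3, the three lines swap exactly the element pairs (r, m) and (m, r) with r XOR m equal to 2, 1 and 3, respectively, so every macro sends one value and receives one value per line. The Python sketch below generalizes that reading to any power-of-two K; the XOR formulation is an interpretation of the kernel, not a rule stated in the text, it models data movement only (no timing, bank or pipeline behavior), and `simulated_macro_transfer` is a hypothetical name.

```python
def simulated_macro_transfer(tile):
    """Transpose a K x K tile stored as tile[row][macro] in K - 1 instruction lines.

    On line s (s = 1 .. K-1) every macro m writes its register row m XOR s
    with the value held in register row m of macro m XOR s, i.e. elements
    (r, m) and (m, r) with r XOR m == s are swapped. K must be a power of two.
    """
    k = len(tile)
    if k == 0 or k & (k - 1) or any(len(row) != k for row in tile):
        raise ValueError("tile must be K x K with K a power of two")
    out = [list(row) for row in tile]
    for s in range(1, k):
        # Read every source first, then write, as the parallel moves of one line do.
        moves = [(m ^ s, m, out[m][m ^ s]) for m in range(k)]
        for row, macro, value in moves:
            out[row][macro] = value
    return out


table3 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for row in simulated_macro_transfer(table3):
    print(row)
```

Applied to the contents of Table 3, the sketch yields the layout of Table 4 using K - 1 = 3 lines, matching the three kernel lines; the diagonal elements never move.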
Claims (2)
1. An FFT floating-point optimization method based on ILP and DLP, characterized by being carried out as follows:
Step 1: Let M be the length of the FFT input vector to be computed, and determine the number of iteration layers N from M, where M = 2^N, M and N are positive integers, and N ≥ 6; layers 1 to 4 are defined as the in-degree layers, layers 5 to N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers;
Step 2: Using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers;
Step 3: Perform the in-degree-layer butterfly calculation on the FFT input vector and FFT twiddle factors stored in the registers, and store the resulting in-degree-layer results in the temporary space;
Step 4: Assign N-4 to n;
Step 5: Determine whether n is odd; if so, go to step 6; otherwise, go to step 8;
Step 6: Read the in-degree-layer results from the temporary space together with the twiddle factors corresponding to layer N-n+1, perform the butterfly calculation, and store the layer N-n+1 results in the input vector space, overwriting its contents;
Step 7: Assign n-1 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 8: Read the results from the temporary space together with the twiddle factors corresponding to layers N-n+1 to N-n+5, perform the butterfly calculations, and store the results in the temporary space, overwriting its contents;
Step 9: Assign n-4 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 10: Transpose and rearrange the results in the temporary space using the simulated macro transfer operation, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 11: Perform the out-degree-layer butterfly calculation on the rearranged results and the twiddle factors corresponding to the out-degree layers, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
2. The FFT floating-point optimization method based on ILP and DLP according to claim 1, characterized in that the simulated macro transfer operation in step 10 is carried out as follows:
Step 10.1: Suppose the processor has K execution macros, the i-th of which is denoted P_i, where 1 ≤ i ≤ K and K is a positive integer; treat K consecutive instruction lines as one K × K simulated macro transfer group;
Step 10.2: Initialize j = 1;
Step 10.3: Initialize i = 1;
Step 10.4: Store the data in the i-th execution macro P_i of the j-th instruction line into execution macro P_((i+j-1) mod K) of the j-th instruction line, where 1 ≤ j ≤ K; in this way, data in different instruction lines of the same execution macro are adjusted into different execution macros of the corresponding instruction lines;
Step 10.5: Assign i+1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4;
Step 10.6: Assign j+1 to j; if j > K, the transposition and rearrangement of the results is complete; otherwise, return to step 10.3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610473373.XA CN106095730B (en) | 2016-06-23 | 2016-06-23 | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106095730A (en) | 2016-11-09 |
CN106095730B CN106095730B (en) | 2018-10-23 |
Family
ID=57253425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610473373.XA Active CN106095730B (en) | FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) | 2016-06-23 | 2016-06-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095730B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090164752A1 (en) * | 2004-08-13 | 2009-06-25 | Clearspeed Technology Plc | Processor memory system |
CN103902506A (en) * | 2014-04-16 | 2014-07-02 | 中国科学技术大学先进技术研究院 | FFTW3 optimization method based on loongson 3B processor |
CN105630737A (en) * | 2016-01-05 | 2016-06-01 | 合肥康捷信息科技有限公司 | Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101347A (en) * | 2018-07-16 | 2018-12-28 | 北京理工大学 | A kind of process of pulse-compression method of the FPGA heterogeneous computing platforms based on OpenCL |
CN109101347B (en) * | 2018-07-16 | 2021-07-20 | 北京理工大学 | Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL |
CN109783054A (en) * | 2018-12-20 | 2019-05-21 | 中国科学院计算技术研究所 | A kind of the butterfly computation processing method and system of RSFQ fft processor |
Also Published As
Publication number | Publication date |
---|---|
CN106095730B (en) | 2018-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |