CN106095730A - An FFT floating-point optimization method based on ILP and DLP - Google Patents

An FFT floating-point optimization method based on ILP and DLP Download PDF

Info

Publication number
CN106095730A
Authority
CN
China
Prior art keywords
layer
macro
calculation
fft
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610473373.XA
Other languages
Chinese (zh)
Other versions
CN106095730B (en
Inventor
顾乃杰
任开新
叶鸿
周文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610473373.XA priority Critical patent/CN106095730B/en
Publication of CN106095730A publication Critical patent/CN106095730A/en
Application granted granted Critical
Publication of CN106095730B publication Critical patent/CN106095730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Abstract

The invention discloses an FFT floating-point optimization method based on ILP and DLP, characterized by the following steps: 1. determine the number of iteration layers and divide them into a three-layer structure; 2. complete the in-degree-layer computation using operations such as the bit-reverse load instruction; 3. after the in-degree-layer computation, classify the remaining intermediate-layer computation into odd-layer and even-layer cases, compute each case separately, and obtain the intermediate-layer results; 4. use the simulated inter-macro transfer operation to rearrange the intermediate-layer results and complete the out-degree-layer computation. The invention resolves the instruction-dependence and structural-hazard limitations present in the original algorithm and fully exploits the load capacity of the arithmetic units, thereby substantially raising the average utilization of the bottleneck functional units.

Description

An FFT floating-point optimization method based on ILP and DLP
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and relates specifically to methods for efficient floating-point FFT computation on hardware platforms based on ILP and DLP.
Background technology
The Discrete Fourier Transform (DFT) is widely used in modern signal processing systems, for example in radar signal processing, SAR image processing, sonar computation, video and image algorithms, spectrum analysis and speech recognition. The Fourier transform is a typical computation-intensive and memory-access-intensive application; the computational complexity of an N-point DFT, for instance, is O(N^2). In 1965 Cooley and Tukey proposed the Fast Fourier Transform (FFT), a computational method that greatly reduces the amount of work, lowering the complexity from O(N^2) to O(N log2 N). Signal processing applications generally have stringent real-time requirements, so the higher the FFT computational efficiency, the better the real-time performance of the signal processing.
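As a point of reference, the O(N log2 N) structure mentioned above can be sketched in Python. This is a minimal recursive radix-2 Cooley-Tukey FFT for illustration only; the patent itself targets an iterative, register-level DSP implementation, and the function name here is our own:

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    Does O(N log2 N) work versus O(N^2) for the direct DFT."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]           # butterfly, upper output
        out[k + n // 2] = even[k] - w * odd[k]  # butterfly, lower output
    return out
```

Comparing the result against the direct O(N^2) DFT on a small input confirms the two agree while the recursion does far less work at large N.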
Instruction-Level Parallelism (ILP) means that the processor issues multiple instructions for parallel execution within the same instruction cycle. Data-Level Parallelism (DLP) refers to an architecture that performs parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP typically employ VLIW and SIMD technology so that large-scale, efficient computation can be carried out.
Because hardware platforms that combine ILP and DLP technology are relatively complex, research on the Fast Fourier Transform targeting them has not yet been carried out in depth.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on ILP and DLP that resolves the instruction-dependence and structural-hazard limitations of the original algorithm and fully exploits the load capacity of the arithmetic units, thereby substantially raising the average utilization of the bottleneck functional units.
To solve the above technical problem, the present invention adopts the following technical solution:
An FFT floating-point optimization method based on ILP and DLP according to the present invention is characterized by the following steps:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M, where M = 2^N, M and N are positive integers, and N >= 6. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers.
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results from the temporary buffer together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the temporary buffer.
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
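The butterfly operation that the steps above repeatedly apply to register contents can be sketched as a scalar Python model. The function names and the flat-array layout are our own illustration, not the patent's register-level code:

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: combines two partial
    results a, b with twiddle factor w into two outputs."""
    t = w * b
    return a + t, a - t

def layer_pass(data, layer):
    """One iteration layer of an in-place radix-2 FFT over a
    bit-reversed input. layer is 1-based; a length-2**N input
    needs layers 1..N."""
    n = len(data)
    span = 1 << layer          # butterfly group size at this layer
    half = span >> 1
    for start in range(0, n, span):
        for k in range(half):
            w = cmath.exp(-2j * cmath.pi * k / span)
            a, b = data[start + k], data[start + k + half]
            data[start + k], data[start + k + half] = butterfly(a, b, w)
```

Running layers 1 through N over a bit-reversed input yields the DFT; the patent's contribution is how these layer passes are grouped (four in-degree layers, fused intermediate layers, two out-degree layers) and scheduled on the DSP.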
The FFT floating-point optimization method based on ILP and DLP of the present invention is further characterized in that
the simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer. Treat K consecutive instruction rows as one K x K simulated inter-macro transfer group.
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K.
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a new floating-point FFT optimization method adapted to the characteristics of ILP and DLP hardware platforms. By restructuring the radix-2 Cooley-Tukey algorithm, compressing the number of computation layers, and employing techniques such as the simulated inter-macro transfer operation, memory ping-pong operation and cache operations, it deploys the Fast Fourier Transform efficiently on hardware platforms based on ILP and DLP. This effectively reduces clock-cycle overhead and thus improves the efficiency of the hardware platform for FFT computation.
2. Because the present invention uses a three-layer computation structure, the original multi-layer computation is reduced to three layers. This reduces the register refreshes caused by scheduling between inner and outer loops, and the clock overhead caused by pipeline flushes.
3. Because the present invention employs the memory ping-pong operation, reads and writes that were originally served by a single memory block are split across two ping-pong memory blocks. This avoids the clock overhead of reading and writing the same memory at the same time, improving computational efficiency.
4. The simulated inter-macro transfer operation of the present invention uses instruction-level parallelism to move data among the execution clusters created by data-level parallelism, placing each datum in the cluster that needs it for the subsequent computation. This operation effectively avoids memory bank conflicts and improves the efficiency of data adjustment in each execution macro.
5. The present invention further exploits the symmetry of the butterfly coefficients, reducing the number of butterfly-coefficient prefetches during computation and thus the register usage. This eliminates nearly half of the twiddle factors, reducing both the memory footprint and the number of registers occupied by twiddle factors.
6. Experimental verification shows that for a 32-bit floating-point complex Fourier transform with 1024 inputs, the computation of the present method is compressed to 980 clock cycles, and the utilization of the bottleneck functional units in the three layers of the computation structure reaches 96.68%, 98.25% and 100% respectively.
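The reported 980-cycle result can be sanity-checked with a rough operation count. The cost assumptions here (a radix-2 butterfly takes 4 real multiplies and 6 real adds; all 4 x 8 multipliers and 4 x 8 ALUs of a BWDSP-like machine can issue every cycle) are ours, not the patent's:

```python
# Lower bounds on cycles for a 1024-point radix-2 complex FFT on a machine
# with 32 multipliers and 32 ALUs (4 execution macros x 8 units each).
M, N = 1024, 10                   # M = 2**N points, N iteration layers
butterflies = (M // 2) * N        # 512 butterflies per layer, 10 layers
muls = 4 * butterflies            # complex multiply: 4 real multiplies
adds = 6 * butterflies            # complex multiply + 2 complex adds: 6 real adds
mul_bound = muls // 32            # cycles if multipliers were the bottleneck
add_bound = adds // 32            # cycles if ALUs were the bottleneck
print(mul_bound, add_bound)       # 640 960: the ALUs dominate
print(round(add_bound / 980, 4))  # 0.9796: consistent with ~98% utilization
```

Under these assumptions the ALUs are the bottleneck at 960 ideal cycles, so 980 measured cycles would correspond to roughly 98% bottleneck utilization, in line with the figures quoted above.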
Accompanying drawing explanation
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of the simulated inter-macro transfer operation;
Fig. 3 shows the four-layer computation model used by the intermediate layers.
Detailed description of the invention
The purpose of the present invention is to propose an optimization method for floating-point FFT suited to ILP and DLP hardware platforms, so that high-performance optimization can be carried out on the hardware infrastructure they provide. The following embodiment discusses the optimization method using the BWDSP104x platform as an example, but the optimization techniques and methods of the present invention are not limited to the BWDSP104x platform; the optimization scheme of the present invention is applicable to any ILP and DLP hardware platform.
The BWDSP104x platform has 4 execution macros (x, y, z, t). Each macro contains 8 arithmetic logic units (ALU), 8 multipliers (MUL), 4 shifters (SHIFT), 1 special function unit, and a general-purpose register file of 128 registers. It has an 11-stage pipeline, and each instruction row can issue up to 16 word instructions in parallel.
In this embodiment, an FFT floating-point optimization method based on ILP and DLP is carried out as follows:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M. This embodiment is illustrated with an input vector of length 1024; other lengths can be implemented by a similar scheme. Here M = 2^N with M, N positive integers and N >= 6; in this case M = 1024 and N = 10. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers. Fig. 1 is the flowchart of this FFT computation: steps 1-4 in the figure depict the in-degree-layer computation, steps 5-7 the intermediate-layer computation, and steps 8-10 the out-degree-layer computation.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers. Table 1 shows, for a digital signal processor with 4 execution macros, the data stored in each register after the bit-reverse load. Owing to the instruction's addressing pattern, the data read into the different registers of the same execution macro are in bit-reversed order, while the data read into the same register across different execution macros are in sequential order. Table 2 lists the twiddle factors needed by the in-degree layers; as the table shows, they can all be replaced by just three values: cos(π/4), sin(π/8) and cos(π/8).
Table 1: Data read by the bit-reverse load (each number denotes the element's position in the input array)
x y z t
r7:6 0 1 2 3
r9:8 512 513 514 515
r11:10 256 257 258 259
r13:12 768 769 770 771
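The bit-reversed read pattern in Table 1 can be modelled in software as follows (an illustrative sketch; the hardware bit-reverse instruction computes this addressing directly, and the helper names are ours). For a 1024-point input the first reversed positions are 0, 512, 256, 768, matching the register column of the x macro above:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i, which is what
    bit-reversed addressing computes per element."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

def bit_reverse_load(x, bits):
    """Software model of the bit-reversed load of step 2:
    element k of the result comes from position bit_reverse(k)."""
    return [x[bit_reverse(i, bits)] for i in range(len(x))]
```

For an 8-point input this produces the familiar permutation 0, 4, 2, 6, 1, 5, 3, 7.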
Table 2: Twiddle factors for the first four layers
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer. The temporary buffer is opened so that it can perform memory ping-pong with the input vector space, allowing the computation to complete its reads and writes while keeping the pipeline full.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the other buffer. The read and store spaces in this step depend on the processing path: if this FFT computation has just passed through an odd-layer computation, the read space is the input vector space and the store space is the temporary buffer; otherwise the read space is the temporary buffer and the store space is the input vector space. This is the memory ping-pong process.
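The ping-pong exchange between the input vector space and the temporary buffer described above can be sketched as follows. This is an illustrative model only; on the real DSP the buffers are fixed memory blocks addressed directly, and the helper names are our assumptions:

```python
def run_layers_pingpong(data, passes):
    """Memory ping-pong: each pass reads from one buffer and writes to
    the other, so a read and a write never hit the same memory block.
    `passes` is a list of functions mapping an input list to an output
    list (standing in for the butterfly passes of the layers)."""
    src = list(data)             # buffer A (initially the input vector space)
    dst = [None] * len(data)     # buffer B (the temporary buffer)
    for p in passes:
        for i, v in enumerate(p(src)):
            dst[i] = v           # write only to dst while reading only src
        src, dst = dst, src      # swap roles instead of copying back
    return src
```

After an even number of passes the final results sit in the original buffer, after an odd number in the other, which is exactly the odd-layer/even-layer case distinction made in step 8.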
The computation model of the middle four layers is similar to that of the first four layers, except that 16 numbers are grouped into one data block and the computation then proceeds between the elements of each block, with the blocks as the unit. Fig. 3 sketches the computation model of the middle four layers: with data blocks as the unit, the data taking part in each butterfly computation still follow the usual order within a block. The dashed box in the upper-left corner of the figure gives a brief description of a data block.
The present invention further exploits the quarter symmetry of the intermediate-layer twiddle factors. Because the intermediate-layer computation of the present invention merges four layers of the original computation structure, the data dependences that existed between the original layers must be resolved by holding data in a large number of registers. This puts great pressure on register usage, especially when performing large-scale computation on a DSP based on ILP and DLP. The quarter symmetry of the twiddle factors can then be used: the twiddle factors of the second half of the current layer are obtained from those of the first half by a single complex multiplication. Formula (3), derived from formula (1), characterizes this quarter symmetry of the twiddle factors.
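The quarter symmetry can be written out concretely: with W_n^k = e^(-2*pi*j*k/n), one has W_n^(k+n/4) = -j * W_n^k, so a table holding only the first quarter of the twiddle factors suffices and each remaining factor costs one extra complex multiply. The patent's formulas (1) and (3) are not reproduced on this page, so the derivation below is our own sketch of the idea:

```python
import cmath

def twiddle(k, n):
    """Twiddle factor W_n^k = exp(-2*pi*j*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def twiddle_from_quarter(k, n, quarter):
    """Derive W_n^k for 0 <= k < n/2 from a table holding only the
    first quarter, using the symmetry W_n^(k + n/4) = -1j * W_n^k."""
    q = n // 4
    if k < q:
        return quarter[k]          # stored directly
    return -1j * quarter[k - q]    # one extra complex multiply
```

Storing n/4 instead of n/2 factors roughly halves both the twiddle-factor memory footprint and the registers they occupy, which is the effect claimed in beneficial effect 5.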
For the two twiddle factors required by the intermediate-layer computation in this embodiment, the core code realizing the butterfly computation is given as program segment 1 and program segment 2 respectively. In the program segments below, r11:10 and r13:12 hold the two groups of numbers needed by the butterfly computation; r53:52 holds the real and imaginary parts of the butterfly coefficient; and r15:14 serves as a temporary register holding the intermediate results of the complex multiplication.
Program segment 1:
Program segment 2:
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
The simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer. Treat K consecutive instruction rows as one K x K simulated inter-macro transfer group. This embodiment describes the process using a processor with 4 execution macros; Table 3 lists the 4 x 4 simulated inter-macro transfer group. As the flow in Fig. 2 shows, the simulated inter-macro transfer first builds such a transfer group.
Table 3: The simulated inter-macro transfer group
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 1 2 3
r7 4 5 6 7
r8 8 9 10 11
r9 12 13 14 15
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K. This step is the core of the simulated inter-macro transfer operation; as Fig. 2 shows, it is the key operation inside the two nested loops. Its kernel code is as follows, where the four execution macros are identified by x, y, z and t respectively:
1. xr11:10=zr7:6 | | zr7:6=xr11:10 | | yr13:12=tr9:8 | | tr9:8=yr13:12
2. xr9:8=yr7:6 | | yr7:6=xr9:8 | | zr13:12=tr11:10 | | tr11:10=zr13:12
3. xr13:12=tr7:6 | | tr7:6=xr13:12 | | yr11:10=zr9:8 | | zr9:8=yr11:10
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3. Table 4 gives the final rearranged result in this embodiment.
Table 4: The result after the simulated inter-macro transfer
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 4 8 12
r7 1 5 9 13
r8 2 6 10 14
r9 3 7 11 15
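The rearrangement from Table 3 to Table 4 is a 4 x 4 transpose. One schedule that realizes it in K-1 parallel exchange rounds, mirroring the three instruction rows of the kernel code above (the XOR pairing here is our choice for illustration and not necessarily the patent's exact register assignment), is:

```python
def macro_transpose(mat):
    """Transpose a K x K matrix in place (rows = register lines,
    columns = execution macros), K a power of two, using K-1 rounds
    of pairwise exchanges. In round s, element (r, r^s) is swapped
    with (r^s, r); within a round all swaps touch disjoint macro
    pairs, so each round could issue as one parallel instruction row."""
    k = len(mat)
    for s in range(1, k):            # rounds = instruction rows
        for r in range(k):
            c = r ^ s
            if r < c:                # visit each unordered pair once
                mat[r][c], mat[c][r] = mat[c][r], mat[r][c]
    return mat
```

Applied to the Table 3 layout (row r holds 4r..4r+3 across the macros), this yields exactly the Table 4 layout.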

Claims (2)

1. An FFT floating-point optimization method based on ILP and DLP, characterized by the following steps:
Step 1: Assume the FFT input vector to be computed has length M, and determine the number of iteration layers N from the length M, where M = 2^N, M and N are positive integers, and N >= 6. Define the first four of the N iteration layers as the in-degree layers, layers 5 through N-2 as the intermediate layers, and layers N-1 and N as the out-degree layers.
Step 2: Use the bit-reverse load instruction to read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors for the in-degree layers into the corresponding registers.
Step 3: Perform the in-degree-layer butterfly computations on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary buffer.
Step 4: Assign N-4 to n.
Step 5: If n is odd, go to step 6; otherwise go to step 8.
Step 6: Read the in-degree-layer results from the temporary buffer together with the twiddle factors of layer N-n+1 and perform the butterfly computation, storing the layer-(N-n+1) results over the input vector space.
Step 7: Assign n-1 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 8: Read the current results from the temporary buffer together with the twiddle factors of layers N-n+1 through N-n+4, perform the butterfly computations, and store the results back over the temporary buffer.
Step 9: Assign n-4 to n. If n = 2, go to step 10; otherwise go to step 8.
Step 10: Using the simulated inter-macro transfer operation, transpose and rearrange the results in the temporary buffer, and read the twiddle factors for the out-degree layers into the corresponding registers.
Step 11: Perform the out-degree-layer butterfly computations on the rearranged results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, thereby completing the FFT floating-point optimization method.
2. The FFT floating-point optimization method based on ILP and DLP according to claim 1, characterized in that the simulated inter-macro transfer operation in step 10 is carried out as follows:
Step 10.1: Define the processor to have K execution macros, the i-th of which is denoted P_i, with 1 <= i <= K and K a positive integer; treat K consecutive instruction rows as one K x K simulated inter-macro transfer group.
Step 10.2: Initialize j = 1.
Step 10.3: Initialize i = 1.
Step 10.4: Store the data held by the i-th execution macro P_i in instruction row j into the ((i+j-1) mod K)-th execution macro P_((i+j-1) mod K) of instruction row j, thereby redistributing the data of different instruction rows within the same execution macro to different execution macros of the corresponding instruction rows; 1 <= j <= K.
Step 10.5: Assign i+1 to i, and judge whether i > K holds; if so, go to step 10.6; otherwise return to step 10.4.
Step 10.6: Assign j+1 to j, and judge whether j > K holds; if so, the transposed rearrangement of the results is complete; otherwise return to step 10.3.
CN201610473373.XA 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) Active CN106095730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Publications (2)

Publication Number Publication Date
CN106095730A true CN106095730A (en) 2016-11-09
CN106095730B CN106095730B (en) 2018-10-23

Family

ID=57253425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610473373.XA Active CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Country Status (1)

Country Link
CN (1) CN106095730B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164752A1 (en) * 2004-08-13 2009-06-25 Clearspeed Technology Plc Processor memory system
CN103902506A * 2014-04-16 2014-07-02 Institute of Advanced Technology, University of Science and Technology of China FFTW3 optimization method based on Loongson 3B processor
CN105630737A * 2016-01-05 2016-06-01 Hefei Kangjie Information Technology Co., Ltd. Optimizing method of split-radix FFT (fast Fourier transform) algorithm based on ternary tree

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101347A * 2018-07-16 2018-12-28 Beijing Institute of Technology A pulse compression processing method for an FPGA heterogeneous computing platform based on OpenCL
CN109101347B * 2018-07-16 2021-07-20 Beijing Institute of Technology Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL
CN109783054A * 2018-12-20 2019-05-21 Institute of Computing Technology, Chinese Academy of Sciences A butterfly computation processing method and system for an RSFQ FFT processor

Also Published As

Publication number Publication date
CN106095730B (en) 2018-10-23

Similar Documents

Publication Publication Date Title
Cao et al. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity
CN108268423B (en) Microarchitecture implementing enhanced parallelism for sparse linear algebraic operations with write-to-read dependencies
CN107844322B (en) Apparatus and method for performing artificial neural network forward operations
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN112559051A (en) Deep learning implementation using systolic arrays and fusion operations
CN104025067B (en) With the processor for being instructed by vector conflict and being replaced the shared full connection interconnection of instruction
US20160283240A1 (en) Apparatuses and methods to accelerate vector multiplication
CN103955447B (en) FFT accelerator based on DSP chip
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
EP3451239A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN106933777B High-performance implementation method of the radix-2 one-dimensional FFT on the domestic Shenwei 26010 processor
CN108431770A (en) Hardware aspects associated data structures for accelerating set operation
CN103955446A (en) DSP-chip-based FFT computing method with variable length
CN110163333A (en) The parallel optimization method of convolutional neural networks
CN106095730A (en) A kind of FFT floating-point optimization method based on ILP and DLP
CN111401537A (en) Data processing method and device, computer equipment and storage medium
Li et al. Automatic FFT performance tuning on OpenCL GPUs
CN113741977B (en) Data operation method, data operation device and data processor
Cantó-Navarro et al. Floating-point accelerator for biometric recognition on FPGA embedded systems
Mermer et al. Efficient 2D FFT implementation on mediaprocessors
CN103902506A (en) FFTW3 optimization method based on loongson 3B processor
Lee et al. Large‐scale 3D fast Fourier transform computation on a GPU
JP3709291B2 (en) Fast complex Fourier transform method and apparatus
Saybasili et al. Highly parallel multi-dimensional fast Fourier transform on fine- and coarse-grained many-core approaches

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant