CN106095730B - An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) - Google Patents

An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Info

Publication number
CN106095730B
Authority
CN
China
Prior art keywords
macro
calculation
fft
layer
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610473373.XA
Other languages
Chinese (zh)
Other versions
CN106095730A (en)
Inventor
顾乃杰
任开新
叶鸿
周文博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201610473373.XA priority Critical patent/CN106095730B/en
Publication of CN106095730A publication Critical patent/CN106095730A/en
Application granted granted Critical
Publication of CN106095730B publication Critical patent/CN106095730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141Discrete Fourier transforms
    • G06F17/142Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Abstract

The invention discloses an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by the following steps: 1. determine the number of iteration layers and divide them into a three-level structure; 2. complete the in-degree-layer computation using operations such as the bit-reversal instruction; 3. after completing the in-degree-layer computation, classify the upcoming middle-layer computation, handle the odd-layer and even-layer cases separately, and obtain the middle-layer results; 4. using a simulated macro-transmission operation, rearrange the middle-layer results and complete the out-degree-layer computation. The invention resolves the instruction-dependence and structural-hazard limitations present in existing algorithms and fully exploits the load efficiency of the functional units, thereby substantially increasing the average utilization of the bottleneck units.

Description

An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)
Technical field
The invention belongs to the fields of vector processors and digital signal processing, and specifically relates to a method for efficiently computing a floating-point FFT on hardware platforms based on ILP and DLP.
Background technology
The discrete Fourier transform (DFT) is widely used in modern signal-processing systems, for example in radar signal processing, SAR image processing, sonar computation, video and image algorithms, spectrum analysis, and speech recognition. Fourier-transform computation is a typical compute-intensive and memory-access-intensive application; for example, the computational complexity of an N-point DFT is O(N²). The fast Fourier transform (FFT) proposed by Cooley and Tukey in 1965 drastically reduces the amount of computation, lowering the complexity from O(N²) to O(N log₂ N). Signal-processing applications usually impose strict real-time requirements: the more efficient the FFT computation, the better the real-time behaviour of the signal processing.
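The complexity reduction can be seen in a minimal radix-2 decimation-in-time Cooley-Tukey recursion. This is a textbook sketch for illustration only, not the optimized method of the invention:

```python
import cmath

def fft(x):
    """Minimal radix-2 decimation-in-time Cooley-Tukey FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Each of the log₂ N recursion levels does O(N) butterfly work, giving the O(N log₂ N) total.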
Instruction-level parallelism (ILP) refers to a processor issuing multiple instructions for parallel execution within the same instruction cycle. Data-level parallelism (DLP) refers to architectures that perform parallel computation on different data at the same moment. Hardware platforms based on ILP and DLP typically employ VLIW and SIMD techniques and can carry out large-scale, highly efficient computation.
Because hardware platforms combining ILP and DLP techniques are relatively complex, research on fast Fourier transforms targeting them has not been well developed.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention proposes an FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), so as to resolve the instruction-dependence and structural-hazard limitations of existing algorithms and fully exploit the load efficiency of the functional units, thereby substantially increasing the average utilization of the bottleneck units.
To solve the above technical problem, the present invention adopts the following technical scheme:
The FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) of the present invention is characterized by the following steps:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers with N ≥ 6. The first four of the N iteration layers are defined as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers.
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers.
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space.
Step 4: assign N - 4 to n.
Step 5: if n is odd, go to step 6; otherwise, go to step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space.
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4 from the temporary space, perform the butterfly computation, and store the results, overwriting, into the temporary space.
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
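The control flow of steps 4 through 9 can be sketched as a layer schedule. This is an illustrative model, not part of the patent: it assumes each pass of step 8 covers the four layers N-n+1 through N-n+4, consistent with the decrement n := n - 4 and the four-layer middle model described later; `layer_schedule` is a hypothetical name.

```python
def layer_schedule(N):
    """Return the layer groups computed by the three-level structure:
    in-degree layers 1-4, middle layers in blocks (one single odd
    layer first when N-4 is odd, then blocks of four), and the two
    out-degree layers N-1 and N. Requires N >= 6."""
    groups = [("in-degree", [1, 2, 3, 4])]
    n = N - 4                                     # step 4
    if n % 2 == 1:                                # step 5
        groups.append(("middle", [N - n + 1]))    # step 6: one layer
        n -= 1                                    # step 7
    while n != 2:                                 # steps 8-9
        first = N - n + 1
        groups.append(("middle", list(range(first, first + 4))))
        n -= 4
    groups.append(("out-degree", [N - 1, N]))     # steps 10-11
    return groups
```

For N = 10 this yields in-degree layers 1-4, one middle block of layers 5-8, and out-degree layers 9-10, covering all layers exactly once.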
The FFT floating-point optimization method based on ILP and DLP of the present invention is further characterized in that the simulated macro-transmission operation in step 10 proceeds as follows:
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group.
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j; in this way the data of different instruction rows within the same execution macro are adjusted into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K.
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4.
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The invention proposes a new floating-point FFT optimization method adapted to the characteristics of ILP/DLP hardware platforms. By adjusting the radix-2 Cooley-Tukey algorithm structure and compressing its number of computation layers, and by using techniques such as the simulated macro-transmission operation, memory ping-pong, and cache operations, the fast Fourier transform is deployed efficiently on hardware platforms based on ILP and DLP. Clock overhead is effectively reduced, improving the platform's efficiency for FFT computation.
2. Because the invention uses a three-level computation structure, the originally multi-layer structure collapses to three levels, reducing the register refreshes caused by scheduling between inner and outer loops and the clock overhead of pipeline bubbles.
3. Because the invention employs memory ping-pong, reads and writes that originally targeted a single memory block are split across two ping-pong memory blocks, avoiding the clock overhead of simultaneous reads and writes to one memory and improving computational efficiency.
4. The simulated macro-transmission operation of the invention uses instruction-level parallelism to move data left in different execution clusters by data-level parallelism into the same execution cluster, guaranteeing the subsequent computation. The operation effectively avoids memory bank conflicts and improves the efficiency of data adjustment across the execution macros.
5. The invention further exploits the symmetry of the butterfly coefficients, reducing the number of prefetches of butterfly coefficients during computation and hence register usage. The operation removes nearly half of the twiddle factors, reducing both the memory space used and the number of registers occupied by twiddle factors.
6. Experiments verify that for its 1024-point, 32-bit floating-point complex Fourier transform, the method compresses the computation to 980 clock cycles; the utilization of the bottleneck functional units in the three levels of the structure reaches 96.68%, 98.25%, and 100%, respectively.
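The memory ping-pong of benefit 3 can be sketched as two buffers whose read and write roles swap every pass, so no buffer is read and written in the same pass. A minimal illustrative model; `iterate_with_pingpong` is a hypothetical name:

```python
def iterate_with_pingpong(data, passes):
    """Run a sequence of passes over `data` using two buffers in
    ping-pong fashion: each pass reads one buffer and writes the
    other, then the roles swap. `passes` is a list of functions
    mapping a list to a new list of the same length."""
    buf_a, buf_b = list(data), [None] * len(data)
    src, dst = buf_a, buf_b
    for p in passes:
        for i, v in enumerate(p(src)):
            dst[i] = v            # write only into dst
        src, dst = dst, src       # ping-pong: swap roles
    return src                    # src now holds the latest results
```

Because each pass has a dedicated read buffer and a dedicated write buffer, a pipelined implementation can overlap its loads and stores without a same-block read/write hazard.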
Description of the drawings
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the flowchart of the simulated macro-transmission operation;
Fig. 3 is the four-layer model used by the middle-layer computation of the invention.
Detailed description of the embodiments
The purpose of the present invention is to propose an optimization method for floating-point FFT suited to hardware platforms with instruction-level parallelism (ILP) and data-level parallelism (DLP), so that high-performance optimization can be carried out on the hardware infrastructure they provide. The following embodiment discusses the optimization method using the BWDSP104x platform as an example; however, the optimization techniques and methods of the present invention are not limited to the BWDSP104x platform. Any ILP/DLP hardware platform is suitable for the optimization scheme of the invention.
The BWDSP104x platform has 4 execution macros (x, y, z, t); each macro contains 8 arithmetic logic units (ALU), 8 multipliers (MUL), 4 shifters (SHIFT), 1 super arithmetic unit, and a general register file of 128 registers. It has an 11-stage pipeline, and each instruction row can issue up to 16 instruction words in parallel.
In this embodiment, the FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) proceeds as follows:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M. This embodiment is illustrated with an input vector of length 1024; other lengths can be handled by a similar scheme. Here M = 2^N; M and N are positive integers with N ≥ 6; in this case N = 10 and M = 1024. The first four of the N iteration layers are the in-degree layers, layers 5 through N-2 are the middle layers, and layers N-1 and N are the out-degree layers. Fig. 1 is the flowchart of this FFT computation: steps 1-4 in the figure depict the in-degree-layer computation, steps 5-7 the middle-layer computation, and steps 8-10 the out-degree-layer computation.
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers. Table 1 shows, for a digital signal processor with 4 execution macros, the data held in each register after the bit-reversal read. The instruction's behaviour makes the data read by different registers of the same execution macro bit-reversed, while the data read by the same register across different execution macros are sequential. Table 2 lists the twiddle factors needed by the in-degree layers; as the table shows, they can all be replaced by just three numbers: cos(π/4), sin(π/8), and cos(π/8).
Table 1: data read in bit-reversed order (each number is the element's index in the array)
x y z t
r7:6 0 1 2 3
r9:8 512 513 514 515
r11:10 256 257 258 259
r13:12 768 769 770 771
Table 2: twiddle factors for the first four layers
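The bit-reversed read pattern of Table 1 can be reproduced with an ordinary bit-reversal function. This sketch assumes, matching the four rows shown, that register pair p of macro m receives element index bit_reverse(p) + m; the function names are illustrative:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of i (models the addressing
    performed by a bit-reversal instruction)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def bitrev_read(n_bits, macros=4, reg_pairs=4):
    """Reproduce the read pattern of Table 1: row p of the result
    corresponds to one register pair, column m to one execution
    macro, holding element index bit_reverse(p) + m."""
    return [[bit_reverse(p, n_bits) + m for m in range(macros)]
            for p in range(reg_pairs)]
```

For a 1024-point input (n_bits = 10) this gives rows 0-3, 512-515, 256-259, and 768-771, exactly the r7:6, r9:8, r11:10, and r13:12 rows of Table 1.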
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space. The temporary space is opened up so that it can perform memory ping-pong with the input-vector space, allowing the computation to complete reads and writes simultaneously when the pipeline is full.
Step 4: assign N - 4 to n.
Step 5: if n is odd, go to step 6; otherwise, go to step 8.
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space.
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4, perform the butterfly computation, and store the results, overwriting, into the other space. The read and write spaces in this step depend on the processing path: if this FFT computation has passed through an odd-layer computation (step 6), the read space is the input-vector space and the write space is the allocated temporary space; if it has not, the read space is the allocated temporary space and the write space is the input-vector space. This is the memory ping-pong process.
The middle four-layer computation model is similar to that of the first four layers, except that 16 numbers are combined into one data block, and the computation then proceeds between the data of the units, with the data block as the unit. Fig. 3 is a schematic of the middle four-layer model; with data blocks as units, each butterfly is carried out between the data at corresponding positions within the blocks. The dashed box in the upper-left corner of the figure gives a brief description of a data block.
The present invention further exploits the quarter symmetry of the middle-layer twiddle factors. Because the middle-layer computation of the invention merges what were originally four layers of the computation structure, the data dependences (DP) that existed between the original layers must be resolved by keeping data in a large number of registers. This puts great pressure on register usage, especially for large-scale computation on a DSP based on ILP and DLP. Using the quarter symmetry of the twiddle factors, the second half of the current layer's twiddle factors can be expressed through the first half at the cost of one extra multiplication. Formula (3), derived from formula (1), characterizes the quarter symmetry possessed by the twiddle factors.
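The quarter symmetry in question is the identity W_N^(k+N/4) = -j · W_N^k for the twiddle factor W_N^k = exp(-2πjk/N), so any factor in the second quarter costs only one extra multiplication by -j applied to a first-quarter factor. A sketch with illustrative function names:

```python
import cmath

def twiddle(k, N):
    """Twiddle factor W_N^k = exp(-2*pi*j*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)

def twiddle_from_first_quarter(k, N):
    """Quarter symmetry: for N/4 <= k < N/2, recover W_N^k from a
    first-quarter factor by one extra multiplication by -j."""
    if k < N // 4:
        return twiddle(k, N)
    return -1j * twiddle(k - N // 4, N)
```

This halves the table of distinct factors that must be prefetched for the first half of the spectrum, which is the register and memory saving described above.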
In this embodiment, for the two forms of twiddle factor required by the middle-layer computation, the core butterfly code is program segment 1 and program segment 2, respectively. In the program segments below, r11:10 and r13:12 hold the two operands of the butterfly to be performed; r53:52 holds the real and imaginary parts of the butterfly coefficient; r15:14 serves as a temporary register holding the intermediate result of the complex multiplication.
Program segment 1:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data and twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12_11:10 = cfr13:12 +/- cfr11:10    // butterfly computation
Program segment 2:
① cfr11:10 = cfr11:10 * fr53
② || cfr15:14 = cfr11:10 * fr52    // temporary data and twiddle factor
③ fr11 = fr11 - fr14
④ || fr10 = fr10 + fr15            // real and imaginary parts after the complex multiplication
⑤ cfr13:12 = cfr13:12 - jcfr11:10
⑥ || cfr11:10 = cfr13:12 + jcfr11:10    // butterfly computation
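Functionally, the program segments implement a radix-2 butterfly whose complex multiplication by the twiddle factor is carried out with separate real operations. The following is a behavioural model only, not an instruction-accurate transcription of the BWDSP register usage:

```python
def butterfly(a, b, w_re, w_im):
    """Radix-2 butterfly: multiply b by the twiddle factor
    w = w_re + j*w_im using real operations, then form a + b*w
    and a - b*w. a and b are Python complex numbers."""
    # complex multiply b * w via four real multiplies (cf. lines 1-4)
    t_re = b.real * w_re - b.imag * w_im
    t_im = b.imag * w_re + b.real * w_im
    bw = complex(t_re, t_im)
    # butterfly add/subtract (cf. line 5)
    return a + bw, a - bw
```

Segment 2's ±j variant corresponds to a twiddle factor of -j, for which the multiplication reduces to a real/imaginary swap with a sign change.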
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8.
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers.
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
The simulated macro-transmission operation in step 10 proceeds as follows:
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group. This embodiment describes the process for a processor with 4 execution macros; Table 3 lists the 4 × 4 simulated macro-transmission operation group. In the flow shown in Fig. 2, the operation group constructed at the very start is the simulated macro group used for the simulated macro-transmission operation.
Table 3: the simulated macro-transmission operation group
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 1 2 3
r7 4 5 6 7
r8 8 9 10 11
r9 12 13 14 15
Step 10.2: initialize j = 1.
Step 10.3: initialize i = 1.
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j; in this way the data of different instruction rows within the same execution macro are adjusted into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K. This step is the core of the simulated macro-transmission operation; as shown in Fig. 2, it is the key inner operation of the double loop. Its kernel program segment is as follows, with the four execution macros identified by x, y, z, and t:
①xr11:10=zr7:6||zr7:6=xr11:10||yr13:12=tr9:8||tr9:8=yr13:12
②xr9:8=yr7:6||yr7:6=xr9:8||zr13:12=tr11:10||tr11:10=zr13:12
③xr13:12=tr7:6||tr7:6=xr13:12||yr11:10=zr9:8||zr9:8=yr11:10
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4.
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3. Table 4 gives the final rearranged result in this embodiment.
Table 4: result after the simulated macro transmission
Macro 1 Macro 2 Macro 3 Macro 4
r6 0 4 8 12
r7 1 5 9 13
r8 2 6 10 14
r9 3 7 11 15
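Comparing Tables 3 and 4, the net effect of the kernel's pairwise inter-macro exchanges is a K × K transpose across the execution macros. A sketch modelling the exchanges as pairwise swaps on a register-row × macro grid; the function name is illustrative:

```python
def simulated_macro_transmission(group):
    """Model the simulated macro-transmission rearrangement:
    group[r][m] is the value in register row r of execution macro m.
    The pairwise inter-macro exchanges of the kernel amount to a
    K x K transpose, turning Table 3 into Table 4. Modifies and
    returns `group`."""
    K = len(group)
    for r in range(K):
        for m in range(r + 1, K):
            # exchange: row r of macro m <-> row m of macro r
            group[r][m], group[m][r] = group[m][r], group[r][m]
    return group
```

Starting from Table 3's row-major layout (macro m, row r holding value 4r + m), the result is Table 4's column-major layout, with each macro now holding one contiguous block of four values.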

Claims (1)

1. An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP), characterized by the following steps:
Step 1: let the length of the FFT input vector to be computed be M, and determine the number of iteration layers N from M, where M = 2^N; M and N are positive integers with N ≥ 6; the first four of the N iteration layers are defined as the in-degree layers, layers 5 through N-2 as the middle layers, and layers N-1 and N as the out-degree layers;
Step 2: using the bit-reversal instruction, read the FFT input vector into registers in bit-reversed order, and read the FFT twiddle factors corresponding to the in-degree layers into the corresponding registers;
Step 3: perform the in-degree-layer butterfly computation on the FFT input vector and twiddle factors held in the registers, and store the in-degree-layer results in a temporary memory space;
Step 4: assign N - 4 to n;
Step 5: if n is odd, go to step 6; otherwise, go to step 8;
Step 6: read the in-degree-layer results and the twiddle factors corresponding to layer N-n+1 from the temporary space, perform the butterfly computation, and store the layer-(N-n+1) results over the input-vector space;
Step 7: assign n - 1 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 8: read the previous results and the twiddle factors corresponding to layers N-n+1 through N-n+4 from the temporary space, perform the butterfly computation, and store the results, overwriting, into the temporary space;
Step 9: assign n - 4 to n; if n = 2, go to step 10; otherwise, go to step 8;
Step 10: using the simulated macro-transmission operation, transpose and rearrange the results in the temporary space, and read the twiddle factors corresponding to the out-degree layers into the corresponding registers;
Step 10.1: let the processor have K execution macros, the i-th denoted Pi, with 1 ≤ i ≤ K and K a positive integer; take K consecutive instruction rows as one K × K simulated macro-transmission operation group;
Step 10.2: initialize j = 1;
Step 10.3: initialize i = 1;
Step 10.4: in instruction row j, store the data held in the i-th execution macro Pi into execution macro P(i+j-1) mod K of instruction row j, thereby adjusting the data of different instruction rows within the same execution macro into the different execution macros of the corresponding instruction rows; 1 ≤ j ≤ K;
Step 10.5: assign i + 1 to i; if i > K, go to step 10.6; otherwise, return to step 10.4;
Step 10.6: assign j + 1 to j; if j > K, the transpose rearrangement of the results is complete; otherwise, return to step 10.3;
Step 11: perform the out-degree-layer butterfly computation on the transposed results and the out-degree-layer twiddle factors, and store the out-degree-layer results in the output memory space, completing the FFT floating-point optimization method.
CN201610473373.XA 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP) Active CN106095730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610473373.XA CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Publications (2)

Publication Number Publication Date
CN106095730A CN106095730A (en) 2016-11-09
CN106095730B true CN106095730B (en) 2018-10-23

Family

ID=57253425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610473373.XA Active CN106095730B (en) 2016-06-23 2016-06-23 An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)

Country Status (1)

Country Link
CN (1) CN106095730B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101347B (en) * 2018-07-16 2021-07-20 北京理工大学 Pulse compression processing method of FPGA heterogeneous computing platform based on OpenCL
CN109783054B (en) * 2018-12-20 2021-03-09 中国科学院计算技术研究所 Butterfly operation processing method and system of RSFQ FFT processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902506A (en) * 2014-04-16 2014-07-02 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN105630737A (en) * 2016-01-05 2016-06-01 合肥康捷信息科技有限公司 Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2417105B (en) * 2004-08-13 2008-04-09 Clearspeed Technology Plc Processor memory system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902506A (en) * 2014-04-16 2014-07-02 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN105630737A (en) * 2016-01-05 2016-06-01 合肥康捷信息科技有限公司 Optimizing method of split-radix FFT (fast fourier transform) algorithm based on ternary tree

Also Published As

Publication number Publication date
CN106095730A (en) 2016-11-09

Similar Documents

Publication Publication Date Title
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
Tanomoto et al. A cgra-based approach for accelerating convolutional neural networks
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN108694690A (en) Subgraph in frequency domain and the dynamic select to the convolution realization on GPU
CN104025067B (en) With the processor for being instructed by vector conflict and being replaced the shared full connection interconnection of instruction
CN103955447B (en) FFT accelerator based on DSP chip
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
Fan et al. Stream processing dual-track CGRA for object inference
CN105468439A (en) Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN105808309A (en) High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
JP2023506343A (en) Vector reduction using shared scratchpad memory
CN110163333A (en) The parallel optimization method of convolutional neural networks
CN106095730B (en) An FFT floating-point optimization method based on instruction-level parallelism (ILP) and data-level parallelism (DLP)
CN108431770A (en) Hardware aspects associated data structures for accelerating set operation
CN106990995B (en) Circular block size selection method based on machine learning
CN109739556A (en) A kind of general deep learning processor that interaction is cached based on multiple parallel and is calculated
CN106933777A (en) The high-performance implementation method of the one-dimensional FFT of base 2 based on the domestic processor of Shen prestige 26010
CN110837483B (en) Tensor dimension transformation method and device
Li et al. Automatic FFT performance tuning on OpenCL GPUs
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
Tan et al. A pipelining loop optimization method for dataflow architecture
Wu et al. Parallel artificial neural network using CUDA-enabled GPU for extracting hydraulic domain knowledge of large water distribution systems
CN106204669A (en) A kind of parallel image compression sensing method based on GPU platform
CN110008436A (en) Fast Fourier Transform (FFT) method, system and storage medium based on data stream architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant