CN103678255A

CN103678255A - Optimization method for efficient parallel FFT implementation on the Loongson 3 processor

Info

Publication number: CN103678255A
Application number: CN201310689271.8A
Authority: CN
Inventors: 顾乃杰; 江国荐; 任开新
Original assignee: HEFEI YOURUAN INFORMATION TECHNOLOGY Co Ltd
Current assignee: HEFEI COMJAY INFORMATION TECHNOLOGY CO., LTD.
Priority date: 2013-12-16
Filing date: 2013-12-16
Publication date: 2014-03-26

Abstract

The invention discloses an optimization method for efficient parallel FFT implementation on the Loongson 3 processor. The method uses radix-2 butterfly computation and comprises the following steps: first, the initialization parameters are set; second, the number of stages of the FFT is obtained; third, all twiddle factors are obtained; fourth, the source vector is divided into subvectors and it is determined whether blocking is required; fifth, blocking is carried out. The method addresses the low speedup of existing parallel FFT algorithms on the Loongson 3 processor and achieves an efficient parallel FFT implementation on that processor.

Description

An optimization method for efficient parallel FFT implementation on the Loongson 3 processor

Technical field

The invention belongs to the technical field of electric digital data processing, and specifically relates to an optimization method for efficient parallel FFT implementation on the Loongson 3 processor.

Background technology

The Loongson 3 processor is a high-performance general-purpose RISC processor developed domestically by the Institute of Computing Technology of the Chinese Academy of Sciences. It is based on the MIPS instruction set and offers high integration, high performance, low power consumption, and low cost. The Loongson 3 family comprises the quad-core Loongson 3A and the eight-core Loongson 3B, and mainly targets high-performance computing and high-end servers. The fast Fourier transform (FFT) is one of the most effective algorithms in computer systems and digital systems applications, and is widely used in fields such as speech signal processing, image processing, power spectrum estimation, and radar signal processing. FFT algorithms are both compute-intensive and memory-intensive, and are often used as parallel test benchmarks, for example in the HPC and NAS benchmarks. The parallel FFT algorithms in practical use today are not specifically optimized for the Loongson 3 processor, so simply porting a general parallel FFT algorithm to the Loongson 3 processor does not yield a good speedup.

Summary of the invention

To overcome the shortcomings of the prior art described above, the present invention provides an optimization method for efficient parallel FFT implementation on the Loongson 3 processor. It addresses the low speedup of existing parallel FFT algorithms on the Loongson 3 processor and achieves an efficient parallel FFT implementation on that processor.

The present invention adopts the following technical solution to solve the above technical problem:

The optimization method for efficient parallel FFT implementation on the Loongson 3 processor according to the present invention uses radix-2 butterfly computation and proceeds as follows:

Step 1: set the initialization parameters, namely the length N of the source vector, the number of cores p of the Loongson 3 processor, and the block length NB;

Step 2: obtain the number of stages S of the FFT using formula (1):

S = \log_2 N \qquad (1)

Step 3: obtain each twiddle factor W_N^k using formula (2):

W_N^k = e^{-j \frac{2\pi}{N} k} \qquad (2)

In formula (2), W_N^k denotes the k-th twiddle factor, with k in [1, N/2];

Step 4: divide the source vector of length N into p subvectors of length m;

If m is less than NB, perform the parallel FFT directly until all S stages of radix-2 butterfly computation are complete;

If m is not less than NB, proceed to step 5;

Step 5: partition each subvector into data blocks. The block length NB is set within the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);

The data blocks, each of block length NB, are assigned to the processor cores of the Loongson 3 processor for parallel FFT computation until log₂(p*NB) stages of radix-2 butterfly computation are complete, yielding an intermediate vector;

The intermediate vector is divided, by identical twiddle factor W_N^k, into p intermediate subvectors of length m, and the intermediate subvectors are assigned to the processor cores of the Loongson 3 processor for execution until the radix-2 butterfly computation from stage log₂(p*NB)+1 through stage S is complete, thereby completing the optimization method for efficient parallel FFT implementation.

Compared with the prior art, the beneficial effects of the present invention are as follows:

1. Compared with a general parallel FFT algorithm ported directly to the Loongson 3 processor, the present invention uses data blocking so that the data handled by each thread fits in the cache, which raises the cache hit rate and therefore improves memory access performance.

2. Because the present invention adopts an improved blocking scheme, the parallel FFT computation has a suitable block size for every length, and the memory access performance of the system improves significantly.

3. Because the present invention optimizes the data locality of the parallel FFT computation, cache misses are reduced and the program runs faster.

4. Testing and verification show that, on the quad-core Loongson 3A, the dual-core FFT reaches a maximum speedup of 2.49 and an average speedup of 2.16, and the four-core FFT reaches a maximum speedup of 4.20 and an average speedup of 3.75. On the eight-core Loongson 3B, the dual-core FFT reaches a maximum speedup of 2.38 and an average speedup of 2.11; the four-core FFT reaches a maximum speedup of 3.97 and an average speedup of 3.58; and the eight-core FFT reaches a maximum speedup of 7.00 and an average speedup of 6.14.

Embodiment

In the optimization method of the present invention, a parallel FFT algorithm based on shared-memory programming is first ported to and implemented on the Loongson 3A/3B. The source input vector is then divided into a number of smaller subvectors, so that the data block handled by each thread is between half the L1 cache size and half the L2 cache size. Next, for a given data length, performance tests are run for every block size between half the L1 cache size and half the L2 cache size, so that the most suitable block size can be chosen for any length. Finally, the parallel FFT algorithm is optimized using the principle of data locality.

In this embodiment, an improved parallel FFT algorithm based on OpenMP programming is first ported to and implemented on the Loongson 3 processor; the source vector is then partitioned with a block size of 32*1024; the improved blocking scheme is then used to obtain the most suitable block size for each length; finally, the parallel FFT algorithm is optimized according to the principle of data locality.

The optimization method for efficient parallel FFT implementation uses radix-2 butterfly computation and proceeds as follows:

Step 1: the Loongson 3 processor is a chip multiprocessor (CMP) whose memory structure uses a shared L2 cache. For this multicore memory architecture, OpenMP is used for parallel programming, and a parallel FFT algorithm based on the shared-memory model is implemented on the Loongson 3 processor;

First, the initialization parameters are set, namely the length N of the source vector, the number of cores p of the Loongson 3 processor, and the block length NB;

Step 2: obtain the number of stages S of the FFT using formula (1):

S = \log_2 N \qquad (1)
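Formula (1) can be evaluated with integer shifts, which avoids floating-point rounding. The following is a minimal sketch, assuming N is a power of two as radix-2 butterflies require; the function name `fft_stage_count` is hypothetical and does not come from the source.

```c
#include <assert.h>
#include <stddef.h>

/* Return S = log2(N) for a power-of-two length N, as in formula (1).
 * Hypothetical helper: the name is not taken from the source. */
unsigned fft_stage_count(size_t n)
{
    /* Radix-2 butterflies assume the length is a power of two. */
    assert(n != 0 && (n & (n - 1)) == 0);

    unsigned s = 0;
    while (n > 1) {
        n >>= 1;   /* each halving corresponds to one butterfly stage */
        ++s;
    }
    return s;
}
```

For the test lengths used in this embodiment, N = 2¹⁵ gives S = 15 and N = 2²⁵ gives S = 25.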

Step 3: obtain each twiddle factor W_N^k using formula (2):

W_N^k = e^{-j \frac{2\pi}{N} k} \qquad (2)

In formula (2), W_N^k denotes the k-th twiddle factor, with k in [1, N/2]. The core code is as follows:

#pragma omp parallel for
for (k = 0, 1, ..., N/2-1)         // compute the twiddle factors
    W_N^k = e^{-j (2π/N) k}
cbrev(x)                           // bit-reverse the input data

Step 4: divide the source vector of length N that takes part in the FFT into p subvectors of length m, that is, m = N/p;

If m is less than NB, the source vector does not need blocking, and the parallel FFT is performed directly until all S stages of radix-2 butterfly computation are complete;

If m is not less than NB, proceed to step 5;
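The decision in step 4 reduces to one division and one comparison. A minimal sketch follows, assuming p divides N exactly; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Subvector length m = N / p from step 4 (assumes p divides N). */
size_t subvector_length(size_t n, size_t p)
{
    assert(p != 0 && n % p == 0);
    return n / p;
}

/* True when step 5 (blocking) applies, i.e. when m is not less than NB.
 * False means the parallel FFT runs directly on the subvectors. */
bool needs_blocking(size_t n, size_t p, size_t nb)
{
    return subvector_length(n, p) >= nb;
}
```

With the block size 32*1024 used in this embodiment and p = 4, a vector of length 2¹⁵ gives m = 8192 and skips blocking, while a vector of length 2²⁰ gives m = 262144 and goes on to step 5.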

Step 5: partition each subvector into data blocks. The block length NB is set within the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);
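The candidate block lengths to benchmark can be enumerated from the two cache sizes. The sketch below makes two assumptions not stated in the source: NB is expressed in the same unit as the cache sizes passed in, and only power-of-two candidates are considered, which suits radix-2 transforms. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Range [a, b] for the block length NB: a is half the L1 cache size and
 * b is half the L2 cache size, in whatever unit the caller uses. */
typedef struct {
    size_t a;
    size_t b;
} nb_range;

nb_range block_length_range(size_t l1_size, size_t l2_size)
{
    assert(l1_size <= l2_size);
    nb_range r = { l1_size / 2, l2_size / 2 };
    return r;
}

/* Store the power-of-two block lengths inside [a, b] in out[] (at most
 * cap entries) and return how many were stored. Restricting candidates
 * to powers of two is an assumption of this sketch. */
size_t block_length_candidates(nb_range r, size_t *out, size_t cap)
{
    size_t count = 0;
    /* nb becomes 0 after the top bit is shifted out, ending the loop. */
    for (size_t nb = 1; nb != 0 && nb <= r.b; nb <<= 1) {
        if (nb >= r.a && count < cap)
            out[count++] = nb;
    }
    return count;
}
```

As an illustration only, with cache sizes of 64*1024 and 4*1024*1024 (figures chosen for the example, not taken from the source), the range is [32768, 2097152] and contains seven power-of-two candidates. Each candidate would then be timed for a given N, and the fastest kept.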

The data blocks, each of block length NB, are assigned to the processor cores of the Loongson 3 processor for parallel FFT computation, so that each processor core is assigned data blocks of length NB, until log₂(p*NB) stages of radix-2 butterfly computation are complete, yielding an intermediate vector. The length of the intermediate vector is N. At this point the first phase of the FFT computation is complete. In this phase there is no communication between threads: the data processed in shared memory carries no data dependences, and the data handled by one thread is never data that another thread has processed or is about to process. The core code is as follows:

#pragma omp parallel for private(L,J,k,b,B,c,i,t)
for (L = 1, ..., log(m))
    id = omp_get_thread_num()        // index of the core currently executing
    b = 2^L; B = b/2; c = N/b        // c: number of identical twiddle factors; c = 1 means all are distinct
    for (J = 0, 1, ..., B)
        i = (J % b) * c              // twiddle factor index
        for (k = J; k < n; k = k + b)
            t = x(k + B + id*m) * W_N^i
            x(k + B + id*m) = x(k + id*m) - t
            x(k + id*m) = x(k + id*m) + t
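A runnable C reading of the first-phase listing might look like the sketch below. It is an interpretation, not the source's exact code, and rests on these assumptions: the parallel loop runs over the subvector index `id` rather than calling `omp_get_thread_num()`; J runs over 0 .. B-1 and k stays below m, so that every access remains inside the thread's own subvector; `x` is already in bit-reversed order; and `w` holds W_N^k for k = 0 .. N/2-1. The function name is hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Phase 1 sketch: each of the p subvectors of length m = n/p runs its
 * first log2(m) radix-2 stages independently, so threads never touch
 * each other's data. */
void fft_local_stages(double complex *x, const double complex *w,
                      size_t n, size_t p)
{
    assert(p != 0 && n % p == 0);
    const size_t m = n / p;

    #pragma omp parallel for
    for (long id = 0; id < (long)p; ++id) {
        double complex *sub = x + (size_t)id * m;     /* x(... + id*m) */

        for (size_t b = 2; b <= m; b <<= 1) {         /* b = 2^L */
            const size_t half = b / 2;                /* B in the listing */
            const size_t c = n / b;                   /* twiddle stride */

            for (size_t j = 0; j < half; ++j) {
                const double complex wj = w[j * c];   /* W_N^(J*c) */

                for (size_t k = j; k < m; k += b) {
                    const double complex t = sub[k + half] * wj;
                    sub[k + half] = sub[k] - t;
                    sub[k] = sub[k] + t;
                }
            }
        }
    }
}
```

Called with p = 1, the function performs all log₂N stages and so computes a complete transform of a bit-reversed input, which gives a convenient way to check it against a direct DFT.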

The intermediate vector is divided, by identical twiddle factor W_N^k, into p intermediate subvectors of length m, and the intermediate subvectors are assigned to the processor cores of the Loongson 3 processor for execution until the radix-2 butterfly computation from stage log₂(p*NB)+1 through stage S is complete. At this point the second phase of the parallel FFT computation is complete. In this phase, the data handled by each thread is the data produced by the threads in the first phase, and the data processed at each stage is different. The core butterfly code of this phase is the same as in the first phase, except that within each butterfly stage the butterflies that share a twiddle factor are processed by the same thread, which avoids repeated accesses to the twiddle factors and raises the cache hit rate. This completes the optimization method for efficient parallel FFT implementation.
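One way to realize the grouping by twiddle factor is to parallelize each remaining stage over the twiddle index, so that all butterflies using W_N^i are executed by a single thread. The sketch below is an assumed interpretation rather than the source's code: it starts at the stage whose butterfly span is 2m, following the loop bound log(m) of the first-phase listing, and the function name is hypothetical.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Phase 2 sketch: run the stages whose butterflies span more than one
 * subvector (span b = 2m, 4m, ..., n). Within a stage, the loop over
 * the twiddle index j is shared among threads, so every butterfly that
 * uses a given twiddle factor is handled by the same thread. */
void fft_cross_stages(double complex *x, const double complex *w,
                      size_t n, size_t p)
{
    assert(p != 0 && n % p == 0);
    const size_t m = n / p;

    for (size_t b = 2 * m; b <= n; b <<= 1) {   /* stages run in order */
        const size_t half = b / 2;
        const size_t c = n / b;                 /* twiddle stride */

        /* Different j values touch disjoint elements, so no races. */
        #pragma omp parallel for
        for (long j = 0; j < (long)half; ++j) {
            const double complex wj = w[(size_t)j * c];  /* loaded once */

            for (size_t k = (size_t)j; k < n; k += b) {
                const double complex t = x[k + half] * wj;
                x[k + half] = x[k] - t;
                x[k] = x[k] + t;
            }
        }
    }
}
```

Under a static schedule each thread handles half/p twiddle indices with n/b butterflies each, which amounts to m elements per thread and matches the intermediate subvector length given in the text.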

With the above optimization techniques of the present invention applied together, tests were run on the Loongson 3 architecture with the 3A central processing unit (CPU) clocked at 900 MHz and the 3B CPU clocked at 800 MHz, for test lengths N = 2¹⁵, 2¹⁶, ..., 2²⁵. The experimental results show that, after optimization with the method of the invention, on the quad-core Loongson 3A the dual-core FFT reaches a maximum speedup of 2.49 and an average speedup of 2.16, and the four-core FFT reaches a maximum speedup of 4.20 and an average speedup of 3.75. On the eight-core Loongson 3B, the dual-core FFT reaches a maximum speedup of 2.38 and an average speedup of 2.11; the four-core FFT reaches a maximum speedup of 3.97 and an average speedup of 3.58; and the eight-core FFT reaches a maximum speedup of 7.00 and an average speedup of 6.14.
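The reported speedups can also be read as parallel efficiency, that is, speedup divided by the number of cores. The worked figures below use only the numbers quoted above; the symbol E_p is introduced here for illustration and does not appear in the source.

```latex
E_p = \frac{\text{speedup}_p}{p}

\text{Loongson 3A, } p = 4:\quad
E_4^{\max} = \frac{4.20}{4} = 1.05, \qquad
E_4^{\text{avg}} = \frac{3.75}{4} = 0.9375

\text{Loongson 3B, } p = 8:\quad
E_8^{\max} = \frac{7.00}{8} = 0.875, \qquad
E_8^{\text{avg}} = \frac{6.14}{8} = 0.7675
```

An efficiency above 1 means the measured speedup exceeds the core count. The source does not state the baseline used for the measurements, so these ratios are best compared with one another rather than against an ideal of 1.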

Claims

1. An optimization method for efficient parallel FFT implementation on the Loongson 3 processor, characterized in that the method uses radix-2 butterfly computation and proceeds as follows:

S = \log_2 N \qquad (1)

Step 3: obtain each twiddle factor W_N^k using formula (2):

W_N^k = e^{-j \frac{2\pi}{N} k} \qquad (2)

In formula (2), W_N^k denotes the k-th twiddle factor, with k in [1, N/2];

If m is not less than NB, proceed to step 5;

The intermediate vector is divided, by identical twiddle factor