Summary of the invention
The present invention aims to overcome the shortcomings of the prior art described above by providing an optimization method for an efficient parallel FFT implementation on the Godson-3 processor. It addresses the low speedup of existing parallel FFT implementations on the Godson-3 processor and achieves an efficient parallel FFT on that processor.
To solve the above technical problem, the present invention adopts the following scheme:
The optimization method of the present invention for an efficient parallel FFT implementation on the Godson-3 processor uses radix-2 butterfly computation and proceeds as follows:
Step 1: Set the initialization parameters, namely the length N of the source vector, the number of cores p of the Godson-3 processor, and the block length NB;
Step 2: Use formula (1) to obtain the number of stages S of the FFT:
$S = \log_2 N$ (1)
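As a small illustration of formula (1), the following C sketch computes S for a power-of-two length. The function name `fft_stages` and the convention of returning -1 for lengths that are not powers of two are assumptions made for this example; they are not part of the method.

```c
#include <stddef.h>

/* Number of radix-2 butterfly stages, S = log2(N), for a power-of-two N.
   Returns -1 when N is not a power of two (hypothetical error convention). */
int fft_stages(size_t n)
{
    if (n == 0 || (n & (n - 1)) != 0)
        return -1;
    int s = 0;
    while (n > 1) {
        n >>= 1;   /* each halving corresponds to one butterfly stage */
        s++;
    }
    return s;
}
```

For instance, a source vector of length 1024 would need 10 stages under this definition.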
Step 3: Use formula (2) to obtain each twiddle factor $W_N^k$:
$W_N^k = e^{-j 2\pi k / N}$ (2)
In formula (2), $W_N^k$ denotes the k-th twiddle factor, and k belongs to [1, N/2];
Step 4: Divide the source vector of length N into p subvectors of length m;
If m is less than NB, perform the parallel FFT directly until all S stages of radix-2 butterfly computation are complete;
If m is not less than NB, go to Step 5;
Step 5: Partition each subvector into data blocks. The block length NB is chosen from the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);
The data blocks of length NB are assigned to the cores of the Godson-3 processor, which perform the parallel FFT on them until $\log_2(p \times NB)$ stages of radix-2 butterfly computation are complete, yielding an intermediate vector;
The intermediate vector is divided, by identical twiddle factor $W_N^k$, into p intermediate subvectors of length m, which are assigned to the cores of the Godson-3 processor and processed from stage $\log_2(p \times NB) + 1$ through stage S of the radix-2 butterfly computation, which completes the optimization method for the efficient parallel FFT implementation.
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. Compared with a general parallel FFT ported directly to the Godson-3 processor, the present invention partitions the data into blocks so that the data handled by each thread fit in the cache, which raises the cache hit rate and thereby improves memory access performance.
2. Because the present invention adopts an improved blocking scheme, the parallel FFT computation has a suitable block size at every length, and the memory access performance of the system improves significantly.
3. Because the present invention applies data locality optimization to the parallel FFT computation, cache misses are reduced and the program runs faster.
4. Testing and verification show that on the quad-core Godson 3A, the two-core FFT reaches a maximum speedup of 2.49 and an average speedup of 2.16, and the four-core FFT reaches a maximum of 4.20 and an average of 3.75. On the eight-core Godson 3B, the two-core FFT reaches a maximum speedup of 2.38 and an average of 2.11; the four-core FFT reaches a maximum of 3.97 and an average of 3.58; and the eight-core FFT reaches a maximum of 7.00 and an average of 6.14.
Embodiment
In the optimization method of the present invention for an efficient parallel FFT implementation on the Godson-3 processor, a parallel FFT based on shared-memory programming is first ported to and implemented on the Godson 3A/3B. The source input vector is then divided into several smaller subvectors so that the data block handled by each thread is between half the L1 cache size and half the L2 cache size. Next, for a given data length, performance is measured for every candidate block size between half the L1 cache size and half the L2 cache size, so that the most suitable block size can be chosen for any length. Finally, the principle of data locality is used to optimize the parallel FFT.
In the present embodiment, an improved OpenMP-based parallel FFT is first ported to the Godson-3 processor. The source vector is then partitioned with a block size of 32*1024. The improved blocking scheme is then used to obtain the most suitable block size for each length. Finally, the parallel FFT is further optimized according to the principle of data locality.
The optimization method for the efficient parallel FFT implementation uses radix-2 butterfly computation and proceeds as follows:
Step 1: The Godson-3 processor is a chip multiprocessor (CMP) whose cores share the L2 cache. For this multi-core memory architecture, OpenMP is used for parallel programming, and a parallel FFT based on the shared-memory model is implemented on the Godson-3 processor;
First, the initialization parameters are set: the length N of the source vector, the number of cores p of the Godson-3 processor, and the block length NB;
Step 2: Use formula (1) to obtain the number of stages S of the FFT:
$S = \log_2 N$ (1)
Step 3: Use formula (2) to obtain each twiddle factor $W_N^k$:
$W_N^k = e^{-j 2\pi k / N}$ (2)
In formula (2), $W_N^k$ denotes the k-th twiddle factor, and k belongs to [1, N/2]. The core code is as follows:
    #pragma omp parallel for
    for (k = 0, 1, ..., N/2-1)          // compute the twiddle factors
        W(k) = cexp(-I*2*pi*k/N)
    cbrev(x)                            // reorder the input data into bit-reversed order
Step 4: Divide the source vector of length N that takes part in the FFT into p subvectors of length m, that is, m = N/p;
If m is less than NB, the source vector needs no blocking, and the parallel FFT is performed directly until all S stages of radix-2 butterfly computation are complete;
If m is not less than NB, go to Step 5;
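The decision in Step 4 can be sketched as a small helper. The struct `partition_plan` and the function `plan_partition` are hypothetical names introduced for this example, and N is assumed to be divisible by p.

```c
#include <stddef.h>

/* Hypothetical result type: subvector length and whether Step 5 applies. */
typedef struct {
    size_t m;            /* subvector length per core, m = N / p */
    int    use_blocking; /* 1 when m >= NB, i.e. go to Step 5 */
} partition_plan;

/* Assumes n is divisible by p. */
partition_plan plan_partition(size_t n, size_t p, size_t nb)
{
    partition_plan plan;
    plan.m = n / p;
    plan.use_blocking = (plan.m >= nb);
    return plan;
}
```

With N = 2^15, p = 4 and NB = 32*1024, for example, m is 8192, which is below NB, so the transform would run directly without blocking; with N = 2^20 the same settings give m = 262144 and blocking applies.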
Step 5: Partition each subvector into data blocks. The block length NB is chosen from the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);
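A sketch of how the range [a, b] and the candidate block lengths might be derived is given below. Several things here are assumptions rather than statements from the source: the names `nb_range`, `nb_bounds` and `nb_candidates` are hypothetical, the cache sizes are taken to be expressed in the same unit as NB, and only power-of-two candidates are enumerated because the transform is radix-2.

```c
#include <stddef.h>

typedef struct {
    size_t a;   /* lower bound: half the L1 cache size */
    size_t b;   /* upper bound: half the L2 cache size */
} nb_range;

/* Cache sizes must be given in the same unit as NB (assumption). */
nb_range nb_bounds(size_t l1_size, size_t l2_size)
{
    nb_range r = { l1_size / 2, l2_size / 2 };
    return r;
}

/* Store the power-of-two block lengths inside [a, b] in out[];
   returns how many were stored (at most max_out). */
size_t nb_candidates(nb_range r, size_t *out, size_t max_out)
{
    size_t count = 0;
    for (size_t nb = 1; nb != 0 && nb <= r.b; nb <<= 1) {
        if (nb >= r.a && count < max_out)
            out[count++] = nb;
    }
    return count;
}
```

Taking example sizes of 64*1024 for L1 and 4*1024*1024 for L2 (illustrative values, not quoted from the source), the range is [32768, 2097152] and there are seven power-of-two candidates to benchmark; the block size 32*1024 used in the embodiment sits at the lower end of such a range.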
The data blocks of length NB are assigned to the cores of the Godson-3 processor for the parallel FFT, so that each core receives a data block of length NB, until $\log_2(p \times NB)$ stages of radix-2 butterfly computation are complete, yielding an intermediate vector of length N. This completes the first phase of the FFT computation. No communication between threads is needed in this phase; that is, there are no data dependences among the data processed in shared memory, and the data processed by one thread are never data that another thread has processed or is about to process. The core code is as follows:
    #pragma omp parallel private(L,J,k,b,B,c,i,t,id)
    for (L = 1, ..., log2(m))
        id = omp_get_thread_num()           // index of the core running this thread
        b = 2^L; B = b/2; c = N/b           // c: number of identical twiddle factors; c = 1 means all differ
        for (J = 0, 1, ..., B-1)
            i = (J % b) * c                 // index of the twiddle factor
            for (k = J; k < m; k = k + b)
                t = W(i) * x(k+B+id*m)
                x(k+B+id*m) = x(k+id*m) - t
                x(k+id*m)   = x(k+id*m) + t
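The first phase can be written as compilable C along the following lines. This is a sketch under stated assumptions, not the patented implementation: the function name `fft_phase1` is hypothetical, each core is given one contiguous subvector of length m as in the core code, the input `x` is assumed to be already bit-reversed, and `w[k]` is assumed to hold $W_N^k$ for k < N/2 with N = p*m. The loop over `id` stands in for `omp_get_thread_num()`, so the code also runs serially when OpenMP is disabled.

```c
#include <complex.h>
#include <stddef.h>

/* Phase 1: each core runs the first log2(m) butterfly stages on its own
   contiguous subvector, so no data are shared between threads. */
void fft_phase1(double complex *x, const double complex *w, size_t p, size_t m)
{
    size_t n = p * m;
    #pragma omp parallel for
    for (long id = 0; id < (long)p; id++) {
        double complex *sub = x + (size_t)id * m;   /* this core's subvector */
        for (size_t b = 2; b <= m; b <<= 1) {       /* b = 2^L, butterfly span */
            size_t half = b / 2;                    /* B in the core code */
            size_t c = n / b;                       /* twiddle index stride */
            for (size_t j = 0; j < half; j++) {
                double complex wj = w[j * c];
                for (size_t k = j; k < m; k += b) {
                    double complex t = wj * sub[k + half];
                    sub[k + half] = sub[k] - t;
                    sub[k] += t;
                }
            }
        }
    }
}
```

Because every butterfly in these stages spans at most m elements, all reads and writes of one iteration of the `id` loop stay inside its own subvector, which is what makes the phase communication-free.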
The intermediate vector is divided, by identical twiddle factor $W_N^k$, into p intermediate subvectors of length m, which are assigned to the cores of the Godson-3 processor and processed from stage $\log_2(p \times NB) + 1$ through stage S of the radix-2 butterfly computation. This completes the second phase of the parallel FFT computation. In this phase, the data handled by each thread are the data produced by the threads in the first phase, and the data handled at each stage are different. The core butterfly code of this phase is the same as in the first phase, except that in each stage the butterflies sharing a twiddle factor are processed by the same thread, which avoids repeated accesses to the twiddle factors and raises the cache hit rate. The efficient parallel FFT implementation is thereby achieved.
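One way to express the second phase in C is sketched below. The function name `fft_phase2` is hypothetical, the phase is assumed to start right after the log2(m) local stages of the first phase, and the grouping by twiddle factor is realized by parallelizing over the twiddle index `j` with a static schedule, so that every butterfly using `w[j * c]` is handled by the same thread. This is one plausible reading of the description, not a confirmed reproduction of the patent's code.

```c
#include <complex.h>
#include <stddef.h>

/* Phase 2: remaining stages, with butterfly span b = 2m, 4m, ..., N.
   x holds the phase-1 result; w[k] = W_N^k for k < N/2; N = p * m. */
void fft_phase2(double complex *x, const double complex *w, size_t p, size_t m)
{
    size_t n = p * m;
    for (size_t b = 2 * m; b <= n; b <<= 1) {
        size_t half = b / 2;
        size_t c = n / b;              /* butterflies sharing one twiddle */
        /* Each j owns one twiddle factor; a static schedule gives every
           thread a contiguous range of j, i.e. about N/p elements per stage. */
        #pragma omp parallel for schedule(static)
        for (long j = 0; j < (long)half; j++) {
            double complex wj = w[(size_t)j * c];   /* loaded once, reused c times */
            for (size_t k = (size_t)j; k < n; k += b) {
                double complex t = wj * x[k + half];
                x[k + half] = x[k] - t;
                x[k] += t;
            }
        }
        /* The implicit barrier at the end of the parallel loop separates stages. */
    }
}
```

Within one stage, different values of `j` touch disjoint elements, so no locking is needed; threads only synchronize between stages.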
With the above optimization techniques of the present invention applied together, tests were run on the Godson-3 architecture, with the central processing unit (CPU) of the 3A clocked at 900 MHz and that of the 3B at 800 MHz, for test lengths $N = 2^{15}, 2^{16}, \ldots, 2^{25}$. The experimental results show that, after optimization with the method of the present invention, on the quad-core Godson 3A the two-core FFT reaches a maximum speedup of 2.49 and an average speedup of 2.16, and the four-core FFT reaches a maximum of 4.20 and an average of 3.75. On the eight-core Godson 3B, the two-core FFT reaches a maximum speedup of 2.38 and an average of 2.11; the four-core FFT reaches a maximum of 3.97 and an average of 3.58; and the eight-core FFT reaches a maximum of 7.00 and an average of 6.14.
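To put the reported figures in perspective, speedup can be converted to parallel efficiency, defined in the usual way as speedup divided by the number of cores. The helper below is only an illustrative calculation; the name `parallel_efficiency` is hypothetical, and the baseline against which the speedups were measured is not stated in this section.

```c
/* Parallel efficiency = speedup / number of cores (standard definition). */
double parallel_efficiency(double speedup, int cores)
{
    return speedup / (double)cores;
}
```

Applied to the averages above, 3.75 on four cores of the Godson 3A corresponds to an efficiency of about 0.94, and 6.14 on eight cores of the Godson 3B to about 0.77.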