CN103678255A - FFT efficient parallel achieving optimizing method based on Loongson number three processor - Google Patents


Publication number
CN103678255A
CN103678255A CN201310689271.8A CN201310689271A CN103678255A CN 103678255 A CN103678255 A CN 103678255A CN 201310689271 A CN201310689271 A CN 201310689271A CN 103678255 A CN103678255 A CN 103678255A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310689271.8A
Other languages
Chinese (zh)
Inventor
顾乃杰
江国荐
任开新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HEFEI COMJAY INFORMATION TECHNOLOGY CO., LTD.
Original Assignee
HEFEI YOURUAN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by HEFEI YOURUAN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310689271.8A
Publication of CN103678255A

Abstract

The invention discloses an efficient parallel FFT implementation and optimization method for the Loongson 3 processor. The method uses radix-2 butterfly computation and comprises the following steps: first, the initial parameters are set; second, the number of stages of the FFT is obtained; third, all twiddle factors are computed; fourth, the source vector is divided into subvectors and it is decided whether blocking is required; fifth, blocking is carried out. The method addresses the low speedup of existing parallel FFT algorithms on the Loongson 3 processor and achieves an efficient parallel FFT on that processor.

Description

An efficient parallel FFT implementation and optimization method based on the Loongson 3 processor
Technical field
The invention belongs to the technical field of electric digital data processing, and specifically relates to an efficient parallel FFT implementation and optimization method on the Loongson 3 processor.
Background
The Loongson 3 processor is a domestically developed, high-performance, general-purpose RISC processor from the Institute of Computing Technology, Chinese Academy of Sciences. It is based on the MIPS instruction set and offers high integration, high performance, low power consumption, and low cost. The Loongson 3 family comprises the quad-core Loongson 3A and the eight-core Loongson 3B, and mainly targets high-performance computing and high-end servers. The fast Fourier transform (FFT) is one of the most effective algorithms in computer systems and digital system applications, and is widely used in speech signal processing, image processing, power spectrum estimation, radar signal processing, and other fields. The FFT is both compute-intensive and memory-intensive, and is often used as a parallel benchmark for HPC and NAS. Parallel FFT algorithms in practical use today are not specifically optimized for the Loongson 3 processor, so simply porting a general parallel FFT to the Loongson 3 processor does not yield a good speedup.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides an efficient parallel FFT implementation and optimization method based on the Loongson 3 processor. It addresses the low speedup of existing parallel FFT algorithms on the Loongson 3 processor and achieves an efficient parallel FFT on that processor.
The present invention adopts the following scheme to solve the above technical problem:
The efficient parallel FFT implementation and optimization method based on the Loongson 3 processor uses radix-2 butterfly computation and proceeds as follows:
Step 1: set the initialization parameters, namely the length N of the source vector, the number of cores p of the Loongson 3 processor, and the block length NB;
Step 2: use formula (1) to obtain the number of stages S of the FFT:
$S = \log_2 N$ (1)
Step 3: use formula (2) to obtain each twiddle factor $W_N^k$:
$W_N^k = e^{-j \frac{2\pi}{N} k}$ (2)
In formula (2), $W_N^k$ denotes the k-th twiddle factor, with k in [1, N/2];
Step 4: divide the source vector of length N into p subvectors of length m;
If m is less than NB, perform the parallel FFT directly until all S stages of radix-2 butterflies are complete;
If m is not less than NB, go to step 5;
Step 5: partition each subvector into data blocks. The block length NB is chosen from the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);
The data blocks of length NB are assigned to the processor cores of the Loongson 3 processor for parallel FFT computation until $\log_2(p \cdot NB)$ stages of radix-2 butterflies are complete, which yields an intermediate vector;
The intermediate vector is divided, according to identical twiddle factors $W_N^k$, into p intermediate subvectors of length m. These are assigned to the processor cores of the Loongson 3 processor until the radix-2 butterflies from stage $\log_2(p \cdot NB) + 1$ through stage S are complete, which completes the efficient parallel FFT implementation and optimization method.
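As a rough illustration of steps 1, 2, and 4, the following Python sketch computes the stage count and decides whether the blocking of step 5 applies. The function name `plan_fft` and the returned dictionary are hypothetical and not part of the patent; the sketch also assumes that N is a power of two, that p divides N, and that m = N/p.

```python
def plan_fft(N, p, NB):
    """Hypothetical helper: derive S and m, and decide whether step 5 is needed."""
    if N <= 0 or N & (N - 1):
        raise ValueError("N must be a power of two for a radix-2 FFT")
    if N % p:
        raise ValueError("p must divide N")
    S = N.bit_length() - 1   # S = log2(N), formula (1)
    m = N // p               # length of each subvector (step 4), assuming m = N/p
    blocked = m >= NB        # "m not less than NB" means step 5 is executed
    return {"S": S, "m": m, "blocked": blocked}


# Example: a 2^20-point transform on 4 cores with NB = 32*1024
print(plan_fft(N=2**20, p=4, NB=32 * 1024))
```

With these inputs each subvector holds 262144 elements, which is not less than NB, so the blocked path is taken.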
Compared with the prior art, the beneficial effects of the present invention are as follows:
1. Compared with a general parallel FFT ported directly to the Loongson 3 processor, the present invention blocks the data so that the data processed by each thread fits in cache, which raises the cache hit rate and improves memory access performance.
2. Because the present invention adopts an improved blocking scheme, the parallel FFT uses a suitable block size at every length, and the memory access performance of the system improves markedly.
3. Because the present invention optimizes the data locality of the parallel FFT, cache misses are reduced and the program runs faster.
4. Testing shows that on the quad-core Loongson 3A the maximum speedup of the dual-core FFT is 2.49 and the average speedup is 2.16; with four cores the maximum speedup is 4.20 and the average is 3.75. On the eight-core Loongson 3B the maximum speedup of the dual-core FFT is 2.38 and the average is 2.11; with four cores the maximum is 3.97 and the average is 3.58; with eight cores the maximum is 7.00 and the average is 6.14.
Embodiment
The method first ports and implements a shared-memory parallel FFT on the Loongson 3A/3B. It then divides the source input vector into smaller subvectors so that the data block handled by each thread is between half the L1 cache size and half the L2 cache size. Next, for a given data length, it runs performance tests with every block size between half the L1 cache and half the L2 cache, and in this way selects the most suitable block size for any length. Finally, it applies the principle of data locality to optimize the parallel FFT.
In the present embodiment, an improved OpenMP-based parallel FFT is first ported to the Loongson 3 processor; the source vector is then partitioned with a block size of 32*1024; the improved blocking scheme is then used to obtain the most suitable block size for each length; finally, the parallel FFT is optimized according to the principle of data locality.
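The block-size search described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, candidates are restricted to powers of two (the source tests "all" sizes in the range without saying how they are enumerated), and cache sizes and NB are taken to be in the same unit. The 64 KB and 4 MB figures in the usage line are illustrative values chosen so that a = 32*1024 matches the block size of the embodiment; the source does not state the cache sizes.

```python
import time


def candidate_block_lengths(l1_size, l2_size):
    """Powers of two NB with l1_size/2 <= NB <= l2_size/2, i.e. the range [a, b]."""
    a, b = l1_size // 2, l2_size // 2
    nb = 1
    while nb < a:
        nb *= 2
    out = []
    while nb <= b:
        out.append(nb)
        nb *= 2
    return out


def _timed(run, nb):
    start = time.perf_counter()
    run(nb)
    return time.perf_counter() - start


def pick_block_length(run, candidates, repeats=3):
    """Time run(NB) for every candidate and return the fastest NB."""
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        t = min(_timed(run, nb) for _ in range(repeats))
        if t < best_t:
            best_nb, best_t = nb, t
    return best_nb


# Assumed sizes: 64 KB L1 and 4 MB L2, which give a = 32*1024 and b = 2*1024*1024
print(candidate_block_lengths(64 * 1024, 4 * 1024 * 1024))
```

In practice `run` would execute the blocked parallel FFT for one fixed length N, and the search would be repeated for each length of interest.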
The efficient parallel FFT implementation and optimization method uses radix-2 butterfly computation and proceeds as follows:
Step 1: the Loongson 3 processor is a chip multiprocessor (CMP) whose cores share the L2 cache. For this multi-core memory architecture, OpenMP is used for parallel programming, and a parallel FFT based on the shared-memory model is implemented on the Loongson 3 processor;
First, set the initialization parameters, namely the length N of the source vector, the number of cores p of the Loongson 3 processor, and the block length NB;
Step 2: use formula (1) to obtain the number of stages S of the FFT:
$S = \log_2 N$ (1)
Step 3: use formula (2) to obtain each twiddle factor $W_N^k$:
$W_N^k = e^{-j \frac{2\pi}{N} k}$ (2)
In formula (2), $W_N^k$ denotes the k-th twiddle factor, with k in [1, N/2]. The core code is as follows:
#pragma omp parallel for
for (k = 0; k < N/2; k++)                   // compute the twiddle factors
    W[k] = cexp(-I * 2.0 * M_PI * k / N);   // W_N^k = e^{-j 2 pi k / N}
cbrev(x);                                   // bit-reverse the input data
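For readers who want to check the twiddle table and the bit-reversal step on a small input, here is a minimal Python sketch. The function names `twiddles` and `bit_reverse` are hypothetical stand-ins (the source only names `cbrev`), and the index range k = 0 .. N/2 - 1 follows the loop in the core code.

```python
import cmath


def twiddles(N):
    """W_N^k = exp(-2*pi*j*k/N) for k = 0 .. N/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]


def bit_reverse(x):
    """Return x permuted into bit-reversed index order (len(x) must be a power of two)."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        # Reverse the `bits`-bit binary representation of i
        r = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[r] = x[i]
    return out


print(bit_reverse(list(range(8))))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

After this permutation a decimation-in-time radix-2 FFT can run its butterflies in place on contiguous blocks, which is what makes the per-core partition of the next steps possible.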
Step 4: divide the source vector of length N that takes part in the FFT into p subvectors of length m, i.e. m = N/p.
If m is less than NB, the source vector needs no blocking and the parallel FFT is performed directly until all S stages of radix-2 butterflies are complete;
If m is not less than NB, go to step 5;
Step 5: partition each subvector into data blocks. The block length NB is chosen from the range [a, b], where a is half the size of the level-1 cache (L1 cache) and b is half the size of the level-2 cache (L2 cache);
The data blocks of length NB are assigned to the processor cores of the Loongson 3 processor for parallel FFT computation, so that each core is assigned data blocks of length NB, until $\log_2(p \cdot NB)$ stages of radix-2 butterflies are complete, which yields an intermediate vector of length N. This completes the first phase of the FFT computation. In this phase the threads do not communicate: there are no data dependences on the shared memory being processed, and the data handled by one thread has neither been processed nor is about to be processed by another thread. The core code is as follows:
#pragma omp parallel private(id, L, J, k, b, B, c, i, t)
{
    id = omp_get_thread_num();              // index of the core executing this thread
    for (L = 1; (1 << L) <= m; L++) {       // stages 1 .. log2(m)
        b = 1 << L; B = b / 2; c = N / b;   // c: number of identical twiddle factors; c = 1 means all are distinct
        for (J = 0; J < B; J++) {
            i = (J % b) * c;                // twiddle factor index
            for (k = J; k < m; k += b) {
                t = x[k + B + id*m] * W[i];
                x[k + B + id*m] = x[k + id*m] - t;
                x[k + id*m]     = x[k + id*m] + t;
            }
        }
    }
}
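The first phase can be tried out with a short serial Python sketch. It follows the loop bound of the core code (log2(m) stages inside each block of length m = N/p) and is not the patented implementation: the name `local_butterflies` is hypothetical, the per-core loop runs serially instead of on OpenMP threads, and the input is assumed to be already in bit-reversed order.

```python
import cmath


def local_butterflies(x, N, p):
    """Run log2(m) radix-2 stages inside each of the p blocks of length m = N/p.

    Block `core` only touches x[core*m : (core+1)*m], so the p iterations of
    the outer loop are independent and could each run on its own thread.
    """
    m = N // p
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    for core in range(p):              # one iteration per processor core
        base = core * m
        b = 2
        while b <= m:                  # stages 1 .. log2(m)
            B, c = b // 2, N // b      # half-span and twiddle stride
            for J in range(B):
                w = W[J * c]           # W_N^(J*N/b), which equals W_b^J
                for k in range(J, m, b):
                    t = x[base + k + B] * w
                    x[base + k + B] = x[base + k] - t
                    x[base + k] = x[base + k] + t
            b *= 2
    return x
```

Because no butterfly crosses a block boundary in these stages, each block ends up holding the m-point DFT of one decimated subsequence of the input, and no synchronization between cores is needed until the remaining stages begin.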
The intermediate vector is divided, according to identical twiddle factors $W_N^k$, into p intermediate subvectors of length m. These are assigned to the processor cores of the Loongson 3 processor until the radix-2 butterflies from stage $\log_2(p \cdot NB) + 1$ through stage S are complete. This completes the second phase of the parallel FFT computation. In this phase the data handled by each thread is the output that the threads produced in the first phase, and the data processed at each stage is different. The core butterfly code of this phase is the same as in the first phase, except that within each stage the butterflies that share a twiddle factor are handled by the same thread, which avoids repeated accesses to the twiddle factors and raises the cache hit rate. This achieves the efficient parallel FFT implementation and optimization method.
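One way to picture the second phase is the serial Python sketch below, which finishes the stages after log2(m) and assigns all butterflies that share a twiddle factor to the same worker. It is an interpretation under stated assumptions rather than the patented code: the name `remaining_stages` is hypothetical, twiddle indices are split among workers in contiguous chunks, m >= p is required so that the chunks are non-empty, and the worker loop runs serially where a real implementation would use threads with a barrier between stages.

```python
import cmath


def remaining_stages(x, N, p):
    """Finish stages log2(m)+1 .. log2(N) on the output of the per-block phase."""
    m = N // p
    if m < p:
        raise ValueError("this sketch assumes m >= p")
    W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]
    b = 2 * m
    while b <= N:
        B, c = b // 2, N // b
        chunk = B // p                 # twiddle indices handled per worker
        for w in range(p):             # independent per worker; serial here
            for J in range(w * chunk, (w + 1) * chunk):
                tw = W[J * c]          # loaded once, reused for all c butterflies
                for k in range(J, N, b):
                    t = x[k + B] * tw
                    x[k + B] = x[k] - t
                    x[k] = x[k] + t
        b *= 2
    return x
```

In every stage each worker touches (B/p) * c * 2 = N/p = m elements, which matches the intermediate subvector length m given in the text, and butterflies with different J never share an element, so workers do not conflict within a stage.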
With the above optimizations combined, tests were run on the Loongson 3 architecture with the 3A central processing unit (CPU) clocked at 900 MHz and the 3B CPU clocked at 800 MHz, for test lengths $N = 2^{15}, 2^{16}, \ldots, 2^{25}$. The experimental results show that, with the method of the invention applied, on the quad-core Loongson 3A the maximum speedup of the dual-core FFT is 2.49 and the average speedup is 2.16; with four cores the maximum speedup is 4.20 and the average is 3.75. On the eight-core Loongson 3B the maximum speedup of the dual-core FFT is 2.38 and the average is 2.11; with four cores the maximum is 3.97 and the average is 3.58; with eight cores the maximum is 7.00 and the average is 6.14.
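The reported speedups can be turned into parallel efficiency (speedup divided by core count) with a few lines of Python. The numbers are copied from the text; the `efficiency` helper and the table layout are illustrative only, and the source does not state which baseline the speedups are measured against.

```python
# (platform, cores) -> (maximum speedup, average speedup), as reported in the text
reported = {
    ("3A", 2): (2.49, 2.16),
    ("3A", 4): (4.20, 3.75),
    ("3B", 2): (2.38, 2.11),
    ("3B", 4): (3.97, 3.58),
    ("3B", 8): (7.00, 6.14),
}


def efficiency(speedup, cores):
    """Parallel efficiency: speedup per core."""
    return speedup / cores


for (chip, cores), (s_max, s_avg) in reported.items():
    print(f"{chip}, {cores} cores: max {efficiency(s_max, cores):.2f}, "
          f"avg {efficiency(s_avg, cores):.2f}")
```

Values above 1.0, such as the dual-core cases, mean the speedup exceeds the core count; that would be consistent with the cache-blocking benefit the text describes, although the source does not analyze it.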

Claims (1)

1. An efficient parallel FFT implementation and optimization method based on the Loongson 3 processor, characterized in that the method uses radix-2 butterfly computation and proceeds as follows:
Step 1: set the initialization parameters, said initialization parameters being the length N of the source vector, the number of cores p of the Loongson 3 processor, and the block length NB;
Step 2: use formula (1) to obtain the number of stages S of said FFT:
$S = \log_2 N$ (1)
Step 3: use formula (2) to obtain each twiddle factor $W_N^k$:
$W_N^k = e^{-j \frac{2\pi}{N} k}$ (2)
In formula (2), $W_N^k$ denotes the k-th twiddle factor, with k in [1, N/2];
Step 4: divide the source vector of length N into p subvectors of length m;
If m is less than NB, perform the parallel FFT directly until all S stages of radix-2 butterflies are complete;
If m is not less than NB, go to step 5;
Step 5: partition each said subvector into data blocks, the block length NB being chosen from the range [a, b], where said a is half the size of the level-1 cache (L1 cache) and said b is half the size of the level-2 cache (L2 cache);
The data blocks of length NB are assigned to the processor cores of said Loongson 3 processor for parallel FFT computation until $\log_2(p \cdot NB)$ stages of radix-2 butterflies are complete, which yields an intermediate vector;
Said intermediate vector is divided, according to identical twiddle factors $W_N^k$, into p intermediate subvectors of length m, and said intermediate subvectors are assigned to the processor cores of said Loongson 3 processor until the radix-2 butterflies from stage $\log_2(p \cdot NB) + 1$ through stage S are complete, which completes the efficient parallel FFT implementation and optimization method.
CN201310689271.8A 2013-12-16 2013-12-16 FFT efficient parallel achieving optimizing method based on Loongson number three processor Pending CN103678255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310689271.8A CN103678255A (en) 2013-12-16 2013-12-16 FFT efficient parallel achieving optimizing method based on Loongson number three processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310689271.8A CN103678255A (en) 2013-12-16 2013-12-16 FFT efficient parallel achieving optimizing method based on Loongson number three processor

Publications (1)

Publication Number Publication Date
CN103678255A true CN103678255A (en) 2014-03-26

Family

ID=50315868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310689271.8A Pending CN103678255A (en) 2013-12-16 2013-12-16 FFT efficient parallel achieving optimizing method based on Loongson number three processor

Country Status (1)

Country Link
CN (1) CN103678255A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1217505A (en) * 1997-11-06 1999-05-26 中国科学院计算技术研究所 Address mapping technique and apparatus in high-speed buffer storage device
WO2003034269A1 (en) * 2001-10-12 2003-04-24 Pts Corporation Method of performing a fft transform on parallel processors
EP1426872A2 (en) * 2002-12-03 2004-06-09 STMicroelectronics Ltd. Linear scalable FFT/IFFT computation in a multi-processor system
CN101211333A (en) * 2006-12-30 2008-07-02 北京邮电大学 Signal processing method, device and system
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102375805A (en) * 2011-10-31 2012-03-14 中国人民解放军国防科学技术大学 Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)
CN102799564A (en) * 2012-06-28 2012-11-28 电子科技大学 Fast fourier transformation (FFT) parallel method based on multi-core digital signal processor (DSP) platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张燕燕等: ""windows环境下FFT多核并行算法的设计实现"", 《计算机技术与发展》 *
郭利财等: ""龙芯3A处理器上FFT的高效实现"", 《小型微型计算机系统》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902506A (en) * 2014-04-16 2014-07-02 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN103902506B (en) * 2014-04-16 2017-02-15 中国科学技术大学先进技术研究院 FFTW3 optimization method based on loongson 3B processor
CN109829132A (en) * 2019-01-21 2019-05-31 东南大学 The quick spectral analysis method of long data sequence under a kind of embedded environment
CN115080503A (en) * 2022-07-28 2022-09-20 中国人民解放军63921部队 Systolic array reconfigurable processor aiming at FFT (fast Fourier transform) base module mapping

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: HEFEI KANGJIE INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: HEFEI YOURUAN INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20140619

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 230031 HEFEI, ANHUI PROVINCE TO: 230088 HEFEI, ANHUI PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20140619

Address after: 230088 A-3 R & D building, No. 800 Wangjiang West Road, Hefei hi tech Zone, Anhui, 809

Applicant after: HEFEI COMJAY INFORMATION TECHNOLOGY CO., LTD.

Address before: 301, room 616, 230031 Mount Huangshan Road, Hefei hi tech Zone, Anhui, China

Applicant before: HEFEI YOURUAN INFORMATION TECHNOLOGY CO., LTD.

RJ01 Rejection of invention patent application after publication

Application publication date: 20140326
