CN104699624A - FFT (fast Fourier transform) parallel computing-oriented conflict-free storage access method - Google Patents

FFT (fast Fourier transform) parallel computing-oriented conflict-free storage access method

Info

Publication number
CN104699624A
Authority
CN
China
Prior art keywords
operational data
address
fft
memory access
conflict-free
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510137874.6A
Other languages
Chinese (zh)
Other versions
CN104699624B (en)
Inventor
陈海燕
刘胜
陈书明
郭阳
燕世林
刘仲
万江华
陈胜刚
杨超
梁停雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201510137874.6A priority Critical patent/CN104699624B/en
Publication of CN104699624A publication Critical patent/CN104699624A/en
Application granted granted Critical
Publication of CN104699624B publication Critical patent/CN104699624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses an FFT (fast Fourier transform) parallel computing-oriented conflict-free storage access method. The method includes the steps of: 1, judging the structure of the current processor; if it is an SIMD (single instruction multiple data) structure, executing step 3; otherwise executing step 2; 2, configuring one storage group to store the computing data, the storage group comprising a plurality of parallel single-port memories; when executing the FFT computation, mapping the addresses of the data to be computed to two-dimensional conflict-free memory access addresses that identify the target memory and the address inside the target memory; 3, configuring a plurality of parallel storage groups to store the computing data, each storage group comprising a plurality of parallel single-port memories; when executing the FFT computation, mapping the addresses of the data to be computed to three-dimensional conflict-free memory access addresses that identify the target storage group, the target memory and the address inside the target memory. The method achieves conflict-free access for FFT parallel computing with high memory access efficiency and low hardware cost.

Description

Conflict-free memory access method for FFT parallel computation
Technical field
The present invention relates to the field of FFT computation in microprocessors, and in particular to a conflict-free memory access method for FFT parallel computation.
Background technology
FFT (Fast Fourier Transform) is a fast algorithm for computing the discrete Fourier transform (DFT), proposed by J. W. Cooley and J. W. Tukey in 1965. It is the core algorithm of many embedded applications such as wireless communication and image processing, and its performance often determines the real-time processing capability of the entire digital processing system. Evolving application demands place ever higher requirements on FFT performance, and the development of digital signal processor technology has made efficient programmable parallel FFT algorithms feasible.
At present there are two common ways to implement the FFT algorithm. The first is a dedicated FFT hardware accelerator, for example an FPGA-based design or an on-chip FFT hardware co-processor of a microprocessor, which accelerates only the FFT algorithm. The second is a software implementation programmed on the instruction set architecture of a general-purpose microprocessor or digital signal processor. The first approach has a limited application range, cannot adapt to changing requirements, is costly to realize in hardware and lacks flexibility. The second approach, being based on instruction-set programming, offers flexibility and generality, and with the progress of high-performance microprocessor technology it can now reach performance comparable to that of dedicated FFT hardware accelerators.
The DFT of an N-point sequence x(n) is defined as:
X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}        (1)
where 0 ≤ k < N, W_N = e^{-j2\pi/N} is the twiddle factor, and the sequence length N is assumed to be an integer power of 2.
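For reference, formula (1) can be evaluated directly in C as below. This is only an illustrative O(N^2) evaluation of the definition, not the FFT itself and not code from the patent:

```c
#include <complex.h>

/* Direct evaluation of formula (1): X(k) = sum_{n=0}^{N-1} x(n) * W_N^{n*k},
 * with W_N = exp(-j*2*pi/N).  Illustrative O(N^2) reference only. */
static void dft(const double complex *x, double complex *X, int N)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < N; k++) {
        double complex acc = 0;
        for (int n = 0; n < N; n++)
            acc += x[n] * cexp(-I * 2.0 * PI * n * k / N);
        X[k] = acc;
    }
}
```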
The radix-2 FFT algorithm exploits the symmetry, periodicity and reducibility of the twiddle factors to decompose the N-point DFT X(k), k = 0, 1, ..., N-1, into two N/2-point DFTs:
X(k) = \sum_{n=0}^{N/2-1} x(2n) W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x(2n+1) W_{N/2}^{nk}        (2)
Alternatively, separating X(k) by the parity of the frequency index, i.e. setting k = 2r and k = 2r + 1 with r = 0, 1, 2, ..., N/2-1, gives the decimation-in-frequency form:
X(2r) = \sum_{n=0}^{N/2-1} [x(n) + x(n+N/2)] W_{N/2}^{rn}
X(2r+1) = \sum_{n=0}^{N/2-1} [x(n) - x(n+N/2)] W_N^{n} W_{N/2}^{rn}        (3)
As long as N/2 is still even, the decomposition is continued in the same way until only 2-point DFTs remain.
Fig. 1 shows the radix-2 butterfly computation flow of a sequence X of N = 16 points: the 16-point radix-2 FFT is decomposed successively into 8-point, 4-point and 2-point DFTs. A radix-2 FFT of a length-N sequence requires log2 N stages, each of which processes all N data points in butterflies of identical structure. Within a stage the two operands of each butterfly are equally spaced, the spacing being N/2^j, where j = 1, 2, ..., log2 N is the stage index. The sum of the two operands is written back to the position of the first operand, and the product of their difference with the butterfly coefficient is written back to the position of the second operand. This property makes FFT computation very well suited to data-parallel processing and to vectorized computation on SIMD extension structures.
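The stage structure just described can be sketched in C as follows. This is a generic in-place radix-2 decimation-in-frequency kernel matching formula (3), with the output left in bit-reversed order; it is given only as an illustration, not as the patent's implementation:

```c
#include <complex.h>

/* In-place radix-2 DIF FFT skeleton: at stage j the operand spacing is
 * half = N/2^j, the sum goes back to the first operand's slot and the
 * twiddled difference back to the second, as in formula (3).
 * Output is in bit-reversed order (reordering omitted for brevity). */
static void fft_radix2_dif(double complex *x, int N)
{
    const double PI = 3.14159265358979323846;
    for (int half = N / 2; half >= 1; half /= 2) {            /* spacing N/2^j  */
        for (int start = 0; start < N; start += 2 * half) {   /* each sub-block */
            for (int n = 0; n < half; n++) {                  /* butterflies    */
                double complex a = x[start + n];
                double complex b = x[start + n + half];
                double complex w = cexp(-I * PI * n / (double)half); /* W_{2*half}^n */
                x[start + n]        = a + b;        /* sum -> first operand's slot  */
                x[start + n + half] = (a - b) * w;  /* twiddled diff -> second slot */
            }
        }
    }
}
```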
With the development of integrated circuit technology and rising performance requirements, the single instruction multiple data (SIMD) structure has become an important extension of high-performance microprocessors, and a single chip can integrate an ever larger number of functional units. Using a superscalar or very long instruction word (VLIW) structure, multiple functional units can operate on data in SIMD fashion and exploit more instruction-level and data-level parallelism, thereby achieving higher performance. To make full use of the multipliers and adders in the microprocessor's arithmetic units and to improve computational efficiency, high-performance microprocessors usually support parallel access operations with dual (or higher) access bandwidth. Apart from the coefficient constant, one FFT butterfly needs two operands to be delivered in a single cycle, so FFT computation has to use the microprocessor's dual access bandwidth to supply its operands.
Because a dual-port memory bank generally costs about twice the area and power of a single-port bank of the same capacity, and on-chip memory area and power are strictly constrained, on-chip storage is usually organized as a number of single-port banks equal to an integer power of 2 with low-order interleaved addressing, which provides dual access bandwidth at low area and power cost. However, because of the non-contiguous addresses of the operands and the symmetry of the FFT butterfly, every butterfly then suffers a parallel memory access conflict. In a SIMD extension structure in particular, memory access conflicts lower the utilization of the vector memory bandwidth, so the actual FFT efficiency falls significantly below the theoretical peak.
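The conflict can be seen with a small self-contained check in C (the bank count of 4 and the 16-point length are assumptions chosen only for the example): with a power-of-two number of low-order interleaved banks, any butterfly spacing that is a multiple of the bank count puts both operands into the same single-port bank, so the two accesses must be serialized.

```c
#include <assert.h>
#include <stdio.h>

/* With B = 4 low-order interleaved banks (bank = address mod B) and butterfly
 * spacings s = 8, 4 (powers of two >= B) in a 16-point FFT, both operands of
 * every butterfly select the same bank -- the parallel access conflict the
 * method is designed to remove. */
int main(void)
{
    const int N = 16, B = 4;
    for (int s = N / 2; s >= B; s /= 2)
        for (int n = 0; n + s < N; n++)
            assert(n % B == (n + s) % B);   /* same bank for both operands */
    puts("every butterfly pair at spacing >= B hits one bank");
    return 0;
}
```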
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the present invention provides a conflict-free memory access method for FFT parallel computation that is simple to implement, eliminates memory access conflicts in FFT parallel computation, achieves high memory access efficiency and incurs little hardware cost.
To solve the above technical problems, the technical solution proposed by the present invention is:
A conflict-free memory access method for FFT parallel computation, comprising the steps of:
1) determining the structure of the current processor; if it is a SIMD structure, proceeding to step 3); otherwise proceeding to step 2);
2) configuring one storage group to store the operand data, the storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a two-dimensional conflict-free memory access address, the two-dimensional conflict-free memory access address corresponding to the target memory bank holding the data and the address inside the target bank, and accessing the data according to the two-dimensional conflict-free memory access address;
3) configuring a plurality of storage groups to store the operand data, each storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a three-dimensional conflict-free memory access address, the three-dimensional conflict-free memory access address corresponding to the target storage group holding the data, the target memory bank and the address inside the target bank, and accessing the data according to the three-dimensional conflict-free memory access address.
As a further improvement of the present invention: in step 2), P of the plurality of parallel single-port memory banks are addressed in a low-order interleaved manner, and P is an odd number not less than 3; in step 3), P memory banks in each storage group are addressed in a low-order interleaved manner, and P is an odd number not less than 3.
As a further improvement of the present invention: in step 2), the linear address of the data to be operated on is mapped to the two-dimensional conflict-free memory access address (X, Y) according to the following formula;
where Y is the position of the target memory bank holding the data to be operated on, X is the row address of the data inside the target bank, Addr is the linear address of the data, W is the operand granularity, P is the number of memory banks using low-order interleaved addressing, mod denotes the modulo operation, and N is the FFT sequence length.
As a further improvement of the present invention: in step 3), the linear address of the data to be operated on is mapped to the three-dimensional conflict-free memory access address (X, Y, Z) according to the following formula;
where Y is the position of the target storage group holding the data to be operated on, Z is the position of the target memory bank within the target storage group, and X is the row address of the data inside the target bank; Addr is the linear address of the data, G is the SIMD width and is a positive integer power of 2, P is the number of low-order interleaved memory banks in each storage group, mod denotes the modulo operation, and N is the FFT sequence length.
Compared with the prior art, the present invention has the following advantages:
1) For dual-access microprocessors of non-SIMD and SIMD structure respectively, the present invention organizes the single-port memory banks into a multi-bank one-dimensional storage group or a two-dimensional storage group; when FFT computation is performed, the linear address of the data to be operated on is mapped to a two-dimensional or a three-dimensional conflict-free memory access address. This effectively eliminates memory access conflicts in FFT computation, realizes conflict-free parallel memory access for the FFT, and at the same time improves FFT efficiency.
2) For microprocessors with a SIMD extension structure, the present invention organizes the single-port memory banks into a two-dimensional storage array operated in SIMD fashion, which supports conflict-free memory access for the vectorized FFT parallel algorithm and thus significantly improves FFT efficiency.
3) In a non-SIMD structure the present invention maps the linear address of the data to be operated on to the corresponding memory bank and address inside the bank, and in a SIMD structure it maps the linear address to the corresponding storage group, memory bank and address inside the bank; only the way the memory access address is computed is changed, so the required hardware overhead is very small.
Accompanying drawing explanation
Fig. 1 is a schematic diagram of the radix-2 FFT butterfly computation for a sequence of length 16.
Fig. 2 is a schematic flow chart of the conflict-free memory access method for FFT parallel computation of this embodiment.
Fig. 3 is a schematic diagram of the memory bank organization under a non-SIMD structure in this embodiment.
Fig. 4 is a schematic diagram of the memory bank organization under a SIMD structure in this embodiment.
Fig. 5 is a schematic diagram of the memory bank organization under a non-SIMD structure in a specific embodiment of the present invention.
Fig. 6 is a schematic diagram of the memory bank organization under a SIMD structure in a specific embodiment of the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings and specific preferred embodiments, which, however, do not limit the scope of protection of the invention.
As shown in Fig. 2, the conflict-free memory access method for FFT parallel computation of this embodiment comprises the steps of:
1) determining the structure of the current processor; if it is a SIMD structure, proceeding to step 3); otherwise proceeding to step 2);
2) configuring one storage group to store the operand data, the storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a two-dimensional conflict-free memory access address, which corresponds to the target memory bank holding the data and the address inside the target bank, and accessing the data according to the two-dimensional conflict-free memory access address;
3) configuring a plurality of storage groups to store the operand data, each storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a three-dimensional conflict-free memory access address, which corresponds to the target storage group holding the data, the target memory bank and the address inside the target bank, and accessing the data according to the three-dimensional conflict-free memory access address.
In this embodiment, to meet the microprocessor's dual-access memory bandwidth requirement, the on-chip memory is built from multiple single-port SRAM (static RAM) banks that support parallel access, so as to reduce area and power consumption.
In this embodiment, in step 2) P of the plurality of parallel single-port memory banks are addressed in a low-order interleaved manner, and P is an odd number not less than 3; in step 3), P memory banks in each storage group are addressed in a low-order interleaved manner, and P is an odd number not less than 3.
In this embodiment, in step 2) the linear address of the data to be operated on is mapped to the two-dimensional conflict-free memory access address (X, Y) according to formula (4);
where Y is the position of the target memory bank holding the data to be operated on, X is the row address of the data inside the target bank, Addr is the linear address of the data, W is the operand granularity, P is the number of memory banks using low-order interleaved addressing, mod denotes the modulo operation, N is the FFT sequence length, and ⌊Addr/W⌋ denotes the largest integer not greater than Addr/W.
In this embodiment, for a processor that does not use a SIMD structure, assume the memory capacity is 2^H bytes with H a positive integer, the operand granularity is W bytes with W a positive integer power of 2, and all or some of the memory banks use low-order interleaved addressing, the number of low-order interleaved banks being P (an odd number not less than 3). As shown in Fig. 3, the byte address of the whole memory is H bits wide and is written Addr[H-1:0], and the data address in units of the operand granularity is Data_Addr = Addr/W. The actual location of a datum in the memory banks can be expressed by the two-dimensional coordinate (X, Y), where Y is the index of the memory bank actually accessed and X is the row address of the data inside the selected bank. The mapping between the linear address Addr and the actual location (X, Y) is given by formula (4); the actual location (X, Y) obtained by this mapping is the two-dimensional conflict-free memory access address.
With this memory bank organization, a processor that does not use a SIMD structure can compute an FFT of sequence length N (N a positive integer power of 2) with completely conflict-free access throughout the dual-access parallel implementation of the FFT butterflies.
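Since formula (4) is not reproduced above, the C sketch below shows one mapping that is consistent with the description (Data_Addr = ⌊Addr/W⌋, bank index Y = Data_Addr mod P, row X = ⌊Data_Addr/P⌋); it is an assumption for illustration, not necessarily the exact formula used by the patent. Under this assumed mapping, because P is odd and every butterfly spacing N/2^j is a power of two, the two operands of a butterfly always fall in different banks.

```c
/* Illustrative 2-D mapping for the non-SIMD case, consistent with the text
 * but NOT necessarily identical to formula (4):
 *   Data_Addr = Addr / W, Y = Data_Addr mod P (target bank), X = Data_Addr / P (row).
 * P odd guarantees gcd(P, N/2^j) = 1, so a butterfly's two operands differ in Y. */
typedef struct { unsigned X; unsigned Y; } Addr2D;

static Addr2D map_2d(unsigned Addr, unsigned W, unsigned P)
{
    unsigned data_addr = Addr / W;      /* address in operand-granularity units */
    Addr2D a;
    a.Y = data_addr % P;                /* which of the P interleaved banks     */
    a.X = data_addr / P;                /* row address inside that bank         */
    return a;
}
```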
In this embodiment, in step 3) the linear address of the data to be operated on is mapped to the three-dimensional conflict-free memory access address (X, Y, Z) according to the following formula;
where Y is the position of the target storage group holding the data to be operated on, Z is the position of the target memory bank within the target storage group, and X is the row address of the data inside the target bank; Addr is the linear address of the data, G is the SIMD width and is a positive integer power of 2, P is the number of low-order interleaved memory banks in each storage group, mod denotes the modulo operation, and N is the FFT sequence length.
In this embodiment, for a processor using a SIMD structure, assume the SIMD width is G, with G a positive integer power of 2. The bank structure of the non-SIMD case is replicated G times through SIMD extension to obtain the SIMD memory bank structure, in which each group, in whole or in part, still uses low-order interleaved addressing over P banks (P an odd number not less than 3), each bank being W bytes wide. As shown in Fig. 4, assume the memory capacity is 2^H bytes, the operand width is W bytes with W an integer power of 2, and the sequence length is N. The byte address of the whole memory is H bits wide and is written Addr[H-1:0], and the data address in units of the operand granularity is Data_Addr = Addr/W. The actual location of a datum in the memory banks can be expressed by the three-dimensional coordinate (X, Y, Z), where Y is the position among the G groups actually accessed, Z is the position among the P banks of that group, and X is the corresponding row address. The mapping between the linear address Addr and the actual location (X, Y, Z) for all or some of the memory banks is given by formula (5); the actual location (X, Y, Z) obtained by this mapping is the three-dimensional conflict-free memory access address.
With this memory bank organization, a processor using a SIMD structure can perform the vectorized parallel FFT computation through dual access in a completely conflict-free manner.
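Formula (5) is likewise not reproduced above; the following C sketch is one plausible three-level decomposition consistent with the description (G groups for the SIMD lanes, P odd low-order interleaved banks per group, row address inside the bank). The names and the exact ordering of the mod/div steps are assumptions of this illustration, not the patent's formula.

```c
/* Illustrative 3-D mapping for the SIMD case, an assumption rather than formula (5):
 *   Data_Addr = Addr / W,
 *   Y = Data_Addr mod G        (which of the G storage groups / SIMD lanes),
 *   Z = (Data_Addr / G) mod P  (which of the P banks inside that group),
 *   X = (Data_Addr / G) / P    (row address inside the bank). */
typedef struct { unsigned X; unsigned Y; unsigned Z; } Addr3D;

static Addr3D map_3d(unsigned Addr, unsigned W, unsigned P, unsigned G)
{
    unsigned data_addr = Addr / W;      /* operand-granularity address      */
    unsigned in_group  = data_addr / G; /* address as seen inside one group */
    Addr3D a;
    a.Y = data_addr % G;
    a.Z = in_group % P;
    a.X = in_group / P;
    return a;
}
```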
The present invention is further illustrated below by an example with operand width W of 4 and sequence length N.
As shown in Fig. 5, under the non-SIMD structure of this embodiment, P = 3 memory banks are used with low-order interleaved addressing, the operand width W is 4 and the sequence length is N. The two-dimensional conflict-free memory access address is expressed by the coordinate (X, Y), where Y is the index of the target memory bank and X is the row address of the data inside the target bank; the linear address Addr of the data to be operated on is mapped to the two-dimensional conflict-free memory access address (X, Y) by formula (6):
where Y is the position of the data among the 3 memory banks and X is the corresponding row address of the data inside the bank.
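A quick check of this P = 3 example, taking N = 16 as in Fig. 1 and using the assumed bank index Y = (Addr/W) mod P from the sketch above rather than formula (6) itself: every butterfly spacing 8, 4, 2, 1 is a power of two and therefore not a multiple of 3, so the two operands of a butterfly never share a bank.

```c
#include <assert.h>

/* Assumed check for the P = 3 example (N = 16 chosen as in Fig. 1):
 * Data_Addr = Addr / W runs over 0..N-1, and for every butterfly spacing
 * s in {8, 4, 2, 1} the operand pair (n, n+s) maps to two different banks. */
static void check_p3_example(void)
{
    const unsigned P = 3, N = 16;
    for (unsigned s = N / 2; s >= 1; s /= 2)
        for (unsigned n = 0; n + s < N; n++)
            assert(n % P != (n + s) % P);   /* never the same single-port bank */
}
```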
As shown in Fig. 6, under the SIMD structure of this embodiment, each storage group uses P = 3 memory banks with low-order interleaved addressing, the operand width W is 4, the sequence length is N, and the SIMD width is 16, so the whole set of memory banks is divided into 16 groups. The three-dimensional conflict-free memory access address is expressed by the coordinate (X, Y, Z); the linear address Addr of the data to be operated on is mapped to the three-dimensional conflict-free memory access address (X, Y, Z) by formula (7):
where Y is the position of the data among the 16 groups, Z is the position of the data among the 3 memory banks of that group, and X is the corresponding row address of the data inside the bank.
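A corresponding check for the SIMD example, again under the assumed group and bank indices Y = Data_Addr mod G and Z = (Data_Addr/G) mod P rather than formula (7); N = 256 is an arbitrary power of two chosen only so that butterfly spacings both above and below G = 16 occur.

```c
#include <assert.h>

/* Assumed check for the SIMD example (G = 16 groups, P = 3 banks per group):
 * a spacing below G separates the two operands into different groups, and a
 * spacing that is a multiple of G keeps them in one group but, because P is
 * odd, in different banks -- the operand pair never shares a single-port bank. */
static void check_simd_example(void)
{
    const unsigned G = 16, P = 3, N = 256;
    for (unsigned s = N / 2; s >= 1; s /= 2)
        for (unsigned n = 0; n + s < N; n++) {
            unsigned y1 = n % G,       y2 = (n + s) % G;        /* group indices */
            unsigned z1 = (n / G) % P, z2 = ((n + s) / G) % P;  /* bank indices  */
            assert(y1 != y2 || z1 != z2);   /* never same group AND same bank    */
        }
}
```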
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any simple modification, equivalent variation or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.

Claims (4)

1. A conflict-free memory access method for FFT parallel computation, characterized in that the method comprises the steps of:
1) determining the structure of the current processor; if it is a SIMD structure, proceeding to step 3); otherwise proceeding to step 2);
2) configuring one storage group to store the operand data, the storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a two-dimensional conflict-free memory access address, the two-dimensional conflict-free memory access address corresponding to the target memory bank holding the data and the address inside the target bank, and performing data access according to the two-dimensional conflict-free memory access address;
3) configuring a plurality of storage groups to store the operand data, each storage group comprising a plurality of parallel single-port memory banks; when FFT computation is performed, mapping the linear address of the data to be operated on to a three-dimensional conflict-free memory access address, the three-dimensional conflict-free memory access address corresponding to the target storage group holding the data, the target memory bank and the address inside the target bank, and performing data access according to the three-dimensional conflict-free memory access address.
2. The conflict-free memory access method for FFT parallel computation according to claim 1, characterized in that: in step 2), P of the plurality of parallel single-port memory banks are addressed in a low-order interleaved manner, and P is an odd number not less than 3; in step 3), P memory banks in each storage group are addressed in a low-order interleaved manner, and P is an odd number not less than 3.
3. The conflict-free memory access method for FFT parallel computation according to claim 2, characterized in that: in step 2) the linear address of the data to be operated on is mapped to the two-dimensional conflict-free memory access address (X, Y) according to the following formula;
where Y is the position of the target memory bank holding the data to be operated on, X is the row address of the data inside the target bank, Addr is the linear address of the data, W is the operand granularity, P is the number of memory banks using low-order interleaved addressing, mod denotes the modulo operation, and N is the FFT sequence length.
4. The conflict-free memory access method for FFT parallel computation according to claim 2 or 3, characterized in that: in step 3) the linear address of the data to be operated on is mapped to the three-dimensional conflict-free memory access address (X, Y, Z) according to the following formula;
where Y is the position of the target storage group holding the data to be operated on, Z is the position of the target memory bank within the target storage group, and X is the row address of the data inside the target bank; Addr is the linear address of the data, G is the SIMD width and is a positive integer power of 2, P is the number of low-order interleaved memory banks in each storage group, mod denotes the modulo operation, and N is the FFT sequence length.
CN201510137874.6A 2015-03-26 2015-03-26 Conflict-free storage access method for FFT parallel computation Active CN104699624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510137874.6A CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free storage access method for FFT parallel computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510137874.6A CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free storage access method for FFT parallel computation

Publications (2)

Publication Number Publication Date
CN104699624A true CN104699624A (en) 2015-06-10
CN104699624B CN104699624B (en) 2018-01-23

Family

ID=53346775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510137874.6A Active CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free storage access method for FFT parallel computation

Country Status (1)

Country Link
CN (1) CN104699624B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748723A (en) * 2017-09-28 2018-03-02 中国人民解放军国防科技大学 Storage method and access device supporting conflict-free stepping block-by-block access
CN109635235A (en) * 2018-11-06 2019-04-16 海南大学 Triangular-part storage device and parallel reading method for a self-adjoint matrix
CN111158757A (en) * 2019-12-31 2020-05-15 深圳芯英科技有限公司 Parallel access device and method and chip
CN112163187A (en) * 2020-11-18 2021-01-01 无锡江南计算技术研究所 Ultra-long point high-performance FFT (fast Fourier transform) computing device
CN112822139A (en) * 2021-02-04 2021-05-18 展讯半导体(成都)有限公司 Data input and data conversion method and device
CN113094639A (en) * 2021-03-15 2021-07-09 Oppo广东移动通信有限公司 DFT parallel processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172529A1 (en) * 2007-01-17 2008-07-17 Tushar Prakash Ringe Novel context instruction cache architecture for a digital signal processor
CN101290613A (en) * 2007-04-16 2008-10-22 卓胜微电子(上海)有限公司 FFT processor data storage system and method
CN101339546A (en) * 2008-08-07 2009-01-07 那微微电子科技(上海)有限公司 Address mappings method and operand parallel FFT processing system
CN102508802A (en) * 2011-11-16 2012-06-20 刘大可 Data writing method based on parallel random storages, data reading method based on same, data writing device based on same, data reading device based on same and system
CN103116555A (en) * 2013-03-05 2013-05-22 中国人民解放军国防科学技术大学 Data access method based on multi-body parallel cache structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172529A1 (en) * 2007-01-17 2008-07-17 Tushar Prakash Ringe Novel context instruction cache architecture for a digital signal processor
CN101290613A (en) * 2007-04-16 2008-10-22 卓胜微电子(上海)有限公司 FFT processor data storage system and method
CN101339546A (en) * 2008-08-07 2009-01-07 那微微电子科技(上海)有限公司 Address mappings method and operand parallel FFT processing system
CN102508802A (en) * 2011-11-16 2012-06-20 刘大可 Data writing method based on parallel random storages, data reading method based on same, data writing device based on same, data reading device based on same and system
CN103116555A (en) * 2013-03-05 2013-05-22 中国人民解放军国防科学技术大学 Data access method based on multi-body parallel cache structure

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748723A (en) * 2017-09-28 2018-03-02 中国人民解放军国防科技大学 Storage method and access device supporting conflict-free stepping block-by-block access
CN107748723B (en) * 2017-09-28 2020-03-20 中国人民解放军国防科技大学 Storage method and access device supporting conflict-free stepping block-by-block access
CN109635235A (en) * 2018-11-06 2019-04-16 海南大学 Triangular-part storage device and parallel reading method for a self-adjoint matrix
CN111158757A (en) * 2019-12-31 2020-05-15 深圳芯英科技有限公司 Parallel access device and method and chip
CN111158757B (en) * 2019-12-31 2021-11-30 中昊芯英(杭州)科技有限公司 Parallel access device and method and chip
CN112163187A (en) * 2020-11-18 2021-01-01 无锡江南计算技术研究所 Ultra-long point high-performance FFT (fast Fourier transform) computing device
CN112163187B (en) * 2020-11-18 2023-07-07 无锡江南计算技术研究所 Ultra-long point high-performance FFT (fast Fourier transform) computing device
CN112822139A (en) * 2021-02-04 2021-05-18 展讯半导体(成都)有限公司 Data input and data conversion method and device
CN113094639A (en) * 2021-03-15 2021-07-09 Oppo广东移动通信有限公司 DFT parallel processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104699624B (en) 2018-01-23

Similar Documents

Publication Publication Date Title
CN104699624A (en) FFT (fast Fourier transform) parallel computing-oriented conflict-free storage access method
KR20220129107A (en) Matrix multiplier
US8364736B2 (en) Memory-based FFT/IFFT processor and design method for general sized memory-based FFT processor
CN102375805B (en) Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)
US20220360428A1 (en) Method and Apparatus for Configuring a Reduced Instruction Set Computer Processor Architecture to Execute a Fully Homomorphic Encryption Algorithm
CN103970720B (en) Based on extensive coarseness imbedded reconfigurable system and its processing method
CN105335331B (en) A kind of SHA256 realization method and systems based on extensive coarseness reconfigurable processor
CN103970718A (en) Quick Fourier transformation implementation device and method
CN101847137B (en) FFT processor for realizing 2FFT-based calculation
CN101571849A (en) Fast Foourier transform processor and method thereof
CN103699515A (en) FFT (fast Fourier transform) parallel processing device and FFT parallel processing method
CN105224505A (en) Based on the FFT accelerator installation of matrix transpose operation
CN102495721A (en) Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration
CN103777896A (en) 3D memory based address generator
CN112446471B (en) Convolution acceleration method based on heterogeneous many-core processor
KR101696987B1 (en) Fft/dft reverse arrangement system and method and computing system thereof
CN103544111B (en) A kind of hybrid base FFT method based on real-time process
CN105718424B (en) A kind of parallel Fast Fourier Transform processing method
CN105183701A (en) 1536-point FFT processing mode and related equipment
CN104504205A (en) Parallelizing two-dimensional division method of symmetrical FIR (Finite Impulse Response) algorithm and hardware structure of parallelizing two-dimensional division method
CN102567283B (en) Method for small matrix inversion by using GPU (graphic processing unit)
CN104050148A (en) FFT accelerator
CN103034621A (en) Address mapping method and system of radix-2*K parallel FFT (fast Fourier transform) architecture
Hussain et al. Evaluation of Radix-2 and Radix-4 FFT processing on a reconfigurable platform
WO2021026196A1 (en) Configuring a reduced instruction set computer processor architecture to execute a fully homomorphic encryption algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant