CN104699624B - Conflict-free memory access method for parallel FFT computation - Google Patents

Conflict-free memory access method for parallel FFT computation Download PDF

Info

Publication number
CN104699624B
CN104699624B
Authority
CN
China
Prior art keywords
address
operational data
conflict-free
fft
memory access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510137874.6A
Other languages
Chinese (zh)
Other versions
CN104699624A (en)
Inventor
陈海燕
刘胜
陈书明
郭阳
燕世林
刘仲
万江华
陈胜刚
杨超
梁停雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201510137874.6A priority Critical patent/CN104699624B/en
Publication of CN104699624A publication Critical patent/CN104699624A/en
Application granted granted Critical
Publication of CN104699624B publication Critical patent/CN104699624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Complex Calculations (AREA)

Abstract

The present invention discloses a conflict-free memory access method for parallel FFT computation. The steps of the method include: 1) determining the architecture of the current processor; if it is a SIMD architecture, performing step 3), otherwise performing step 2); 2) configuring one memory group to store the operand data, the memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, mapping the address of each operand to a two-dimensional conflict-free access address consisting of the target memory bank and the address within that bank; 3) configuring multiple parallel memory groups to store the operand data, each memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, mapping the address of each operand to a three-dimensional conflict-free access address consisting of the target memory group, the target memory bank, and the address within that bank. The present invention achieves conflict-free access for parallel FFT computation, with the advantages of high memory-access efficiency and small hardware overhead.

Description

Conflict-free memory access method for parallel FFT computation
Technical field
The present invention relates to the field of FFT computation in microprocessors, and more particularly to a conflict-free memory access method for parallel FFT computation.
Background technology
The FFT (Fast Fourier Transform) algorithm, proposed in 1965 by J.W. Cooley and J.W. Tukey, is a fast algorithm for computing the Discrete Fourier Transform (DFT). It is a core algorithm in many embedded applications such as wireless communication and image processing, and its performance often determines the real-time processing capability of the whole digital signal processing system. The continuous growth of application demands places ever higher requirements on FFT performance, and the development of digital signal processor technology has made efficient, programmable parallel FFT algorithms feasible.
Current implementations of the FFT algorithm fall into two categories. The first is a dedicated FFT hardware accelerator, for example an FPGA-based design or an on-chip FFT hardware co-processor in a microprocessor, used only to accelerate the FFT. The second is a software implementation programmed on the instruction set architecture of a general-purpose microprocessor or digital signal processor. The first approach has a narrow application range, cannot keep up with the development and change of requirements, is expensive to implement in hardware, and lacks flexibility. The second approach, being based on instruction set programming, offers flexibility and generality, and with the development of high-performance microprocessor technology it can achieve performance comparable to that of a dedicated FFT hardware accelerator.
The DFT of an N-point sequence x(n) is computed as
X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, 0 ≤ k < N,
where the twiddle factor is W_N = e^{-j 2\pi / N} and the sequence length N is assumed to be an integer power of 2.
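For concreteness, the definition can be evaluated directly; the following naive O(N^2) C sketch (illustrative only, not part of the patent) computes exactly the sum above:

```c
#include <complex.h>
#include <math.h>

/* Naive O(N^2) DFT: X[k] = sum_{n=0}^{N-1} x[n] * W_N^(n*k), W_N = e^(-j*2*pi/N). */
static void dft(const double complex *x, double complex *X, int N)
{
    const double pi = acos(-1.0);
    for (int k = 0; k < N; k++) {
        double complex acc = 0;
        for (int n = 0; n < N; n++)
            acc += x[n] * cexp(-I * 2.0 * pi * (double)n * (double)k / (double)N);
        X[k] = acc;
    }
}
```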
The radix-2 decimation-in-frequency FFT algorithm exploits the symmetry, periodicity, and reducibility of the twiddle factor W_N. The N-point sequence x(n) is split into its first and second halves by index, and the N-point DFT X(k), k = 0, 1, ..., N-1, is divided by the parity of the frequency-domain index into two N/2-point DFTs. Letting k = 2r and k = 2r+1 with r = 0, 1, 2, ..., N/2-1, and separating X(k) by even and odd index, we have:
X(2r) = \sum_{n=0}^{N/2-1} [x(n) + x(n+N/2)] W_{N/2}^{nr}
X(2r+1) = \sum_{n=0}^{N/2-1} [x(n) - x(n+N/2)] W_N^{n} W_{N/2}^{nr}
If N/2 is still even, the decomposition continues in the same manner until only 2-point DFTs remain.
The radix-2 butterfly computation flow of a sequence of N = 16 points is shown in Fig. 1; the 16-point radix-2 FFT is successively decomposed into 8-point, 4-point, and 2-point DFTs. A radix-2 FFT of a length-N sequence requires log2 N stages, each containing N/2 butterfly operations. Within each stage the two operands of every butterfly unit are equally spaced and undergo the same butterfly structure; the data spacing of the butterfly unit is N/2^j, where j is the stage index, j = 1, 2, ..., log2 N. The sum of the two operands is stored back to the original position of the first operand, while the difference of the two operands multiplied by the butterfly coefficient is stored back to the original position of the second operand. This characteristic of FFT computation is particularly suited to data-parallel processing and vectorized operation on a SIMD extension architecture.
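The stage structure just described can be summarized in a short C sketch (a minimal illustration of the description above, not code from the patent; bit-reversal reordering of the output is omitted):

```c
#include <complex.h>
#include <math.h>

/* In-place radix-2 DIF FFT skeleton: at stage j the two operands of each
 * butterfly are N/2^j apart; the sum goes back to the first operand's slot,
 * the difference times the twiddle factor goes back to the second slot. */
static void fft_dif_radix2(double complex *x, int N)
{
    const double pi = acos(-1.0);
    for (int span = N / 2; span >= 1; span /= 2) {          /* span = N/2^j */
        for (int block = 0; block < N; block += 2 * span) {
            for (int k = 0; k < span; k++) {
                double complex a = x[block + k];
                double complex b = x[block + k + span];
                double complex w = cexp(-I * 2.0 * pi * k / (2.0 * span));
                x[block + k]        = a + b;                 /* sum            */
                x[block + k + span] = (a - b) * w;           /* diff * twiddle */
            }
        }
    }
    /* The result is left in bit-reversed order; a reordering pass would follow. */
}
```

Each stage touches every element once and issues N/2 independent butterflies, which is what makes the per-stage accesses amenable to dual-issue and SIMD vectorization.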
With the development of integrated circuit technology and performance requirements, single instruction multiple data (SIMD) has become an important extension of high-performance microprocessors, and a single chip can integrate more and more functional units. Using a superscalar or very long instruction word (VLIW) structure, multiple functional units can operate on data in SIMD fashion, exploiting more instruction-level and data-level parallelism and thereby obtaining higher performance. To make full use of the multipliers and adders in the microprocessor's arithmetic units and improve computational efficiency, high-performance microprocessors generally support parallel access operations with dual (or greater) access bandwidth. Besides the coefficient constant, each FFT butterfly also needs two operands to be supplied in one cycle, so the FFT computation needs the microprocessor's dual access bandwidth to supply its operands.
Because a dual-port memory bank of the same capacity usually has twice the area and power consumption of a single-port memory bank, and on-chip bulk storage is strictly limited in area and power, the on-chip storage structure is typically organized as a power-of-2 number of single-port memory banks interleaved on the low-order address bits, which provides dual access bandwidth at relatively low area and power cost. However, owing to the discontinuity and symmetry of the operand addresses of the FFT butterflies, every parallel butterfly computation incurs memory access conflicts; in a SIMD extension architecture in particular, such conflicts reduce the utilization of the vector memory bandwidth, so the actual FFT computational efficiency falls significantly below the theoretical peak.
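The conflict can be made concrete with a small check (illustrative only, not part of the patent): with a low-order-interleaved bank index computed as element index mod bank count, a power-of-2 bank count always collides for power-of-2 butterfly distances, whereas an odd bank count such as 3 never does.

```c
#include <stdio.h>

/* Compare a power-of-2 bank count against an odd one for a butterfly
 * distance that is a power of 2 (here 4). bank = element index mod banks. */
int main(void)
{
    const int stride = 4;                          /* butterfly distance N/2^j */
    for (int banks = 3; banks <= 4; banks++) {
        int conflicts = 0;
        for (int addr = 0; addr < 16; addr++)
            if (addr % banks == (addr + stride) % banks)
                conflicts++;
        printf("%d banks: %d of 16 operand pairs hit the same bank\n",
               banks, conflicts);
    }
    return 0;
}
```

With 4 banks every pair collides (the stride is a multiple of the bank count), while with 3 banks no pair collides, since a power-of-2 stride is never a multiple of an odd bank count greater than 1 — which is precisely why the method below interleaves over an odd number of banks.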
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, the present invention provides a conflict-free memory access method for parallel FFT computation that is simple to implement, eliminates memory access conflicts in parallel FFT computation, has high memory-access efficiency, and incurs small hardware overhead.
In order to solve the above technical problems, the technical scheme proposed by the present invention is:
A conflict-free memory access method for parallel FFT computation, the steps of which include:
1) Determine the architecture of the current processor; if it is a SIMD architecture, go to step 3); otherwise go to step 2);
2) Configure one memory group to store the operand data, the memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, map the linear address of each operand to a two-dimensional conflict-free access address, which corresponds to the target memory bank holding the operand and the address within that bank, and access the data according to this two-dimensional conflict-free access address;
3) Configure multiple memory groups to store the operand data, each memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, map the linear address of each operand to a three-dimensional conflict-free access address, which corresponds to the target memory group, the target memory bank, and the address within that bank, and access the data according to this three-dimensional conflict-free access address.
As a further improvement of the present invention: in step 2), P of the multiple parallel single-port memory banks are addressed in low-order interleaved fashion, where P is an odd number not less than 3; in step 3), P memory banks of each memory group are addressed in low-order interleaved fashion, where P is an odd number not less than 3.
As a further improvement on the present invention:The linear address for treating operational data is mapped according to the following formula in the step 2) For two-dimentional Lothrus apterus memory access address (X, Y);
Wherein, Y is the target memory bank holding the operand, X is the row address of the operand within the target memory bank, Addr is the linear address of the operand, W is the operand granularity, P is the number of memory banks addressed by low-order interleaving, mod denotes the modulo operation, and N is the sequence length of the FFT computation.
As a further improvement of the present invention: in step 3), the linear address of the operand is mapped to the three-dimensional conflict-free access address (X, Y, Z) according to the following formula;
Wherein, Y is the target memory group holding the operand, Z is the position of the target memory bank within the target memory group, X is the row address of the operand within the target memory bank; Addr is the linear address of the operand, G is the SIMD width and is a positive integer power of 2, P is the number of memory banks addressed by low-order interleaving within each memory group, mod denotes the modulo operation, and N is the sequence length of the FFT computation.
Compared with the prior art, the advantages of the present invention are:
1) For dual-access microprocessors of non-SIMD and SIMD architectures respectively, the present invention organizes the single-port memory banks as a one-dimensional memory group and as a two-dimensional array of memory groups. When performing the FFT computation, the linear address of each operand is mapped to a two-dimensional or three-dimensional conflict-free access address, which effectively eliminates access conflicts in FFT computation, realizes conflict-free parallel memory access for the FFT, and improves FFT efficiency.
2) For microprocessors with a SIMD extension architecture, the present invention organizes the single-port memory banks into a two-dimensional storage array operated in SIMD fashion, which supports conflict-free access for the vectorized extension of parallel FFT algorithms and thereby greatly improves FFT efficiency.
3) The present invention maps the linear address of each operand to the corresponding memory bank and the address within the bank when no SIMD architecture is used, and to the corresponding memory group, memory bank, and address within the bank when a SIMD architecture is used. Only the calculation of the access address is changed, so the required hardware overhead is very small.
Brief description of the drawings
Fig. 1 is a schematic diagram of the radix-2 FFT butterfly computation of length 16.
Fig. 2 is a schematic flow diagram of the conflict-free memory access method for parallel FFT computation of the present embodiment.
Fig. 3 is a schematic diagram of the memory bank organization under a non-SIMD architecture in the present embodiment.
Fig. 4 is a schematic diagram of the memory bank organization under a SIMD architecture in the present embodiment.
Fig. 5 is a schematic diagram of the memory bank organization under a non-SIMD architecture in a specific embodiment of the present invention.
Fig. 6 is a schematic diagram of the memory bank organization under a SIMD architecture in a specific embodiment of the present invention.
Embodiment
The invention is further described below with reference to the drawings and specific preferred embodiments, without thereby limiting the scope of protection of the invention.
As shown in Fig. 2, the conflict-free memory access method for parallel FFT computation of the present embodiment comprises the following steps:
1) Determine the architecture of the current processor; if it is a SIMD architecture, go to step 3); otherwise go to step 2);
2) Configure one memory group to store the operand data, the memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, map the linear address of each operand to a two-dimensional conflict-free access address, which corresponds to the target memory bank holding the operand and the address within that bank, and access the data according to this two-dimensional conflict-free access address;
3) Configure multiple memory groups to store the operand data, each memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, map the linear address of each operand to a three-dimensional conflict-free access address, which corresponds to the target memory group, the target memory bank, and the address within that bank, and access the data according to this three-dimensional conflict-free access address.
In the present embodiment, to meet the dual-access memory bandwidth demand of the microprocessor, the on-chip memory is built from multiple single-port SRAM (static RAM) memory banks that support parallel access, so as to reduce area and power consumption.
In the present embodiment, P of the multiple parallel single-port memory banks in step 2) are addressed in low-order interleaved fashion, where P is an odd number not less than 3; in step 3), P memory banks of each memory group are addressed in low-order interleaved fashion, where P is an odd number not less than 3.
In the present embodiment, in step 2) the linear address of the operand is mapped to the two-dimensional conflict-free access address (X, Y) according to formula (4);
Wherein, Y is the target memory bank holding the operand, X is the row address of the operand within the target memory bank, Addr is the linear address of the operand, W is the operand granularity, P is the number of memory banks addressed by low-order interleaving, mod denotes the modulo operation, and N is the sequence length of the FFT computation. ⌊Addr/W⌋ denotes the largest integer less than or equal to Addr/W.
In a processor that does not use a SIMD architecture, the present embodiment assumes a memory capacity of 2^H bytes, where H is a positive integer, an operand granularity of W bytes, with W a positive integer power of 2, and that all or some of the memory banks use low-order interleaved addressing, the number of low-order interleaved banks being P (P an odd number not less than 3). As shown in Fig. 3, the byte address of the whole memory is H bits wide, written Addr[H-1:0]; the data address in units of the operand granularity is Data_Addr = Addr/W. The actual address of a datum in the memory banks can be represented by the two-dimensional coordinate (X, Y), where Y denotes the index of the SRAM bank actually accessed and X denotes the row address within the selected bank. The mapping between the linear address Addr of a memory bank and the actual address (X, Y) is given by formula (4); the actual address (X, Y) is the mapped two-dimensional conflict-free access address.
When the above memory bank organization is used in a processor without a SIMD architecture to perform an FFT of sequence length N (N a positive integer power of 2), the butterfly computations of the FFT implemented in parallel through dual accesses can be executed entirely without access conflicts.
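Formula (4) itself appears only as an image in the original publication and is not reproduced here. The C sketch below therefore assumes the simplest low-order interleaving over an odd bank count, Y = ⌊Addr/W⌋ mod P and X = ⌊Addr/W⌋ / P; this is a hypothetical stand-in rather than the patented formula (which also involves N). Under this assumption the two operands of every butterfly, being a power-of-2 distance apart, always fall in different banks:

```c
#include <stdio.h>

/* Hypothetical 2-D mapping for the non-SIMD, dual-access case:
 * P single-port banks (P odd, >= 3), operand granularity W bytes. */
struct addr2d { int bank /* Y */, row /* X */; };

static struct addr2d map2d(unsigned byte_addr, unsigned W, unsigned P)
{
    unsigned d = byte_addr / W;                     /* Data_Addr = Addr / W */
    struct addr2d a = { (int)(d % P), (int)(d / P) };
    return a;
}

int main(void)
{
    const unsigned W = 4, P = 3, N = 16;            /* values of the embodiment */
    for (unsigned dist = N / 2; dist >= 1; dist /= 2)        /* dist = N/2^j */
        for (unsigned i = 0; i + dist < N; i++) {
            struct addr2d a = map2d(i * W, W, P);
            struct addr2d b = map2d((i + dist) * W, W, P);
            if (a.bank == b.bank)
                printf("conflict at distance %u, element %u\n", dist, i);
        }
    printf("done (no conflict reported: the two butterfly operands always map to different banks)\n");
    return 0;
}
```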
In the present embodiment, in step 3) the linear address of the operand is mapped to the three-dimensional conflict-free access address (X, Y, Z) according to the following formula;
Wherein, Y is the target memory group holding the operand, Z is the position of the target memory bank within the target memory group, X is the row address of the operand within the target memory bank; Addr is the linear address of the operand, G is the SIMD width and is a positive integer power of 2, P is the number of memory banks addressed by low-order interleaving within each memory group, mod denotes the modulo operation, and N is the sequence length of the FFT computation.
In a processor using a SIMD architecture, the present embodiment assumes a SIMD width of G, where G is a positive integer power of 2. The non-SIMD memory bank structure is replicated G times as a SIMD extension to obtain the SIMD memory bank structure; within each single structure, all or some of the banks still use low-order interleaved addressing, the number of low-order interleaved banks being P (P an odd number not less than 3), and each bank is W bytes wide. Assume the memory capacity is 2^H bytes, the operand width is W bytes with W an integer power of 2, and the sequence length is N. The byte address of the whole memory is H bits wide, written Addr[H-1:0]; the data address in units of the operand granularity is Data_Addr = Addr/W. The actual address of a datum in the memory banks can be represented by the three-dimensional coordinate (X, Y, Z), where Y denotes the position of the access among the G regions, Z denotes the position among the P banks of that region, and X denotes the corresponding row address. The mapping between the linear address Addr of all or some of the memory banks and the actual address (X, Y, Z) is given by formula (5); the actual address (X, Y, Z) is the mapped three-dimensional conflict-free access address.
When the above memory bank organization is used in a processor with a SIMD architecture, vectorized FFT concurrent operations carried out through dual accesses can be executed entirely without conflicts.
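As with formula (4), formula (5) appears only as an image in the original publication. The sketch below uses a hypothetical three-level split in which the lowest-order part of the data address selects the region (Y), the next level interleaves over the P banks of that region (Z), and the remainder is the row (X); this is a stand-in consistent with the description above, not the patented expression. It checks that a G-wide vector access and its partner vector at a power-of-2 butterfly distance never contend for the same bank of the same region:

```c
#include <stdio.h>

/* Hypothetical 3-D mapping for a SIMD processor of width G:
 * Y = region/group (0..G-1), Z = bank inside the region (0..P-1), X = row. */
struct addr3d { int group /* Y */, bank /* Z */, row /* X */; };

static struct addr3d map3d(unsigned byte_addr, unsigned W,
                           unsigned G, unsigned P)
{
    unsigned d = byte_addr / W;                     /* Data_Addr = Addr / W */
    struct addr3d a;
    a.group = (int)(d % G);                         /* SIMD lane / region   */
    a.bank  = (int)((d / G) % P);                   /* odd-count interleave */
    a.row   = (int)(d / G / P);
    return a;
}

int main(void)
{
    const unsigned W = 4, G = 16, P = 3;            /* embodiment values        */
    const unsigned N = 64;                          /* demo sequence length     */
    /* One vector access reads G consecutive elements from 'base'; its butterfly
     * partner vector starts at 'base + dist'.  Stages with dist < G keep both
     * operands inside a single vector and are not modeled here. */
    for (unsigned dist = N / 2; dist >= G; dist /= 2)
        for (unsigned base = 0; base + dist + G <= N; base += G)
            for (unsigned lane = 0; lane < G; lane++) {
                struct addr3d a = map3d((base + lane) * W, W, G, P);
                struct addr3d b = map3d((base + dist + lane) * W, W, G, P);
                if (a.group == b.group && a.bank == b.bank)
                    printf("conflict: base %u dist %u lane %u\n", base, dist, lane);
            }
    printf("done (no conflict reported: the two vector operands never share a bank)\n");
    return 0;
}
```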
The present invention is further illustrated below with the operand width W taken as 4 and the sequence length taken as N.
As shown in Fig. 5, under the non-SIMD architecture of the present embodiment, P = 3 memory banks are addressed with low-order interleaving, the operand width W is 4, and the sequence length is N. The two-dimensional conflict-free access address is denoted by the coordinate (X, Y), where Y denotes the index of the target SRAM bank and X denotes the row address of the operand within the target bank. The linear address Addr of the operand is mapped to the two-dimensional conflict-free access address (X, Y) by formula (6):
Wherein, Y denotes the position of the operand among the 3 memory banks, and X denotes the corresponding row address of the operand within the bank.
As shown in Fig. 6, under the SIMD architecture of the present embodiment, P = 3 memory banks within each memory group are addressed with low-order interleaving, the operand width W is 4, the sequence length is N, and the SIMD width is taken as 16, so the whole set of memory banks is divided into 16 regions. The three-dimensional conflict-free access address is denoted by the coordinate (X, Y, Z), and the linear address Addr of the operand is mapped to the three-dimensional conflict-free access address (X, Y, Z) by formula (7):
Wherein, Y denotes the position of the operand among the 16 regions, Z denotes the position of the operand among the 3 memory banks of the region, and X denotes the corresponding row address of the operand within the bank.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above through preferred embodiments, it is not limited thereto. Any simple modification, equivalent change, or variation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical scheme of the present invention, shall fall within the scope of protection of the technical scheme of the present invention.

Claims (1)

1. A conflict-free memory access method for parallel FFT computation, characterized in that the steps include:
1) determining the architecture of the current processor; if it is a SIMD architecture, going to step 3); otherwise going to step 2);
2) configuring one memory group to store the operand data, the memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, mapping the linear address of each operand to a two-dimensional conflict-free access address, which corresponds to the target memory bank holding the operand and the address within that bank, and accessing the data according to the two-dimensional conflict-free access address;
3) configuring multiple memory groups to store the operand data, each memory group comprising multiple parallel single-port memory banks; when performing the FFT computation, mapping the linear address of each operand to a three-dimensional conflict-free access address, which corresponds to the target memory group, the target memory bank, and the address within that bank, and accessing the data according to the three-dimensional conflict-free access address;
wherein in step 2), P of the multiple parallel single-port memory banks are addressed in low-order interleaved fashion, and P is an odd number not less than 3; in step 3), P memory banks of each memory group are addressed in low-order interleaved fashion, and P is an odd number not less than 3;
in step 2), the linear address of the operand is mapped to the two-dimensional conflict-free access address (X, Y) according to the following formula;
wherein, Y is the target memory bank holding the operand, X is the row address of the operand within the target memory bank, Addr is the linear address of the operand, W is the operand granularity, P is the number of memory banks addressed by low-order interleaving, mod denotes the modulo operation, and N is the sequence length of the FFT computation;
in step 3), the linear address of the operand is mapped to the three-dimensional conflict-free access address (X, Y, Z) according to the following formula;
wherein, Y is the target memory group holding the operand, Z is the position of the target memory bank within the target memory group, X is the row address of the operand within the target memory bank; Addr is the linear address of the operand, G is the SIMD width and is a positive integer power of 2, P is the number of memory banks addressed by low-order interleaving within each memory group, mod denotes the modulo operation, and N is the sequence length of the FFT computation.
CN201510137874.6A 2015-03-26 2015-03-26 Conflict-free memory access method for parallel FFT computation Active CN104699624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510137874.6A CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free memory access method for parallel FFT computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510137874.6A CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free memory access method for parallel FFT computation

Publications (2)

Publication Number Publication Date
CN104699624A CN104699624A (en) 2015-06-10
CN104699624B true CN104699624B (en) 2018-01-23

Family

ID=53346775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510137874.6A Active CN104699624B (en) 2015-03-26 2015-03-26 Conflict-free memory access method for parallel FFT computation

Country Status (1)

Country Link
CN (1) CN104699624B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748723B (en) * 2017-09-28 2020-03-20 中国人民解放军国防科技大学 Storage method and access device supporting conflict-free stepping block-by-block access
CN109635235B (en) * 2018-11-06 2020-09-25 海南大学 Triangular part storage device of self-conjugate matrix and parallel reading method
CN111158757B (en) * 2019-12-31 2021-11-30 中昊芯英(杭州)科技有限公司 Parallel access device and method and chip
CN112163187B (en) * 2020-11-18 2023-07-07 无锡江南计算技术研究所 Ultra-long point high-performance FFT (fast Fourier transform) computing device
CN112822139B (en) * 2021-02-04 2023-01-31 展讯半导体(成都)有限公司 Data input and data conversion method and device
CN113094639B (en) * 2021-03-15 2022-12-30 Oppo广东移动通信有限公司 DFT parallel processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172529A1 (en) * 2007-01-17 2008-07-17 Tushar Prakash Ringe Novel context instruction cache architecture for a digital signal processor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290613A (en) * 2007-04-16 2008-10-22 卓胜微电子(上海)有限公司 FFT processor data storage system and method
CN101339546A (en) * 2008-08-07 2009-01-07 那微微电子科技(上海)有限公司 Address mappings method and operand parallel FFT processing system
CN102508802A (en) * 2011-11-16 2012-06-20 刘大可 Data writing method based on parallel random storages, data reading method based on same, data writing device based on same, data reading device based on same and system
CN103116555A (en) * 2013-03-05 2013-05-22 中国人民解放军国防科学技术大学 Data access method based on multi-body parallel cache structure

Also Published As

Publication number Publication date
CN104699624A (en) 2015-06-10

Similar Documents

Publication Publication Date Title
CN104699624B (en) Conflict-free memory access method for parallel FFT computation
CN106875013B (en) System and method for multi-core optimized recurrent neural networks
CN102375805B (en) Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)
CN104572295B (en) It is matched with the structured grid data management process of high-performance calculation machine architecture
US20220360428A1 (en) Method and Apparatus for Configuring a Reduced Instruction Set Computer Processor Architecture to Execute a Fully Homomorphic Encryption Algorithm
CN103049241A (en) Method for improving computation performance of CPU (Central Processing Unit) +GPU (Graphics Processing Unit) heterogeneous device
CN103970718A (en) Quick Fourier transformation implementation device and method
CN101847137B (en) FFT processor for realizing 2FFT-based calculation
CN102495721A (en) Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration
Xiao et al. Reduced memory architecture for CORDIC-based FFT
US20180373677A1 (en) Apparatus and Methods of Providing Efficient Data Parallelization for Multi-Dimensional FFTs
CN107391439B (en) Processing method capable of configuring fast Fourier transform
CN103034621B (en) The address mapping method of base 2 × K parallel FFT framework and system
Kasagi et al. An optimal offline permutation algorithm on the hierarchical memory machine, with the GPU implementation
CN102567283B (en) Method for small matrix inversion by using GPU (graphic processing unit)
KR101696987B1 (en) Fft/dft reverse arrangement system and method and computing system thereof
CN104504205A (en) Parallelizing two-dimensional division method of symmetrical FIR (Finite Impulse Response) algorithm and hardware structure of parallelizing two-dimensional division method
CN104050148A (en) FFT accelerator
Nakano Asynchronous memory machine models with barrier synchronization
Sorokin et al. Conflict-free parallel access scheme for mixed-radix FFT supporting I/O permutations
CN107305486A (en) A kind of neutral net maxout layers of computing device
EP4011030A1 (en) Configuring a reduced instruction set computer processor architecture to execute a fully homomorphic encryption algorithm
CN115544438A (en) Twiddle factor generation method and device in digital communication system and computer equipment
Wu et al. Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization
Shao et al. Processing grid-format real-world graphs on DRAM-based FPGA accelerators with application-specific caching mechanisms

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant