WO2005052808A1 - Pipelined FFT processor with memory address interleaving - Google Patents

Pipelined FFT processor with memory address interleaving

Info

Publication number
WO2005052808A1
Authority
WO
WIPO (PCT)
Prior art keywords
butterfly
samples
series
memory
fft
Prior art date
Application number
PCT/CA2004/002049
Other languages
English (en)
Inventor
Sean G. Gibb
Peter J. W. Graumann
Original Assignee
Cygnus Communications Canada Co.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CA 2451167 (CA2451167A1)
Application filed by Cygnus Communications Canada Co.
Publication of WO2005052808A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm

Definitions

  • the present invention generally relates to fast Fourier transform (FFT) processors. More particularly, the present invention relates to pipelined FFT processors with modified butterfly units.
  • FFT fast Fourier transform
  • the fast Fourier transform (FFT) implementation of the discrete Fourier transform (DFT) is an important block in many digital signal processing applications, including those which perform spectral analysis or correlation analysis.
  • the purpose of the DFT is to compute the sequence $\{X(k)\}$, having N complex-valued numbers, given another sequence $\{x(n)\}$ also of length N, where $X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}$ for $0 \le k \le N-1$, with $W_N = e^{-j 2\pi / N}$.
  • the computation of the DFT is performed by decomposing the DFT into a sequence of nested DFTs of progressively shorter lengths. This nesting and decomposition is repeated until the DFT has been reduced to its radix. At the radix level, a butterfly operation can be performed to determine a partial result which is provided to the other decompositions. Twiddle factors, which are used to perform complex rotations during the DFT calculation, are generated as the divide-and-conquer algorithm proceeds. For a radix-2 decomposition, a length-2 DFT is performed on the input data sequence
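  • The decomposition described above can be sketched in a few lines of Python. This is an illustrative software model, not the patented hardware: it assumes a decimation-in-frequency split (sum on the upper branch, twiddled difference on the lower branch) and a power-of-two length. The function names `dft` and `fft_dif` are hypothetical and are only used here to check the recursion against the direct definition.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: X(k) = sum_n x(n) * W_N^(n*k), W_N = exp(-2j*pi/N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT (length must be a power of 2)."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    half = n_pts // 2
    # Radix-2 butterflies: plain sum on the upper branch,
    # difference rotated by the twiddle factor W_N^i on the lower branch.
    upper = [x[i] + x[i + half] for i in range(half)]
    lower = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n_pts)
             for i in range(half)]
    out = [0j] * n_pts
    out[0::2] = fft_dif(upper)   # even-indexed outputs X(0), X(2), ...
    out[1::2] = fft_dif(lower)   # odd-indexed outputs X(1), X(3), ...
    return out
```

  • Each recursion level corresponds to one butterfly stage of a pipeline; the recursion stops when the sub-transform length reaches 1, so a length-N input passes through log2(N) stages.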
  • FFT processors performing the above process are commonly implemented as dedicated processors in an integrated circuit. Many previous approaches have improved the throughput of FFT processors while balancing latency against the area requirements through the use of a pipeline processor-based architecture. In a pipeline processor architecture, the primary concern from the designer's perspective is increasing throughput and decreasing latency while attempting to also minimize the area requirements of the processor architecture when the design is implemented in a manufactured integrated circuit.
  • a common pipeline FFT architecture achieves these aims by implementing one length-2 DFT (also called a radix-2 butterfly) for each stage in the DFT recombination calculation. It is also possible to implement fewer than or more than one butterfly per recombination stage. However, in a real-time digital system, it is sufficient to match the computing speed of the FFT processor with the input data rate. Thus, if the data acquisition rate is one sample per computation cycle, it is sufficient to have a single butterfly per recombination stage.
  • a brief review of pipeline FFT architectures in the prior art is provided below, in order to place the FFT processor of this invention into perspective. In this discussion, designs implementing the radix-2, radix-4 and more complex systems are described.
  • Input and output order is assumed to be the most appropriate form for the particular design. If a different order is required, an appropriate re-ordering buffer (consisting of both on-chip memory and control circuits) can be provided at the input or output of the pipeline FFT, which is noted as a "cost" of implementation as that adds complexity or uses additional area on chip. FFT implementations that accept in-order input data are most suitable for systems where data is arriving at the FFT one sample at a time. This includes systems such as wired and wireless data transmissions systems. Out-of-order input handling is most appropriate when the input data is buffered and can be pulled from the buffer in any order, such as in an image analysis system.
  • the R2MDC approach breaks the input sequence into two parallel data streams.
  • a commutator 102 receives the data stream as input and delays half of the data stream with memory 104. The delayed data is then processed with the second half of the data stream in a radix-2 butterfly unit 106. Part of the output of the butterfly unit 106 is delayed by buffering memory 108 prior to being sent to the next butterfly module. In each subsequent butterfly module the size of both memory 104 and 108 are halved.
  • the processor of Figure 1 implements a 16-point R2MDC pipeline FFT. In terms of efficiency of design, the multipliers and adders in the R2MDC architecture are 50% utilized.
  • the R2MDC architecture requires 3N/2 - 2 delay registers.
  • a Radix-4 Multi-path Delay Commutator (“R4MDC”) pipeline FFT is a radix-4 version of the R2MDC, where the input sequence is broken into four parallel data streams.
  • the R4MDC architecture's multipliers and adders are 25% utilized, and the R4MDC designs require 5N/2 - 4 delay registers.
  • An exemplary 256-point R4MDC pipeline implementation is shown in Figure 2.
  • the FFT processor of Figure 2 is composed of butterfly modules, such as butterfly module 110.
  • Butterfly module 110 includes commutator 112 with an associated memory 114, butterfly unit 116 and an associated memory 118.
  • the commutator 112 orders samples and stores them in memory 114.
  • R2SDF Radix-2 Single-path Delay Feedback
  • FIG. 3 shows the basic architecture of a prior art R2SDF for a 16-point FFT.
  • a butterfly module is composed of the radix-2 butterfly unit, such as butterfly unit 120, and its associated feedback memory 122.
  • the size of the memory 122a-122d in a butterfly module varies with the position of the module in the series.
  • Butterfly unit 120 receives an input series of 16 samples, and buffers the first 8 samples in feedback memory 122a. Starting with the ninth sample in the series, butterfly unit 120 serially pulls the stored samples from feedback memory 122a and performs butterfly operations on the pair-wise samples. The in order output is provided to the next butterfly module by storing out of order outputs in the feedback memory 122a until they can be provided in order.
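  • The buffering behaviour of a single delay-feedback stage can be modelled as below. This is a simplified sketch that assumes one block is processed at a time and omits twiddle-factor multiplication; the function name `sdf_stage` is hypothetical.

```python
def sdf_stage(block):
    """One single-path delay-feedback stage on a block of 2*D samples.

    Twiddle multiplication is omitted; only the buffering pattern is modelled.
    """
    depth = len(block) // 2
    feedback = list(block[:depth])     # first half is buffered in feedback memory
    out = []
    for i in range(depth):
        a, b = feedback[i], block[depth + i]
        out.append(a + b)              # the sum leaves the stage immediately
        feedback[i] = a - b            # the difference goes back into the memory
    out.extend(feedback)               # differences drain while the next block loads
    return out
```

  • For a 16-sample block the feedback memory holds 8 entries, matching the description of feedback memory 122a; the sums appear first and the stored differences follow, so the output remains a single serial stream.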
  • a Radix-4 Single-path Delay Feedback (“R4SDF”) pipeline FFT is a radix-4 version of the R2SDF design.
  • the utilization of the multipliers increases to 75% in implementation, but the adders are only 25% utilized, while the design will require N-1 delay registers.
  • the memory storage is fully utilized.
  • a 256-point R4SDF pipeline example from the prior art is shown in Figure 4.
  • the structure of the processor of Figure 4 is similar to that of Figure 3, with butterfly modules being composed of a radix-4 butterfly unit, such as BF4 124, and an associated feedback memory 126.
  • the size of feedback memory 126 decreases from 126a-126d in accordance with the amount of separation required between samples.
  • The butterfly modules of Figure 4 function in the same fashion as those of Figure 3, with additional samples being stored in feedback memory 126 in each cycle.
  • a Radix-4 Single-path Delay Commutator (“R4SDC”) uses a modified radix-4 algorithm to achieve 75% utilization of multipliers, and has a memory requirement of 2N-2.
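  • The delay-register counts quoted for the prior-art architectures can be compared numerically. The sketch below only tabulates the expressions stated in the text, reading "3/2 N-2" as 3N/2 - 2 and "5/2 N-4" as 5N/2 - 4 (an assumption about the intended grouping); the function name is hypothetical.

```python
def delay_registers(n_pts):
    """Delay-register counts quoted in the text, as functions of FFT length N."""
    return {
        "R2MDC": 3 * n_pts // 2 - 2,
        "R4MDC": 5 * n_pts // 2 - 4,
        "R4SDF": n_pts - 1,
        "R4SDC": 2 * n_pts - 2,
    }

# Example: the 256-point case used in Figures 2, 4 and 5.
for name, count in delay_registers(256).items():
    print(f"{name}: {count} delay registers")
```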
  • a prior art 256-point R4SDC pipeline FFT is shown in Figure 5.
  • Figure 5 has single input single output butterfly modules, such as butterfly module 127.
  • In butterfly module 127 a single input is provided to commutator 128 which stores and reorders samples using an internal memory.
  • Commutator 128 provides the samples four at a time to radix four butterfly unit 129. The output of butterfly unit 129 is serially provided to the next butterfly module.
  • R2²SDF Radix-2² Single-path Delay Feedback
  • Butterfly modules are composed of butterfly units such as BF2I 130 and an associated feedback memory such as memory 131.
  • Butterfly unit 130 receives a series of input samples and buffers the first set of samples in memory 131, then performs pairwise butterfly operations using stored samples and the incoming series.
  • the operation of this processor is functionally similar to that of the processor of Figure 4 with the differences noted above.
  • a single path delay fast Fourier transform (FFT) processor for performing an FFT on a series of input samples organized as pairs.
  • the processor comprises a first butterfly unit, an interleaver and a second butterfly unit.
  • the first butterfly unit receives the series of input samples, and performs a first butterfly operation on each received pair of samples to provide a serial output.
  • the interleaver receives the serial output, and permutes samples in the serial output to provide a permutation as a pairwise series of samples.
  • the second butterfly unit serially receives the pairwise series of samples from the interleaver, and performs a second butterfly operation on each pair of samples in the pairwise series to obtain an output series of samples corresponding to an FFT of the series of input samples.
  • the second butterfly unit is a modified butterfly unit and includes a set of adders for receiving real and imaginary components of each sample, and for performing the second butterfly operation using the received real and imaginary components of each sample.
  • the first butterfly unit is a modified butterfly unit including a multiplexer and a set of adders. The multiplexer receives the series of input samples, swaps real and imaginary components of selected samples and provides the selectively swapped components as an output.
  • the set of adders performs the first butterfly operation using the selectively swapped components from the multiplexer.
  • the multiplexer is preferably controlled by a modulo counter to perform component swapping on one half of the input samples of the received series.
  • the processor includes a modified butterfly unit and a further interleaver.
  • the modified butterfly unit receives the series of input samples, and performs a modified butterfly operation on each received pair of samples to provide a serial output.
  • the further interleaver receives the serial output of the modified butterfly unit, permutes the samples in the serial output of the modified butterfly to provide the permuted samples as the input series to the first butterfly module.
  • the modified butterfly unit includes a multiplexer for selectively swapping real and imaginary components of the pairs of samples, a set of adders, for performing the modified butterfly operation using the selectively swapped components from the multiplexer, and a constant multiplier for selectively applying a constant twiddle factor to the result of the modified butterfly operation and for providing the selectively multiplied result to the further interleaver.
  • the first and second butterfly modules are both multiplierless butterfly units for performing butterfly operations on the received pairs of samples.
  • the interleaver includes an addressable memory for receiving and storing the serial output of the first butterfly module, and an address generator for generating memory addresses at which each result from the first butterfly can be stored.
  • the addressable memory is preferably sized to store one half of the serial output of the first butterfly module.
  • the interleaver preferably further includes a complete compressing permuter for providing the address generator with memory addresses for the first half of the serial output of the first butterfly module.
  • the complete compressing permuter preferably generates an address for the xth sample in accordance with the formula
  • the complete compressing permuter includes a compressing x permuter for determining an address in accordance with 2" + f(x mod 2""' ) and _ 2TM multiplexer for switching between the address determined by the compressing permuter and an address determined in accordance with the position of the sample in the serial output.
  • the address generator preferably includes a sequence permuter for shifting the address generated by the complete compressing permuter to prevent overwriting data not provided to the second butterfly unit.
  • the processor comprises a plurality of butterfly modules connected in series each having a memory for receiving a series of samples and an associated butterfly unit for performing butterfly operations on the series of samples in the memory, the first butterfly module in the plurality for receiving and storing the series of input samples in memory, the final butterfly module in the plurality for providing a butterfly operation output as a series of samples corresponding to an FFT of the series of input samples.
  • At least one of the plurality of butterfly modules has an interleaving memory for receiving and storing a series of samples, and for permuting the samples to obtain a pairwise series of samples, and for serially providing an associated butterfly unit with the permuted pairwise series.
  • the interleaving memory serially receives the series of samples from a previous butterfly module.
  • the present invention provides a novel real-time pipeline FFT processor design, optimized to reduce butterfly processor area, namely the multiplier and adder units in the butterfly implementation.
  • Figure 1 is a block diagram illustrating a radix-2 multipath delay commutator pipelined FFT processor of the prior art
  • Figure 2 is a block diagram illustrating a radix-4 multipath delay commutator pipelined FFT processor of the prior art
  • Figure 3 is a block diagram illustrating a radix-2 single path delay feedback pipelined FFT processor of the prior art
  • Figure 4 is a block diagram illustrating a radix-4 single path delay feedback pipelined FFT processor of the prior art
  • Figure 5 is a block diagram illustrating a radix-4 single path delay commutator pipelined FFT processor of the prior art
  • Figure 6 is a block diagram illustrating a radix-2² single path delay feedback pipelined FFT processor of the prior art
  • Figure 8 illustrates a general butterfly module of the present invention
  • the present invention provides an FFT processor architecture using a modified butterfly unit.
  • the modified butterfly unit can provide a reduction in the implementation area by maximizing utilization of components while removing unnecessary components.
  • An interleaving memory architecture is further provided by the present invention to allow for a further reduction in implementation area.
  • the FFT processor of the present invention uses an interleaving memory structure to receive samples out of order, and to permute them so that they are provided to the butterfly unit in the required order. This reduces the memory requirement for the butterfly unit.
  • the interleaver is preferably used to connect two butterfly units, so that it receives out of order samples from one unit and provides in order samples to the other.
  • the first butterfly unit receives a series of input samples organized as pairs, and performs a butterfly operation on each pair, providing the output to the interleaver.
  • the second butterfly unit serially receives pairs of samples from the interleaver, performs a butterfly operation on the pairs of samples, and provides as an output, a series of samples corresponding to the FFT of the series of input samples.
  • the present invention provides an FFT processor having a plurality of serially connected butterfly modules. Each butterfly module receives the output of the previous module, with the first module receiving the input series of samples. The final butterfly module provides its output as a series of samples corresponding to an FFT of the series of input samples.
  • At least one of the butterfly modules in the plurality includes an interleaving memory which receives samples out of order, and provides them to the associated butterfly unit in the required order.
  • the present invention can best be understood through a cursory examination of the data flows of an FFT and understanding the implications of these data flows in processor architecture.
  • the input sequence is segmented to restrict the pairings of samples.
  • the lower half of each butterfly is multiplied by twiddle factor $W^k$.
  • In stage 3, the final stage, only $W^0$ is applied as a twiddle factor.
  • In stage 2, either $W^0$ or $W^4$ is applied, and in stage 1 one of $W^0$, $W^2$, $W^4$ and $W^6$ is applied.
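  • The per-stage twiddle exponents listed above are consistent with a 16-point radix-2 decimation-in-frequency flow, which is the assumption made in the following sketch. The function name `twiddle_exponents` is hypothetical; it lists, for each stage, the exponents k of the twiddle factor that can occur.

```python
def twiddle_exponents(n_pts):
    """Exponents k of W_N^k used at each stage of a radix-2 DIF FFT of length N."""
    stages = []
    span = n_pts // 2                    # butterflies per group at the current stage
    while span >= 1:
        step = n_pts // (2 * span)       # exponent stride at this stage
        stages.append([i * step for i in range(span)])
        span //= 2
    return stages

for stage, exps in enumerate(twiddle_exponents(16)):
    print(f"stage {stage}: exponents {exps}")
```

  • For N = 16 the last three stages use only the exponents {0, 2, 4, 6}, {0, 4} and {0}, which is why those stages are candidates for simplified butterfly hardware.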
  • Both input values are also provided to adder 138, after input b is sign inverted, and the output of adder 138 is provided to multiplier 140, which multiplies the output by a twiddle factor $W^k$.
  • the present invention provides modified butterfly units based upon optimizations related to the twiddle factor values, $W^k$. These optimizations can reduce the physical implementation of the circuit embodying this form in the last stages of the FFT.
  • a functional block diagram of the implementation of a DIF FFT processor of the present invention is shown in Figure 9. As with previous FFT processors, the FFT processor of Figure 9 is implemented as a series of stages, each stage corresponding to a butterfly module.
  • the final stage of the processor is provided by butterfly module 142, the penultimate stage by butterfly module 144 and the third last stage by butterfly module 146.
  • the butterfly module 146 is optionally preceded by a plurality of butterfly modules 148, the number selected in accordance with the length of the FFT that is to be computed.
  • the initial butterfly unit 150 is preceded by the source 152. It is assumed that the source provides the input series of samples in the order required by BF2n 150.
  • two basic units are employed: butterfly units 154, 158, 162 and 166 respectively, and interleaver memories 156, 160, 164, and 168.
  • An interleaver memory is also referred to as a permuter, as it has a single input and the interleaving of a single channel is functionally equivalent to the permutation of the channel contents. Due to the use of permuters, the architecture of Figure 9 is referred to herein as a Radix-2 Single-path Delay Permuter ("R2SDP") design.
  • R2SDP Radix-2 Single-path Delay Permuter
  • the system of Figure 9 provides three modified butterfly modules 142, 144 and 146, connected in series. Each of the modified butterfly modules includes an interleaving memory for receiving the output of the previous stage and for permuting the received output into the order required for the associated modified butterfly unit.
  • three modified butterfly units BF2
  • These three modified butterfly modules are optionally preceded by a series of general butterfly modules 148 and a butterfly unit 150 that receives the input sequence.
  • preceding the modified butterfly modules by other butterfly modules allows for longer length FFTs to be computed.
  • Interleaver memory units 156, 160, 164 and 168 are also included in the butterfly modules 142, 144, 146 and 148 respectively.
  • the interleaver memory units are named using the nomenclature $I_{r \times n}$, where r is the radix of the interleaver (in this example, 2) and n is the number of values interleaved in a single operation. Note that n may take a value between 2 in the first stage's interleaver and N in the last stage's interleaver. The actual memory requirement for a memory interleaver stage is n/2. Larger FFTs simply have additional BF2n butterflies and memory interleaver units (each requiring twice as much storage as the previous interleaver). For the purpose of this disclosure, the data acquisition rate is assumed to be one sample per cycle.
  • butterfly unit 154 requires four registers (two registers per input, allowing storage of the real and imaginary components of a sample) and two adder units.
  • An exemplary implementation of butterfly unit 154 is provided in Figure 10. The description of Figure 10 is best understood in combination with the signal timing diagram of Figure 11 which is also used to illustrate the utilization of the hardware components of the embodiment of Figure 10.
  • registers R0 170 and R1 174 receive the real and imaginary components of the i-th sample respectively.
  • registers R2 172 and R3 176 receive the real and imaginary components of the (i+1)-th sample respectively.
  • adder A0 178 sums the contents of register R0 170 and the real component of the (i+1)-th sample while adder A1 180 sums the contents of register R1 174 and the imaginary component of the (i+1)-th sample.
  • adder A0 178 takes the difference between the contents of registers R0 170 and R2 172
  • Adder A1 180 takes the difference between the contents of registers R1 174 and R3 176.
  • all registers 170, 172, 174 and 176 are emptied, and the (i+2)-th sample arrives for storage in registers R0 170 and R1 174.
  • the butterfly operation preferably provides the output of the butterfly operation on the two samples in 2 clock cycles to maintain timing and data flow.
  • butterfly module 144 can also have a modified butterfly unit
  • FIG. 12 illustrates an exemplary embodiment of the modified butterfly BF2II 158.
  • BF2II 158 operates in two modes, one for each of the coefficients. In the first mode, the circuit behaves exactly as BF2
  • BF2II 158 has the same hardware requirements and utilization as in the multiplierless radix-2 butterfly (four registers and two adder units). However, to permit the real-imaginary component swapping required, additional multiplexers are provided on the four adder inputs in order to steer signals to perform the real-imaginary swap when the coefficient -j is applied.
  • a signal diagram in Figure 13 shows the signal characteristics of the R2SDP BF2II butterfly with multiplication by both coefficients.
  • the operation of the butterfly unit of Figure 12 is best illustrated in conjunction with the timing diagram of Figure 13.
  • the butterfly receives the real and imaginary components of the i-th sample and stores them respectively in registers R0 182 and R1 186.
  • registers R2 184 and R3 188 receive the real and imaginary components of the (i+1)-th sample.
  • adder A0 190 sums the contents of R0 182 and the real component of the (i+1)-th sample
  • adder A1 192 sums the contents of R1 186 with the imaginary component of the (i+1)-th sample.
  • Adder A0 190 takes the difference between the contents of R0 182 and R2 184
  • Adder A1 192 takes the difference between the contents of R1 186 and R3 188.
  • the butterfly operation is achieved without the use of a dedicated multiplier through the use of sign and component inversion.
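  • Multiplication by the coefficient -j needs no multiplier because (re + j·im)·(-j) = im - j·re, that is, a swap of the two components plus one sign inversion. The sketch below illustrates this arithmetic on (real, imaginary) tuples; it is a software illustration rather than a model of the register and multiplexer timing, and the function names are hypothetical.

```python
def times_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: swap the components and negate one."""
    return im, -re

def butterfly_minus_j(a, b):
    """Radix-2 butterfly whose lower output is multiplied by -j, using adds only."""
    (ar, ai), (br, bi) = a, b
    upper = (ar + br, ai + bi)                 # a + b
    lower = times_minus_j(ar - br, ai - bi)    # (a - b) * (-j)
    return upper, lower
```

  • In hardware the swap can be realized by steering which register feeds which adder input, which is the role the text assigns to the multiplexers on the adder inputs.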
  • the contents of registers R0 182, R1 186, R2 184 and R3 188 are emptied to receive the next pairwise samples.
  • the multiplexer control can then be handled by a simple modulus-N/2 counter.
  • the butterfly unit of the present invention is preferably preceded by an interleaver that groups data samples together so that all samples requiring a particular twiddle factor are provided to the butterfly unit in a continuous block.
  • BF2n 166 is a general purpose butterfly unit.
  • This optionally implemented butterfly unit is used in the FFT processor of Figure 9, in conjunction with properly sized interleavers, such as interleaver 168 to form the general purpose butterfly module 148, which is added to the FFT processor illustrated in Figure 9 to allow for processing larger FFTs.
  • the same general butterfly unit is implemented as BF2n 150, as described in Figure 8, which receives the input sequence of samples from a source 152.
  • BF2n 150 performs a single complex multiplication during each operation.
  • a complex multiplication is comprised of four real multiplications and two real additions. Since data is being provided at one sample per clock cycle and a radix-2 butterfly requires two samples, two clock cycles are available to complete the complex multiplication and hence two real multipliers and a real adder are sufficient to the task of ensuring the one sample per clock cycle design assumption or criteria is met. As with the previously disclosed two modified butterfly units, BF2II 158 and BF2
  • adder A0 sums the contents of R0 and the real component of the (i+1)-th input
  • A1 sums the contents of R1 and the imaginary component of the (i+1)-th input
  • A2 computes the difference between the contents of R0 and the real component of the (i+1)-th input.
  • Multiplier M0 computes the product of the output of A2 and C(i/2), while M1 computes the product of the output of A2 and S(i/2).
  • R0 receives the imaginary component of the (i+2)-th input, R2 receives the output of M0, while R3 receives the output of M1.
  • the real component of the output is A0, while the imaginary component of the output is A1.
  • adder A0 takes the difference between the contents of R2 and M1
  • A1 sums the contents of R3 and M0
  • A2 takes the difference between the contents of R1 and R0.
  • M0 and M1 take the same products that they did before, but with the new A2 contents.
  • R0 and R1 receive the real and imaginary components of the (i+2)-th sample.
  • the real and imaginary outputs of the butterfly unit are A0 and A1 respectively.
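  • The two-cycle schedule above can be read as a complex multiplication spread over two real multipliers. The sketch below assumes that C and S denote the real and imaginary parts of the coefficient and that `dr` and `di` are the real and imaginary parts of the butterfly difference; these names and the function name are hypothetical, and the register-level timing is simplified.

```python
def complex_multiply_two_cycles(dr, di, c, s):
    """(dr + j*di) * (c + j*s) with two real multipliers used over two cycles."""
    # Cycle 1: both multipliers work on the real part of the difference,
    # and the products are held in registers (R2 and R3 in the text).
    r2 = dr * c          # M0 -> R2
    r3 = dr * s          # M1 -> R3
    # Cycle 2: both multipliers work on the imaginary part of the difference.
    m0 = di * c
    m1 = di * s
    # The adders combine the stored and fresh products.
    real = r2 - m1       # A0: R2 - M1
    imag = r3 + m0       # A1: R3 + M0
    return real, imag
```

  • Four real products and two real additions are still computed in total, but only two multipliers are instantiated because each is reused in the second cycle.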
  • adder A0 sums the contents of register R0 and the real component of the (i+3)-th input
  • A1 sums the contents of register R1 and the imaginary component of the (i+3)-th input
  • A2 takes the difference between the contents of register R0 and the real component of the (i+3)-th input.
  • Multiplier M0 computes the product of the contents of A2 and C(i/2+1) and M1 computes the product of the contents of A2 and S(i/2+1).
  • Register R0 receives the imaginary component of the (i+3)-th input, R2 receives the result of multiplier M0, and R3 receives the output of M1.
  • the real and imaginary components of the output signal are A0 and A1 respectively. From the flow diagram of Figure 7, the stage of the FFT performed by BF2III 162 requires four coefficients as defined by the equation:
  • the two multiplierless coefficients, as in the BF2II 158 butterfly, are present.
  • multiplication by the two additional complex coefficients can be implemented using an optimized single constant multiplier and a subtractor, rather than the two multipliers and adder-subtractor for the complex multiplication as in BF2n 150.
  • An implementation utilizing a single constant multiplier and a subtractor provides a simpler implementation with a reduced area.
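  • To see why one constant multiplier can be enough, assume the two additional coefficients are exp(-jπ/4) and exp(-j3π/4), which is what the exponents 2 and 6 would give in a 16-point transform (an assumption, since the defining equation is not reproduced in this text). Both coefficients have real and imaginary parts of magnitude 1/√2, so the product reduces to a sum or difference of the components scaled by that single constant. The function names below are hypothetical.

```python
import math

INV_SQRT2 = 1 / math.sqrt(2)   # the single constant shared by both coefficients

def times_w8_1(re, im):
    """(re + j*im) * exp(-j*pi/4) = ((re + im) + j*(im - re)) / sqrt(2)."""
    return INV_SQRT2 * (re + im), INV_SQRT2 * (im - re)

def times_w8_3(re, im):
    """(re + j*im) * exp(-j*3*pi/4) = ((im - re) - j*(re + im)) / sqrt(2)."""
    return INV_SQRT2 * (im - re), -INV_SQRT2 * (re + im)
```

  • Each output component needs one addition or subtraction followed by one multiplication by the same constant, so a general complex multiplier with two variable-coefficient multipliers is not required.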
  • the signal diagram of Figure 15 illustrates the operational requirements of a circuit required to implement BF2III 162.
  • One skilled in the art will appreciate that such a circuit can be implemented without undue experimentation. There are four different states, or operational modes, shown in Figure 15, one for each of the four coefficient multiplications that this butterfly must perform.
  • the coefficients are preferably ordered in a bit-reversed fashion because the input sequence will be coming into this stage in bit-reversed order.
  • these modes are clustered such that the butterfly unit will perform N/4 operations before switching to the next coefficient multiplication mode. This clustering can be achieved by the proper interleaving of the samples in interleaver 164.
  • registers R0 and R1 receive the real and imaginary components of the i-th sample.
  • W k ⁇ . This corresponds to the second and third clock cycles.
  • adder A0 sums the contents of R0 with the real component of the (i+1)-th input sample
  • A1 sums the contents of R1 and the imaginary component of the (i+1)-th input sample.
  • Registers R2 and R3 receive the real and imaginary components of the (i+1)-th sample respectively.
  • the real and imaginary components of the output are A0 and A1 respectively.
  • A0 takes the difference between the contents of R0 and R2
  • A1 takes the difference between R1 and R3.
  • R0 and R1 receive the real and imaginary components of the (i+2)-th input sample respectively.
  • adder A0 sums the contents of R0 with the real component of the (i+3)-th input sample
  • A1 sums the contents of R1 and the imaginary component of the (i+3)-th input sample.
  • Registers R2 and R3 receive the real and imaginary components of the (i+3)-th sample respectively.
  • A0 takes the difference between the contents of R1 and R3, while A1 takes the difference between R2 and R0.
  • R0 and R1 receive the real and imaginary components of the (i+4)-th input sample respectively.
  • adder A0 sums the contents of R0 with the real component of the (i+5)-th input sample
  • A1 sums the contents of R1 and the imaginary component of the (i+5)-th input sample
  • A2 takes the difference between contents of R0 with the real component of the (i+5)-th input sample.
  • Multiplier M0 multiplies the constant value by the contents of A2.
  • Register R0 receives the real component of the (i+5)-th input sample
  • R2 receives the output of M0.
  • Adder A0 sums the contents of R2 and M0
  • A1 takes the difference between the contents of M0 and R2
  • A2 takes the difference between R1 and R0.
  • Multiplier M0 multiplies the constant value by the contents of A2.
  • R0 and R1 receive the real and imaginary components of the (i+6)-th sample.
  • adder A0 sums the contents of R0 with the real component of the (i+7)-th input sample
  • A1 sums the contents of R1 and the imaginary component of the (i+7)-th input sample
  • A2 takes the difference between the real component of the (i+7)-th input sample and the contents of R0.
  • Multiplier M0 multiplies the constant value by the contents of A2.
  • Register R0 receives the real component of the (i+7)-th input sample, and R2 receives the output of M0.
  • Adder A0 takes the difference of the contents of R2 and M0
  • A1 sums the contents of M0 and R2
  • A2 takes the difference between R0 and R1.
  • Multiplier M0 multiplies the constant value by the contents of A2.
  • the architectures of the above described modified butterflies allow for an implementation in a reduced area as there has been a reduction in the number of components required. Furthermore, the reduction in the component count can be used to decrease the power consumption of the FFT processor in operation.
  • the coefficient clustering in an out-of-order input FFT reduces the switching requirements of the block, resulting in reduced power consumption for the FFT over in-order architectures.
  • the clustering is achieved by selection of an interleaver that provides samples to the butterfly unit in such an order that all pairs of samples requiring the same coefficient are provided as contiguous groups.
  • the interleaver architecture described in the following part was developed by considering the operation of the butterfly units, which accept a single complex input each clock cycle and generate a single complex output each clock cycle.
  • the output data for one stage is passed into a memory interleaver block, such as interleavers 156, 160 and 164, as shown in Figure 9, and after the appropriate memory storage period, is then removed and used by the next butterfly stage to perform the butterfly operation required.
  • the input to the FFT processor is assumed to come in bit-reversed form, so for instance the signal x(0) will arrive first, followed by the signal x(8).
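The bit-reversed arrival order can be illustrated with a short Python sketch. The function name `bit_reversed_order` is hypothetical and a power-of-two transform length is assumed; it only lists the order in which sample indices would reach the processor.

```python
def bit_reversed_order(n):
    """Return the sample indices 0..n-1 in bit-reversed order (n a power of two)."""
    bits = n.bit_length() - 1  # number of index bits, e.g. 4 for n = 16
    order = []
    for i in range(n):
        # Reverse the fixed-width binary representation of i.
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        order.append(rev)
    return order


print(bit_reversed_order(16))
```

For a 16-point transform the list starts with 0 and 8, matching the statement that x(0) arrives first, followed by x(8).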
  • the timing diagram in Figure 16 shows the data flow of a 16-point FFT with signal timing information for an R2SDP FFT implementation. Note that each signal takes the general form x_s(t), where s is the signal's stage and t is the zero-based arrival time for that signal in its stage.
  • Stage 0 signals 1 cycle apart, such as x_0(0) and x_0(1), are combined in a butterfly to produce two results.
  • Stage 1 signals 2 cycles apart are combined and Stage 2 signals that are separated by 4 clock cycles are combined. This pattern of doubling the signal separation continues in the FFT until the final butterfly stage is reached, at which point a delay of N/2 cycles is required in order to perform the final butterfly.
  • M registers or RAM entries are required to generate the delay.
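The doubling of the signal separation from stage to stage can be sketched as follows. This is a minimal illustration, assuming a power-of-two transform length; `butterfly_separations` and `butterfly_pairs` are hypothetical helper names, and the pairing rule (arrival time t paired with t plus the stage separation) is an assumption drawn from the description above.

```python
def butterfly_separations(n):
    """Separation, in clock cycles, between the two inputs of each butterfly stage."""
    stages = n.bit_length() - 1  # log2(n) stages for a radix-2 FFT
    return [1 << s for s in range(stages)]


def butterfly_pairs(stage, n):
    """Arrival-time pairs (t, t + separation) combined in the given stage."""
    sep = 1 << stage
    return [(t, t + sep) for t in range(n) if (t // sep) % 2 == 0]


print(butterfly_separations(16))   # separations for a 16-point FFT
print(butterfly_pairs(0, 8))       # stage 0 pairs signals 1 cycle apart
print(butterfly_pairs(2, 8))       # stage 2 pairs signals 4 cycles apart
```

The last entry of the separation list is N/2, which is the delay the final interleaver has to provide.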
  • the I_{2xN/2} interleaver 160 would be an I_{2x8} memory interleaver block.
  • In designing an interleaver, several considerations must be taken into account.
  • One objective of the interleaver of the present invention is to avoid both large numbers of storage elements and complex memory addressing systems used to ensure that a storage element is not re-used until its contents have been read out.
  • the interleaver presented below reduces the number of required storage elements, or memory locations, to half the size of the data sequence length. Thus, 8 samples can be interleaved in the I_{2x8} using only 4 memory locations (assuming that each sample is sized to fit in one memory location).
  • a signal timing diagram for an I_{2x8} interleaver, such as interleaver 160, is shown in Figure 17.
  • the I_{2x8} memory interleaver 160 allows signals four clock cycles apart to be butterflied together by storing the first four signals that enter the interleaver and then interleaving these stored signals with the next four signals that enter the block.
  • In a general interleaver block, the first n/2 signals are stored and then are interleaved with the next n/2 signals.
  • the general input pattern of x_0, x_1, x_2, ..., x_{n/2-1}, x_{n/2}, x_{n/2+1}, ..., x_{n-1} is permuted to provide the interleaver output pattern of x_0, x_{n/2}, x_1, x_{n/2+1}, ..., x_{n/2-1}, x_{n-1}.
  • the input sequence x_0(0), x_0(1), x_0(2), x_0(3), x_0(4), x_0(5), x_0(6), x_0(7) is interleaved to produce the output sequence x_0(0), x_0(4), x_0(1), x_0(5), x_0(2), x_0(6), x_0(3), x_0(7).
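The permutation itself, independent of how it is realised in memory, can be written in a few lines of Python. The function name `interleave` is hypothetical, and the block length is assumed to be even.

```python
def interleave(block):
    """Interleave the first half of a block with its second half."""
    half = len(block) // 2
    out = []
    for first, second in zip(block[:half], block[half:]):
        out.extend((first, second))  # x_k followed by x_{k + n/2}
    return out


print(interleave(list(range(8))))
```

Applied to the arrival times 0 through 7, this reproduces the output order 0, 4, 1, 5, 2, 6, 3, 7 given above.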
  • the first four symbols are placed into memory locations determined by the sequential addresses 0, 1, 2, and 3 in the first four clock cycles. As the fifth input symbol arrives into the interleaver, the first input symbol, which was stored in address 0, is being read and removed.
  • x_0(4) can be placed into memory address 0, overwriting the now stale contents.
  • the memory is a dual port register file, having unique read and write ports.
  • the remaining three inputs, x_0(5) through x_0(7), are placed in memory locations as those locations become available.
  • the final input address pattern for the eight incoming signals is 0, 1, 2, 3, 0, 0, 1, 0.
  • the first symbol in the second set of input data, x_1(0), will need to go into the available memory location, which is address 2.
  • the remaining three entries for the first half of the input data will go into the available memory locations 1, 3, and 0.
  • the remaining four incoming data values, x_1(4) through x_1(7), will follow a similar pattern to the second half of the previous eight input values.
  • the resulting input address pattern for the second eight incoming values is 2, 1, 3, 0, 2, 2, 1, 2.
  • Although the third set of eight incoming values has a new order, the overall pattern is periodic and repeats every log_2 N input patterns. A sequence of n input data is broken into two distinct sequences in the interleaver.
  • the first n/2 input data values fill the available n/2 memory locations from the previous operation and the second n/2 input values fill the available n/2 memory locations from the current operation.
  • These two sets of n/2 input data are interleaved together performing a single interleave operation that produces one output symbol per cycle to match the data rate of the incoming data stream.
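The address patterns quoted above can be reproduced with a small behavioural simulation. This is only a sketch under stated assumptions: each incoming sample is written to the location that is read out in the same cycle, the first n/2 samples fill addresses sequentially, and the bookkeeping dictionary `location` is an illustration device rather than part of the described hardware. The function name `trace_write_addresses` is hypothetical.

```python
def trace_write_addresses(n, num_blocks):
    """Simulate an n/2-entry interleaver memory and record every write address."""
    half = n // 2
    # Position read out at each step of a block: x_0, x_{n/2}, x_1, x_{n/2+1}, ...
    out_order = [(k // 2) + (half if k % 2 else 0) for k in range(n)]
    location = {}          # (block, index) -> address currently holding that sample
    write_addresses = []
    reads_done = 0         # number of samples already read out
    for t in range(n * num_blocks):
        block, idx = divmod(t, n)
        if t < half:
            addr = t       # initial fill: sequential addresses
        else:
            out_block, k = divmod(reads_done, n)
            # Read the next sample of the permuted order; its slot is now free.
            addr = location.pop((out_block, out_order[k]))
            reads_done += 1
        location[(block, idx)] = addr
        write_addresses.append(addr)
    return write_addresses


trace = trace_write_addresses(8, 4)
for b in range(4):
    print(trace[8 * b: 8 * (b + 1)])
```

For n = 8 the first two printed rows match the two address patterns described above, and the fourth row repeats the first, consistent with a period of log_2 8 = 3 input patterns.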
  • the addresses of the second half of the input data, relative to the addresses filled in the first half of the operation, follow a very distinct pattern. In order to observe this result, consider the first memory interleaving operation described above (i.e. 0, 1, 2, 3, 0, 0, 1, 0).
  • the addresses of the second half of the input data can also be described in terms of their position relative to previous inputs.
  • the signals x_0(4), x_0(5), x_0(7) go into the memory position of the original input signal x_0(0).
  • the signal x_0(6) goes into the memory position of the original input signal x_0(1).
  • the same behavior is observed in the second set, and all remaining sets, of input data.
  • the first four inputs of the second input data set, x_1(0) through x_1(3), can be compared with the first four inputs of the first input data set, x_0(0) through x_0(3).
  • Signal x_1(0) follows signal x_0(2);
  • signal x_1(1) follows signal x_0(1), and so forth.
  • the I_{2x8} memory interleaver can be extended to length M patterns for an I_{2xM} memory interleaver.
  • the addresses used by the interleaver are described by the sequence 0, 0, 1, 0, 2, 1, 3, 0, ... which appears in Sloane's Encyclopedia of Integer Sequences as sequence A025480. This sequence is described by the
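As a hedged illustration, the terms of A025480 can be generated from the recurrence a(2j) = j, a(2j+1) = a(j), which is a standard characterisation of that sequence and is not necessarily the expression used in the source. The function name `a025480` is chosen here for convenience.

```python
def a025480(k):
    """k-th term (zero-based) of 0, 0, 1, 0, 2, 1, 3, 0, ...

    Uses the recurrence a(2j) = j and a(2j + 1) = a(j).
    """
    while k % 2:      # odd index: a(2j + 1) = a(j)
        k //= 2
    return k // 2     # even index: a(2j) = j


print([a025480(k) for k in range(16)])
```

The first four terms (0, 0, 1, 0) are the relative positions taken by the second half of an eight-sample block, and the next four (2, 1, 3, 0) are the addresses taken by the first half of the following block.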
  • log_2 N sequence permuters 198a-198c are then used to handle the placement of the first n/2 signals by offsetting the output of compressing permuter 194 to account for the transition pattern.
  • Each of the sequence permuters 198a-c offsets the value of the previous permutation thereby allowing successively later input permutation sequences to be placed in the correct memory location (without overwriting unused data).
  • the output of the complete compressing permuter 200 is fed directly into the first sequence permuter 198a.
  • the addition of the term to the compressing permuter allows the data to be set up such that the sequence permuters will produce the correct results across all input signal values x.
  • Complete compressing permuter 200 uses multiplexer 196 to switch between the two states, and is described in more detail below.
  • The addition of 2^{m-1} and the input value to the sequence generating equation is the same as looking forward in the generated sequence by N/2 values. In terms of the previous example, this permuter generates the sequence 2, 1, 3, 0, which is the address translation sequence described above.
  • the address generator 193 requires m serially connected sequence generators in order to produce the output pattern for all sequences until repetition is encountered.
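The following Python sketch is a behavioural model of the address generation, not the circuit of the figures. It assumes that the compressing step yields the relative position from the sequence 0, 0, 1, 0, 2, 1, 3, 0, ... and that each sequence permuter applies the look-ahead-by-n/2 translation once; the serially connected permuters are modelled as a loop. The names `a025480` and `write_address` are hypothetical.

```python
def a025480(k):
    """k-th term of 0, 0, 1, 0, 2, 1, 3, 0, ... via a(2j) = j, a(2j + 1) = a(j)."""
    while k % 2:
        k //= 2
    return k // 2


def write_address(ctr, n):
    """Memory address for the sample with running count ctr, for block length n."""
    half = n // 2
    block, i = divmod(ctr, n)
    # First block: the first half is sequential, the second half reuses freed slots.
    addr = i if i < half else a025480(i - half)
    # Every later block applies the address translation (look ahead by n/2) once more.
    for _ in range(block):
        addr = a025480(half + addr)
    return addr


pattern = [write_address(c, 8) for c in range(32)]
for b in range(4):
    print(pattern[8 * b: 8 * (b + 1)])
```

For n = 8 the rows printed for the first and second blocks are the two address patterns given earlier, and the fourth block repeats the first.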
  • the equation below describes the remainder of the address generation
  • This implementation is preferably connected between an address counter and the address lines of a memory unit, such as a dual-port register file.
  • the compressing permuter 194 and the following multiplexer 196 implement the equation c_m(x) and form complete compressing permuter 200.
  • the output of complete compressing permuter 200 serves as the input to the remaining blocks as seen in the equation for p_m(x), where r_m(x) implements the sequence permuters 198a-c and final multiplexer 202 of the complete sequence permuter 204.
  • In the complete sequence permuter 204 there are m sequence permuters 198a-c, each of which implements the equation for s_m(x).
  • Shifters 214, 216 and 218 each receive 0 and 1 as inputs to the 1 and 0 data ports of a first multiplexer. The number of multiplexers in each shifter increases from shifter 214 having 1 multiplexer to shifter 218 having m multiplexers.
  • FIG. 21 illustrates the use of the address generator 193 of the present invention in an interleaver memory such as interleavers 156, 160, 164 and 168.
  • the interleaver contains both interleaver controller 220 and a plurality of memory cells, or storage elements 222.
  • Interleaver controller 220 determines a storage address for each incoming sample, reads out the data in the storage address, and sends the received sample to the determined storage address.
  • Interleaver controller 220 includes address generator 193, which is preferably implemented as described above, and multiplexers 232 and 234. Multiplexer 232 receives the samples from the input channel, and routes them to one of the plurality of memory elements 222 in accordance with the address generated by address generator 193. Multiplexer 234 receives the same address from address generator 193, and reads out the data stored in the addressed memory element. Thus, address generator 193 not only generates the addresses to which data is saved, but also generates the addresses from which data is read, which allows the output channel to transmit the permuted sequence.
  • Address generator 193 has as an input ctr[], which allows for synchronization with the input sequence of samples. By using this configuration it is possible to reduce the number of memory elements to n/2.
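A minimal software model of the controller and storage elements described above might look like the following. It is a sketch under assumptions: the class name `Interleaver`, its `push` method, and the helper functions are hypothetical, and the address function is the behavioural model based on the sequence 0, 0, 1, 0, 2, 1, 3, 0, ... rather than the gate-level generator. The same generated address is used to read the old contents and to store the new sample.

```python
def a025480(k):
    """k-th term of 0, 0, 1, 0, 2, 1, 3, 0, ... via a(2j) = j, a(2j + 1) = a(j)."""
    while k % 2:
        k //= 2
    return k // 2


def address(ctr, n):
    """Behavioural stand-in for the address generator, driven by the counter ctr."""
    half = n // 2
    block, i = divmod(ctr, n)
    addr = i if i < half else a025480(i - half)
    for _ in range(block):
        addr = a025480(half + addr)
    return addr


class Interleaver:
    """n/2 storage elements plus a counter-driven address generator."""

    def __init__(self, n):
        self.n = n
        self.memory = [None] * (n // 2)
        self.ctr = 0

    def push(self, sample):
        addr = address(self.ctr, self.n)
        out = self.memory[addr]      # read the old contents (None during initial fill)
        self.memory[addr] = sample   # store the new sample at the same address
        self.ctr += 1
        return out


unit = Interleaver(8)
print([unit.push(s) for s in range(16)])
```

After the initial four fill cycles, the outputs appear in the permuted order 0, 4, 1, 5, 2, 6, 3, 7 and continue into the next block, while only four storage elements are in use.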
  • Figure 22 illustrates a method of interleaving according to the present invention.
  • In step 240, a predetermined number of samples are received and stored in the memory.
  • n/2 samples are stored, and the capacity of the memory is n/2 to achieve 100% utilization of the allocated resources; however, one skilled in the art will appreciate that the number of stored elements is determined by the maximum distance between two input samples that are adjacent in the permuted output sequence.
  • Because the present invention receives the input sequence x_0, x_1, x_2, ..., x_{n/2-1}, x_{n/2}, x_{n/2+1}, ..., x_{n-1} and permutes it to provide the interleaver output pattern of x_0, x_{n/2}, x_1, x_{n/2+1}, ..., x_{n/2-1}, x_{n-1}, the maximum distance is n/2, though other permuter patterns would have greater or smaller distances.
  • the first n/2 samples are stored in sequential memory addresses, so that the first sample x 0 would be stored in memory address 0, as shown in the timing diagram of Figure 17.
  • In step 242, the address of the memory element storing the first sample in the permuted sequence is determined.
  • In steps 244 and 246, the contents of the memory element at the determined address are read out and replaced with a newly received sample.
  • In step 248, the address of the next sample in the permuted sequence is determined, and the process returns to step 244.
  • Although n/2 samples are initially stored in step 240, the actual number of samples that has to be stored is determined by the maximum distance between samples and the permuted output order of the samples.
  • a single dual port memory is used in the interleaver along with two address generators. The first address generator is used to determine the address to which data will be written, while the second generator is used to determine the address from which data is read out. This allows the system to continue reading out data while the input sequence has gaps. When the input sequence has a gap, the input data stops and no new data is stored in the interleaver. This will result in the generated input address diverging from the output addresses because there is no progress made on the input addresses, while the output addresses are still generated and read out from.
  • a connection from the write counter (ctr) into the read controller is required.
  • the read controller can then use this signal to determine if data is available for reading (i.e. by comparing the write ctr to the read ctr).
  • the write controller writes data every time data is presented to it.
  • the read controller monitors the amount of data that has been written and begins reading when the first n/2 samples have been written. At this point the read is driven by the input data presentation; however, once the full n samples have been written to the memory unit, the read controller continuously dumps the presented output data regardless of whether input data is presented or not.
  • Such an embodiment can be implemented using two address generators 193, as described above, one for the read address generator and one for the write address generator.
  • the two address generators 193 would be connected to each other, so that the read controller can determine if data is available, either by determining that the required sample has been stored, or that a complete n samples have been stored.
  • Such an interleaver architecture allows the write address generator to determine a storage address for a received sample, while the read address generator determines the storage address associated with the next output sample.
  • the connection between the two address generators allows a comparison of the read and write counters to allow the write address generator to avoid overwriting valid data, while allowing the read address generator to determine which addresses contain valid data to allow for reading out the memory addresses in the correct order.
  • the interleaver of the present invention can be used in a number of other environments. Due to its ability to group samples, and its reduced memory requirement, the above described interleaver is applicable to, but not limited to, use in other discrete transform applications, such as z-transform processors and Hadamard transform processors.
  • A comparison of the hardware requirements of the prior art pipeline processor FFT architectures is shown in Table 1. In order to ease comparisons of radix-2 with radix-4 architectures, all values in Table 1 have been listed using the base-4 logarithm. The results show that the R2SDP architecture of this invention reduces the requirement for complex multipliers, complex adders, and memory allocation with out-of-order input data.
  • the memory size doubles in order to implement a buffer to generate the bit-reversed data sequence for the FFT processor.
  • the address generation scheme of the R2SDP design is more complex than a simple R2SDF or R2MDC implementation, however the requirements for the rest of the system are significantly smaller than those two implementations, offsetting the area and cost of the extra controls.
  • R4SDP radix-4 implementation
  • a radix-8 design following this invention can achieve a reduced multiplier count of 66% of that described for the radix-2 design by further reducing redundant multiplications.

Abstract

This invention relates to an FFT processor that uses a single delay path and a permuter, and that provides a reduction in implementation area and a corresponding reduction in power consumption through efficiencies obtained by modifying a butterfly unit and by using a novel interleaver. The modified butterfly unit is obtained by removing complex variable multipliers, which is possible because of the simplification of the twiddle factors in the stages that correspond to the modified butterfly unit.
PCT/CA2004/002049 2003-11-26 2004-11-26 Pipelined FFT processor with memory address interleaving WO2005052808A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US52487903P 2003-11-26 2003-11-26
US60/524,879 2003-11-26
CA 2451167 CA2451167A1 (fr) 2003-11-26 Pipelined FFT processor allowing memory address interleaving
CA2,451,167 2003-11-26

Publications (1)

Publication Number Publication Date
WO2005052808A1 true WO2005052808A1 (fr) 2005-06-09

Family

ID=34634832

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CA2004/002048 WO2005052798A1 (fr) 2003-11-26 2004-11-26 Interleaving memory
PCT/CA2004/002049 WO2005052808A1 (fr) 2003-11-26 2004-11-26 Pipelined FFT processor with memory address interleaving

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CA2004/002048 WO2005052798A1 (fr) 2003-11-26 2004-11-26 Interleaving memory

Country Status (1)

Country Link
WO (2) WO2005052798A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650706B (zh) * 2009-06-30 2012-02-22 重庆重邮信科通信技术有限公司 FFT branch calculation method and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609395B (zh) * 2011-12-22 2015-08-19 中国科学院自动化研究所 Variable-size block FFT computation device with a single inner/outer interleaving structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4868776A (en) * 1987-09-14 1989-09-19 Trw Inc. Fast fourier transform architecture using hybrid n-bit-serial arithmetic
US6081821A (en) * 1993-08-05 2000-06-27 The Mitre Corporation Pipelined, high-precision fast fourier transform processor
US6240062B1 (en) * 1997-05-02 2001-05-29 Sony Corporation Fast fourier transform calculating apparatus and fast fourier transform calculating method
US20030154343A1 (en) * 2001-12-25 2003-08-14 Takashi Yokokawa Interleaving apparatus and interleaving method, encoding apparatus and encoding method, and decoding apparatus and decoding mehtod

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694347A (en) * 1991-12-19 1997-12-02 Hughes Electronics Digital signal processing system
KR100192797B1 (ko) * 1996-07-01 1999-06-15 전주범 Structure of a convolutional interleaver using static RAM
FR2772950B1 (fr) * 1997-12-19 2000-03-17 St Microelectronics Sa Electronic device for computing a Fourier transform with a so-called "pipelined" architecture, and corresponding control method
US6490672B1 (en) * 1998-05-18 2002-12-03 Globespanvirata, Inc. Method for computing a fast fourier transform and associated circuit for addressing a data memory
KR20020034746A (ko) * 2000-11-03 2002-05-09 윤종용 Fast Fourier transform processor applying a high-speed and area-efficient algorithm
KR100860660B1 (ko) * 2002-01-09 2008-09-26 삼성전자주식회사 Interleaving apparatus and method for a communication system
EP1463255A1 (fr) * 2003-03-25 2004-09-29 Sony United Kingdom Limited Interleaver for mapping symbols onto the carriers of an OFDM (orthogonal frequency-division multiplexing) system

Also Published As

Publication number Publication date
WO2005052798A1 (fr) 2005-06-09

Similar Documents

Publication Publication Date Title
US20080288569A1 (en) Pipelined fft processor with memory address interleaving
US7428564B2 (en) Pipelined FFT processor with memory address interleaving
He et al. A new approach to pipeline FFT processor
US6098088A (en) Real-time pipeline fast fourier transform processors
US5500811A (en) Finite impulse response filter
US6366936B1 (en) Pipelined fast fourier transform (FFT) processor having convergent block floating point (CBFP) algorithm
AU2005269896A1 (en) A method of and apparatus for implementing fast orthogonal transforms of variable size
US5034910A (en) Systolic fast Fourier transform method and apparatus
CN101149730A (zh) Optimized discrete Fourier transform method and apparatus using the prime factor algorithm
KR20090127462A (ko) FFT/IFFT operation core
US20100128818A1 (en) Fft processor
EP1008060A1 (fr) Device and method for calculating a fast Fourier transform
WO2002091221A2 (fr) Address generator for a fast Fourier transform processor
EP2144172A1 (fr) Computation module for computing a multi-radix butterfly used in DFT calculation
CN1685309A (zh) Computationally efficient mathematical engine
JP5486226B2 (ja) Apparatus and method for calculating DFTs of various sizes according to the PFA algorithm using Ruritanian mapping
EP1076296A2 (fr) Data storage device for a fast Fourier transform
EP2144173A1 (fr) Hardware architecture for computing DFTs of different lengths
US6631167B1 (en) Process and device for transforming real data into complex symbols, in particular for the reception of phase-modulated and amplitude-modulated carriers transmitted on a telephone line
JPH08320858A (ja) Fourier transform arithmetic apparatus and method
WO2005052808A1 (fr) Pipelined FFT processor with memory address interleaving
Das et al. Efficient VLSI Architectures of Split-Radix FFT using New Distributed Arithmetic
Jones 2D systolic solution to discrete Fourier transform
US8572148B1 (en) Data reorganizer for fourier transformation of parallel data streams
CA2451167A1 (fr) Pipelined FFT processor allowing memory address interleaving

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase