GB2283592A

GB2283592A - Computational structures for the frequency-domain analysis of signals and systems

Info

Publication number: GB2283592A
Application number: GB9322858A
Authority: GB
Inventors: Duraisamy Sundararajan; Mohammad Omair Ahmad; Mayasandra Nanjundiah Swamy Sr
Original assignee: Individual
Current assignee: Individual
Priority date: 1992-12-23
Filing date: 1993-11-05
Publication date: 1995-05-10
Anticipated expiration: 2013-11-05
Also published as: CA2086174C; GB2283592B; GB9322858D0; CA2086174A1

Abstract

The present invention allows analysis of signals and systems by providing a family of structures for the computation of the discrete Fourier transform (DFT) of a discrete signal of N samples. A member of this set of computational structures is characterized by two parameters, u and v, where u (u = 2<r>,r = 1,2,..,(log2 N) - 1) specifies the size of each data vector applied at the two input nodes of a butterfly and v represents the number of consecutive stages of the structure whose multiplication operations are merged partially or fully. The nature of the problem of computing the DFT is such that the sub-family of the structures with u = 2 suits best for achieving its solution. These structures have the features that eliminate or reduce the drawbacks of the existing structures while retaining their simplicity and regularity. A comprehensive description of the two most useful structures from this sub-family along with their hardware and software implementations is presented. The hardware includes a pipe-lined arrangement of adders, subtracters, multipliers and plus-minus units. <IMAGE>

Description

Figure 21. The flowchart of the 2 x 2 PM DIF DFT structure for 8 x 8 complex-valued 2-D input data. A twiddle factor W8S is shown only by its exponent 5.

It is a power of 2 with a positive integer exponent in the range 0 to M - 1. It is possible to combine fully or partially the multiplication operations of two or more consecutive stages in the PM structures in order to reduce the total number of multiplication operations in the structures. The letter v, in the range 1 to (M - log2 u), specifies the number of stages whose multiplications are combined. For the Cooley-Tukey structure v could range from 1 to M indicating that the multiplication operations of up to M stages can be combined.

Though the set of PM computational structures is quite large, the most useful structures are those with it = 2 and v = 1 or v = 2. These structures provide several advantages such as reduced number of multiplication and addition operations, no independent data swapping operations, and reduced overheads of bit-reversals and array-index updating resulting in efficient hardware and software implementations over the Cooley-Tukey radix-2 structure.

The set of the PM computational structures can be partitioned into subsets of structures each characterized by a value of it, the number of elements in the input-output vectors of a butterfly. The decomposition process of the computation of an N-point DFT into 2-point DFTs for each subset requires the incorporation of the vectorization process in the definition of the DFT corresponding to the value of it of that subset. Starting from the definition of the DFT, we present an equivalent vectorized format for any value of u.

The DFT of a data sequence {x(n), n = 0, 1,..., AT - 1) is defined as

where WN = exp-j##. For a positive integer u as defined earlier, (1) can be rewritten as X(k) = {x(0) + x(1N/u)WN1kN/u + ... + x((u-1)N/u)WN(u-1)kN/u}WN0k +{x(1) + x(1 + 1N/u)WN1kN/u + ... + x(1 + (u - 1)N/u)WN(u-1)kN/u}WN1k + .

.

+{x(N/u - 1) + x(2N/u - 1)WN1kN/u + ... + x(uN/u - 1)WN(u-1)kN/u}WN(N/u-1)k (2) Since WNN/u = Wu1, (2) can be expressed as X(k) = {x(0) + x(1 )Wu1k + ... + x((u-1)N/u)Wu(u-1)k}WN0k U + x(1 + N u u .

.

+{x(N/u - 1) + x(2N/u - 1)Wu1k + ... + x(uN/u - 1)Wu((u-1)k}WN(N/u-1)k Note that the expressions inside each pair of braces is a u-point DFT and when evaluated will give rise to it distinct values fork = 0,1...,u - 1. Let a(n)(n = 0,1,...,N/u - 1) denote vectors, each consisting of u-elements. The vector a(n) consists of the nth u-point transform values a,(n)(q = 0,1,..., u-1) as defined below. a(n) = {a0(n),a1(n),...,au-1(n)} = {values of the u - point DFT of {x(n + 0N/u),x(n + 1N/u),...,x(n + (u - 1)N/u)}}, n = 0,1,..., 1. (4) it Therefore, (3) can be rewritten as X(k) = aq(0)WN0k + aq(1)WNk + ... + aq(N/u - 1)WN(N/u-1)k

Since the values of the u-point transform aq(n) are periodic with a period of u, for values of k equal to or greater than u in (5), the values aq(n) repeat. Therefore, q = k mod it (6) Replacing the argument k by k + pN/u in (5) and (6) yields X(k + pN/u) = aq(0)WN0(k+pN/u) + aq(1)WN1(k+pN/u) + ... + aq(N/u - 1)WN(N/u-1)(k+pN/u), k = 0,1,...,N/u - 1, p = 0,1,...,u - 1 (7) 2Vectors will be denoted by boldfaced letters q = (k + p) mod u (8) Note that the expression for q as given by (8) simplifies to q = k mod it for u # a Let us vectorize the output transform values X(k) as A(k) = {A0(k),A1(k),...,Au-1(k)} = {X(k + pN/u),p = 0,1,...,u-1}, k = 0,1...,N/u - 1. (9) Therefore, (7) can be expressed as Ap(k) = aq(0)WN0k(WNN/u)0p + aq(1)WN1k(WNN/u)1p + ... + aq(N/u - 1(WN(N/u-1),(WNN/u)(N/u-1)p = aq(0)WN0k(Wu1)0p + aq(1)WN1k(Wu1)1p + ... + aq(N/u - 1)WN(N/u-1)k(Wu1)(N/u-1)p (10) With the vectorization of the input quantities z(n) as carried out in (4) and the output quantities X(k) as carried out in (9) and from (10), the DFT as defined by (1) can be equivalently written as

where q = (k + pN/u) mod it and p = 0,1,.. .,u - 1. Thus, the PM structures must have a part in which it-element vectors are formed by implementing u-point transforms.

The inverse DFT (IDFT) is given by

A set of expressions similar to those given by (4), (9), and (11) can be obtained for the case of IDFT. The input quantities X(k) are vectorized as B(k) = {B0(k),B1(k),...,Bu-1(k)} = {values of the u-point IDFT of {X(k + 0N/u,X(k + 1N/u),...,X(k + (u - 1)N/u)}}, k = 0,1,...,N/u - 1. (13) The output data values z(n) are vectorized as b(n) = {b0(n),b1(n),...,bu-1(n)} = {x(n + pN/u),p = 0,1,...,u - 1}, n = 0,1,...,N/u - 1. (14) An expression similar to (11) for computing the IDFT can be obtained from (12) using (13) and (14) as

where q = (n + pN/u) mod u and p = 0,1,...,u - 1.

In the subset of the computational structures with u = 2, data samples are organized as 2-element vectors. For the specific value of u = 2, (4), (9), and (11) become, respectively, as a(n) = {a0(n),a1(n)} = {x(n) + x(n + N/2),x(n) - x(n + N 16) A(k) = {A0(k),A1(k)} = {X(k),X(k + N/2)}, k = 0,1,...,N/2 - 1 (17)

where p = O or 1 represents the first or the second element, respectively, of the output vector A(k), and q = k mod 2, N > 2 Similarly, in the case of IDFT, for u = 2, (13), (14), and (15) become, respectively, as B(k) = {B0(k), B1(k)) = {X(k) + X(k + N2), X(k) - X(k + 2 2 k = 0,1...,N/2 - 1 (19) Note that the process of forming the vectors B(k) differs from the one using (12), in that the values are not divided by 2, the value of the divisor N outside the summation in (12).

This is done to avoid the division by a constant more than once. b(n) = (bo(n), bl(n)) = {:z;(n),z(n+ N2)), N 2 -1 (20)

where p = 0 or 1 represents the first or the second element, respectively, of the output vector b(n), and q = n mod 2, N > 2 Equations (18) and (21) can have direct hardware or software implementation to compute the DFT and IDFT, respectively. The decomposition of these equations can be expected to yield faster and more efficient computational structures. Therefore, the procedure for the decomposition of these equations and the derived structures will be presented in detail in the following sections. The procedure of obtaining the computational structures for the parameter values of u > 2 follow along the same lines. However, it should be noted that the number of array-index updatings in the PM structures will be proportional to NU due to the vectorization of the data samples. The Cooley-Tukey structure for computing DFT and IDFT is derived by decomposing (1) and (12) in which the input and output quantities to and from the butterflies are scalar values. Therefore, for the Cooley-Tukey computational structure, it = 1.

3 The 2 x 1 PM DIT DFT Computational Structure for Complex-Valued Data The 2 x 1 PM DIT computational structure to compute the DFT for complex-valued input data is derived by decomposing (18) corresponding to a number of computational stages where the multiplication operations of consecutive stages are not combined.

Decomposition of (18) corresponding to the even-indexed and the odd-indexed input samples aq(n) yields

Since (-1)p(2n) = 1 and WN2nk = WN/2nk, (22) cam be rewritten as

The output transform (Ap(k + N/4) which is N/4 samples ahead of the kth sample Ap(k), can be obtained by replacing k by k + N/4 in (23). The resultion equation, by noting that WNN/4 = -j and WN/2N/4 =-1, can be written as

k = 0,1,...,N/4 - 1 (24) Using (18), the transform of the N even-indexed input samples a9(n), represented by A0e(k) and A1e(k), respectively, is given by

and

and the transform of the N/4 odd-indexed input samples a9(n), represented by A0o(k) and A1o(k), respectively, is given by

and

where k = 0,1,...,N/4-1. Now (23) and (24), using the smaller size transforms given above can be written as Ap(k) = A0e(k) + (-1)pWNkA0o(k) (25) and Ap(k + N/4) = A1e(k) + (-1)p)(-j)WNkA1o(k) (26) where k = 0,1,..., N/4-1. From (25) and (26), we see that in order to compute the transform values Ap(k) and Ap(k+ N), the precomputation of Ao(k) and Al(k), and Ao(k) and Al (k) is required. Thus, the problem of computing an N-point DFT has been decomposed into a problem of computing two N/2-point DFTs. This process of decomposition can be continued.

In general, the relations governing the basic computation at the rth stage can be deduced from (25) and (26) as A(r+l)(h) = A(r)(h) + A(j)(1)WNS A1(r+1)(h) = A0(r)(h) - A0(r)(l)WN3 A0(r+1)(l) = A1(r)(h) + A1(r)(l)WN3+N/4 A1(r+1)(l) = A1(r)(h) - A1(r)(l)WN3+N/4 where s is an integer whose value depends on the stage of computation r and the index h.

These equations characterize the input-output relations of a butterfly, the basic computing unit, shown in Fig. 1. These equations represent two 2-point DFTs after the input values indexed with I are multiplied with appropriate twiddle factors. In the general case, each butterfly of the PM structures compute it 2-point transforms.

Repeated use of (25) and (26) yields the flowchart of the computational structure for a specific value of N. The set of butterflies for the last stage is found by allowing k to vary from 0 to N4 - 1 in (25) and (26). The output vectors of the preceding stage are split into two sets of N vector DFTs. The set of butterflies for each of the two independent sets of this stage is found by allowing k to vary from 0 to N8 ~ 1 in equations obtained from (25) and (26) by replacing N by N2 . As seen from (25) and (26) ( which correspond to the last stage), h and 1 of the general butterfly differ by a number that ranges from 1 in the first stage to 2N in the last stage. The process of decomposition stops when the size of the transforms reduces to two vectors. The flowchart of the 2 x 1 PM DIT DFT computational structure for N = 32 is shown in Fig. 2.

The 2 x 1 PM DIT structure for computing IDFT can be obtained by decomposing (21) in a similar way. The process will lead to relations similar to (25) and (26) with the sign of the exponents of WN negated and -j replaced by j. Finally, each output data vector component bg(n) is divided by N.

4 The 2 x 1 PM DIF DFT Computational Structure for Complex-Valued Data.

The 2 x 1 PM DIF computational structure to compute the DFT for complex-valued input data is derived by decomposing (18) corresponding to a number of computational stages where the multiplication operations of consecutive stages are not combined.

Decomposition of (18) corresponding to the first-half and the second-half of the input samples aq(n) yields

Replacing n by n + 4 in the second summation of the right side of (27) gives

The pair of summations in (28) can be combined into a single one, giving

The even-indexed and odd-indexed transform values Ap(k) are readily obtained from (29) by replacing k by 2k and k by 2k + 1, respectively, as

and

where k = 0,1,...,N/4 N 1. From (30) and (31), we see that the input values aq(n) and aq(n + N/4) are combined to reduce the problem size. The problem of computing an N-point DFT, therefore, has been decomposed into a problem of computing two- N2-point DFTs.

This process of decomposition can be continued. In general, the relations governing the basic computation at the rth stage can be deduced from (30) and (31) as a(rl)(h) = alr)(h) + a(T)(Z) a(r+l)(h) = a(r)(h) - ao(r)(1) a0(r+1)(l) = a1(r)(h)WN3 + a1(r)(l)WN3+N/4 a1(r+1)(l) = a1(r)(h)WN3 - a1(r)(l)WN3+N/4 where s is an integer whose value depends on the stage of computation r and the index h.

These equations characterize the input-output relations of a butterfly, the basic computing unit, shown in Fig. 3.

Repeated use of (30) and (31) yields the flowchart of the computational structure for a specific value of N. The set of butterflies for the first stage is found by allowing k to vary from 0 to N4 ~ 1 in (30) and (31). The input vectors of the succeeding stage are split into two sets of N-vector DFTs. The set of butterflies for each of the two independent sets for this stage is found by allowing k to vary from 0 to N/8-1 in equations obtained from (30) and (31) by replacing N by N. The process of decomposition stops when the size of the transforms reduces to two vectors. The flowchart of the 2 x 1 PM DIF DFT computational structure for N = 32 is shown in Fig. 4.

The 2 x 1 PM DIF structure for computing IDFT can be obtained by decomposing (21) in a similar way. The process will lead to relations similar to (30) and (31) with the sign of the exponents of WN negated and -j replaced by j. Finally, each output data vector component bq(n) is divided by N.

5 Hardware Realization of the 2 x 1 PM DIT Structures Fig. 5 shows a possible hardware realization of the butterfly of the 2 x 1 PM DIT DFT structure. Several of these units can be used, and in the ultimate realization all the stages can be realized with complete hardware without any software at all. For the sake of discussion, assume that only one butterfly is available in a computing system as part of an arithmetic unit. The butterfly unit could be made for different precisions and types of arithmetics. The butterfly essentially consists of four basic arithmetic units: multipliers, adders, subtracters, and plus-minus units marked, respectively, as A, B, C, and D. The + and - symbols on a subtracter indicates the minuend and subtrahend, respectively. A plus-minus unit has a + input and a +/- input and two outputs. The + output is simply the addition of the two inputs, whereas the - output is the difference of + and +/- inputs.

As the sum and the difference produced by the plus-minus units are stored in consecutive locations in the memory, both the results need only one destination address and can be efficiently moved to the memory as a long word. Each butterfly has two twiddle factors WNS and WNs+N4. Assuming that W N = C + jS, W N+ 4 = 5 - jc. Note that the inputs to the butterfly are decomposed into their real and imaginary parts before they are applied to multiplier or plus-minus units. Similarly, the outputs from the butterfly are also in decomposed form. Specifically, A(r) = At(') + jAl r) (i = 0, 1). The multipliers, the adders and subtracters, and the plus-minus units of a butterfly can be used as a threestages of a pipeline architecture. For each cycle, one set of four complex-valued outputs are produced from four complex-valued input signals. Each stage of the pipeline is busy with the next set of data. The operation of the pipeline needs buffer units between the stages and control circuits that are not shown in Fig. 5.

Fig. 6 shows the hardware realization of the 2 x 1 PM DIT DFT structure for complexvalued input data with N = 8. The first part of structure, i.e., the part to the left of the vertical solid line, implements the 2-point transforms to form the vectors and the swapping of the data. The second part of the structure, the part to the right of the solid line consists of two stages separated by a vertical dashed line. The first stage has no multiplier units as the twiddle factors for this stage are all either 1 or -j. The hardware realization for the 2 x 1 PM DIF structures will contain same number of multipliers, adders, etc., but the interconnections will be different.

6 Software Implementation of the 2 x 1 PM Structures for Complex-Valued Data The software implementation of the 2 x 1 PM DIT structure consists of a three-loop construct similar to the implementation of the Cooley-Tukey structures. In order to reduce the number of complex multiplications, it is a usual practice to derive special butterfly relations when the multiplications are to be made with twiddle factors WN0, WN#N/2, WN#N/4, N and WNiW. z z If no special butterflies are set up, it is called a onebutterfiy implementation. A two-butterfly implementation consists of an additional butterfly to compute efficiently the multiplications with twiddle factors WN0 and WNN/4. In a three-butterfly implementation, a @N third butterfly is setup to handle multiplications with twiddle factor W N 8. A C language implementation of the 3-butterfly 2 x 1 PM DIT structure ior complex-valued input data is given in Sections A and B of the appendix. Software implementation of the 2 x 1 PM DIF structure can be obtained from that of the DIT structure by making a few changes.

A software implementation to compute the IDFT can also be obtained easily. However, it is practical and much easier to use the DFT implementation for computing the IL)FT by interchanging the real and imaginary parts of the data during their read and write operations and finally dividing the output values by N, the length of the data set.

7 Performance of the Software Implementation of the 2 x 1 PM Structures The number of each of the three operations of real multiplication, real addition, and real twiddle factor computation required by the 3.butterfly software implementation of the 2 x 1 PM structures for complex-valued data with various values of N is given in the second, third, and fourth columns, respectively, of Table I. In this implementation, the twiddle factors numbering less than N2 are generated as required. The Cooley-Tukey radix-2 structures require a 4-butterfly implementation to achieve the performance figures of Table I.

The vector representation of the data in the PM structures, however, provides additional advantages not achievable by the Cooley-Tukey structures. Irrespective of the scheme used for the bit-reversal process, the number of bit reversals in the 2 x 1 PM structures is one-half of that required in the Cooley-Tukey structures, since this process is carried out on vector locations and there are only N2 vectors. In general, the PM structures require only N bitreversals in order to arrange the data in the required sequence at the input or the output of the structure. For the same reason, the number of array-index updatings required is also reduced by one-half. In general, the number of array-index updatings in a PM structure is proportional to NU. Another advantage related to the vector representation of the data in the PM structures is that an independent data swapping operation is not required. This is due to the fact that the data values with which a 2-point transform is made in the first stage of the Cooley-Tukey structure are stored at a memory location assigned for a vector in the vector formation stage of the 2 x 1 PM structures. These vectors can be swapped at the time of vector formation. It is to be noted that in order to eliminate independent data swapping operation in DIF structures, a version in which data is scrambled at the input can be used. In this case, however, either the twiddle factors or the addresses of the input vectors to the butterfly must be generated in bit-reversed order. The 2 x 1 PM DIT structures are more efficient compared with the DIF structure in as far as using the natural order of the input data and providing the same order for the output data.

In order to compare the efficiency of the 2 x 1 PM DIT structure with that of the Cooley-Tukey radix-2 structure, a 4-butterfly C language implementation of the latter was coded. The execution times of the two structures were recorded over 100 runs for the sets of input data each of 4096 samples. The software implementation of the 2 x 1 PM DIT structure was found to be 9% faster on a modern workstation.

8 The 2 x 1 PM DIT DFT Computational Structure for Real-Valued Data The transform of a set of real-valued input data is Hermitian-symmetric, i.e., the real part of the transform is even symmetric and the imaginary part is odd symmetric [4]. For an even, the transform values X(0) and X(N) are real-valued and distinct and X(k) = X*(N - k), 1 # k # N/2 - 1.

Therefore, for a specific value of N, only the first-half of the butterflies of each group of the computational structure corresponding to complex-valued data are computed during the computation of any stage. Due to this reason, the memory requirement is reduced by one half and the computation requirement is slightly less than one half of that of the structures for complex-valued input data. Therefore, the computational structure is similar to but only one half in size corresponding to complex-valued input data.

In the subset of the computational structures with it = 2, data samples are organized as 2-element vectors. The storage of the vectors differs from that of the complex-valued data.

For the specific value of it = 2, vectors defined in (16) and (17) are modified, respectively, as a'(n) = {a(n),a(n + N/4)} = 3N 3N {{x(n) + x(n + N(2),x(n) - x(n + N/2)},{x(n + N/4 + x(n+ ),x(n + N/4) - x(n + )}}, 4 4 n = 0,1,...,N/4 - 1 (32) A'(Q) = {A(0),A0(N/4)} = 4 2 4 A'(k) = {A(k)} = {X(k),X(k + N/2)}, N k = 1,2,..., - 1 (33) 4 Fig. 7 shows the flowchart of the 2 x 1 PM DIT DFT computational structure for real- valued input data with N = 32. Although the structure has the same number of stages as its complex-valued counterpart given in Fig. 2, there are some differences that must be noted. The number of butterflies in each stage is one half. In the first stage, 4-point transforms are computed by merging two 2-point transforms. As a vector contains two 2point transforms, the four point transform is computed using the elements of a single vector.

This operation in Fig. 7 is marked as f indicating the computation of a 4-point transform in the flowchart. In the implementation, however, the vector formation, its scrambling, and the computation of the 4-point transform are all carried out at the same time. Since the first output vector has three transform values, the first butterflies in each group marked g is to be interpreted differently. While the second vector is computed in a regular manner, the computation of the first vector consists of computing the first transform value and two others that are separated from it by one-quarter and one-half of the samples in the group.

The other butterflies are similar to the structures of the complex-valued input data. As only the first half of each group of butterflies in each stage corresponding to the structure for complex-valued data are computed, negation, swapping and conjugation operations must be carried out in order to get the first half of the output transform vectors. The conjugation operations are shown in Fig. 7 by the symbol *. These operations are shown in Fig.

7 as post-extended operations of each stage, but in the implementation they are merged with the main stage and they do not pose any additional computational complexity. The hardware and software implementation of the PM structures for real-valued input data for computing DFT is similar to PM DIT DFT structures for complex-valued input data with the differences as explained above.

9 The 2 x 1 PM DIF IDFT Computational Structure for Real-Valued Data While the redundancies due to real-valued data can be exploited in the development of either the DIT or the DIF structures, it is easier to implement the DFT computation using the DIT structure and implement the IDFT computation using the DIF structures In the case of IDFT, for the specific value of u = 2, the vector definitions (19) and (20) are modified, respectively, as N B(0) = {B(0),B( )} = 4 {{X(0) + X(N/2),X(0) - X(N/2)},{X(N/4) + X*(N/4),X(N/4) - X*(N/4)} N N B'(k) = {B(k)} = {X(k) + X(k + ),(X(k + ))}, 2 2 k = 1,2,...,N/4 - 1 (34) 3N b'(n) = {b(n),b(n + N/4)} = {{x(n),x(n + N/2)},{x(n + N/4),x(n + )}}, 4 n = 0,1,...,N/4 - 1 (35) Fig. 8 shows the 2 x 1 PM DIF IDFT computational structure for real-valued input data with N = 32. In computing the IDFT, the input is always the first half of the vectors but the computation requires first quarter and third quarter of the input vectors. Therefore, similar to the forward transform, the conversion of the second quarter data into that of the third quarter data requires the operations of negation, data swapping, and conjugation. In the IDFT these operations are carried out before the butterfly operations of each stage, i.e., each stage is pre-extended. The conjugation and data swapping Operations are shown in the pre-extended stages but they are combined with the butterfly operations without posing any additional computational complexity. Since the first vector has three real values and one imaginary value that are independent, the first butterfly marked g of each group must be interpreted differently. While the second vector is computed in the same manner as for the complex-valued data, the computation of the first vector consists of computing the first inverse transform value and three others that are separated from it by one-quarter, one-half, and three-quarter of the samples of each group. In the last stage, two 2-point IDFTs are computed in a butterfly as usual. However, all the four input and output values are stored in a single vector and this operation is marked as f in Fig. 8. This operation of this stage can be combined with the operations of scrambling and dividing the transform values by N. The hardware and software implementation of the PM structures for real-valued input data for computing IDFT is similar to PM DIF IDFT structures for complex-valued input data with the differences as explained above.

10 The 2 x 2 PM DIT DFT Computational Structure for Complex-Valued Data In the 2 x 2 PM structures the data values are 2-element vectors and the multiplication operations of each pair of adjacent stages are combined.

To derive the 2 x 2 PM DIT structure for computing the DFT, consider the input-output relations (25) and (26) for the last stage of the 2 x 1 PM DIT structure, as given by Ap(k) = Ao(k) + (1)P(wNkAoo(k)) (36) and N Ap(k + ) = A1(k) + (-1)@(-j)(WN@A1@(k)) (37) where k = 0,1,...,N/4-1. The output transform values Ap(k + N/8) and Ap(k + ) which are 8 N/8 samples ahead of those given by (36) and (37), respectively, can be obtained by replacing k by k + N8 in these equations and the resulting equations are given by Ap(k + N/8) = A0e(k + N/8) + (@)p(WNk+N/8A0o(k + N/8)) (38) and 3N Ap(k + ) = A1e(k + N/8) + (-1)p(-j)(WNk+N/8A1o(k + N/8)) (39) 8 where k = 0,1,...,N/8 N/8 - 1. The variables Aqe(i) and Aqo(i) (q = 0,1; i = k,k + N8) on the right side of (36) through (39) represent the output values of the previous stage. These values, in turn, can be obtained recursively as Ape(k) = A0ee(k) + (-1)pWN2kA0eo(k), (40) Ape(k + N/8) = A1ee(k) + (-1)p(-j)WN2kA1eo(k), (41) Apo(k) = A0oe(k) + (-1)pWN2kA0oo(k), (42) and Apo(k + N/8) = A1oe(k) + (-1)p(-j)WN2kA1oo(k), (43) where k = 0,1,. ,N8 1. Since the terms Aqo(k) and A0,(k + N/8) of (36) through (39) have twiddle factors associated with them, (42) and (43) can be modified accordingly as WNkApo(k) = WNkA0oe(k) + (-1)pWN3kA0oo(k) (44) and WNk+N/8Apo(k + N/8) = WNk+N/8A1oe(k) + (-1)p(-j)WN3k+N/8A1oo(k) (45) where k = 0,1,..., N/8 - 1. From (44) and (45), we see that the butterfly of the 2 a butterfly of the 2 x 2 PM DIT structure with four twiddle factors shown in Fig. 9 can be deduced from (44) and (45) as A(r+l)(h) = A(r)(h)wN + A(r)(l)WN A(1r+1)(h) = Afr)(h)WNs - A0(r)(l)WN3s 3N A0(r+1)(l) = A1(r)(h)WNs+N/8 + A1(r)(l)WN3s+8 3N A1(r+1)(l) = A1(r)(h)WNs+N/8 + A1(r)(l)WN3s+8 where s is an integer whose value depends on the stage of computation r and on the index h.

Fig. 10 shows the flowchart of the 2 x 2 PM DIT structure for N = 32. There are four stages of computation after the formation of the input vectors. The multiplication operations of the first and the second stages and those of the third and the fourth stages are separately combined. However, if M is even, there is an odd number of stages after the formation of the input vectors. In this case, for the purpose of merging of multiplication operations, stages are paired starting from the last stage. The processing of the left out first stage is done in the same way as in the 2 x 1 PM DIT structure. The reason of not pairing this stage with the second stage is that it has only trivial multiplications. As an example, the flowchart for the 2 x 2 PM DIT structure for N = 16, (i.e., M = 4) is shown in Fig. 11.

The 2 x 2 PM DIT structure, as implemented by (36) through (39), (40), (41), (44), and (45), saves 2558 of complex multiplications when compared with the implementation of the 2 x 1 PM DIT structure. The multiplications in two adjacent stages in the 2 x 2 PM DIT structure have been combined such that the second stage of the pair has multiplications only by -j and the first one has less number of multiplications than those of the two stages together of the 2 x 1 PM DIT structure. In general, multiplication operation of two or more consecutive stages of a PM structure can be combined together in order to reduce such operations.

11 The 2 x 2 PM DIF DFT Computational Structure for Complex-Valued Data We will now derive the 2 x 2 PM DIF structure for computing the DFT. Expanding (30) such that the summation is only over N/8 terms and noting that WN/2N/8 = -j, yields

Changing k to 2k in (46) gives

3N (-1)pN/8(-1)k(a0(n + N/8) + (-1)pN/4a0(n+ ))}WN/4nk, 8 k = 0,1,...,N/8 - 1 (47) Similarly, replacing k by 2k + 1 in (46) yields

3N + N/8) - (-1)pN/4a0(n + ))}WN2nWN/4nk,, 8 k = 0,1,...,N/8 - 1 (48) Expanding (31) such that the summation is only over 8N terms yields

3N -1)pN/8WNN/8(-j)k(a1(n + N/8) + (-1)pN/4(-j)(-1)ka1(n + ))}WNnWN/2nk, 8 k = 0,1,...,N/4 - 1 (49) Changing k to 2k in (49) gives

3N (-1)pN/8WNN/8(-1)k(a1(n + N/8) + (-1)pN/4(-j)a1(n + ))}WNnWN/4nk, 8 k = 0,1,...,N/8 - 1 (50) Similarly, replacing k by 2k + 1 in (49) yields

3N (-1)pN/8(-j)WNN/8(-1)k(a1(n + N/8) - (-1)pN/4(-j)a1(n + ))}WN3nWN/4nk, 8 k = 0,1,...,N/8 - 1 (51) The input-output relations of a butterfly of the 2 x 2 PM DIF structures with four twiddle factors shown in Fig. 12, can be deduced from (50) and (51) as a0(r+1)(h) = a0(r)(h)WNs + a0(r)(l)WNs+N/8 a1(r+1)(h) = a0(r)(h)WNs - a0(r)(l)WNs+N/8 3N a0(r+1)(l) = a1(r)(h)WN3s + a1(r)(l)WN3s+8 3N a1(r+1)(l) = a1(r)(h)WN3s - a1(r)(l)WN3s+8 where s is an integer whose value depends on the stage of computation r and on the index h.

Fig. 13 shows the flowchart of the 2 x 2 PM DIF structure for N = 32. If M is even, there is an odd number of stages after the formation of the input vectors. In this case, in contrast to the 2 x 2 PM DIT structure, stages are paired starting from the first stage for the purpose of merging of multiplication operations. The processing of the left out last stage is done in a similar way as in the 2 x 1 PM DIF structure. The reason for not pairing the last stage with the preceding one is that it has only trivial multiplications. As an example, the flowchart for the 2 x 2 PM DIF structure for N = 16, (i.e., M = 4) is shown in Fig.

14. The 2 x 2 PM DIF structure, as implemented by (47), (48), (50), and (51), saves 25% of complex multiplications when compared with the implementation of the 2 x 1 PM DIF structure. The multiplications in two adjacent stages in the 2 x 2 PM DIF structure have been combined such that the first stage of the pair has multiplications only by -j and the second one has less number of multiplications than those of the two stages together of the 2 x 1 PM DIF structure.

12 Hardware Realization of the 2 x 2 PM DIT Structures Fig. 15 shows a possible hardware realization of the smallest basic computing unit of the 2 x 2 PM DIT DFT structure that repeats itself. It consists of 2 pairs of butterflies, with butterflies in each pair coming from two consecutive stages. As seen from Fig. 5, a single butterfiy realization has 8 multipliers, 2 adders, 2 subtracters, and 4 plus-minus units. Therefore, the 2 x 2 basic computing unit realization of Fig. 15 has a savings of 25% each for multipliers, adders, and subtracters. The basic computing unit produces 8 complex outputs in a cycle when operated as a pipeline. As the 2 x 2 PM structures have nontrivial multiplications in one of the two consecutive stages, the pipeline of the basic computing unit has only four stages compared with the 6 stages of an equivalent implementation for the 2 x 1 PM butterfly. The hardware realization for the 2 x 2 PM DIF structures will contain same number of multipliers, adders, etc., but the interconnections will be different.

13 Software Implementation of the 2 x 2 PM Structures for Complex-Valued Data The implementation of the 2 x 2 PM structures consists of a three-loop construct similar to the implementation of the 2 x 1 PM structures. A C language implementation of the 3-butterfly 2 x 2 PM DIT structure is given in Sections A and C of the appendix. Since only one of the two consecutive stages has non-trivial multiplications, the computations of the stages are carried out in pairs at a time. For the pair of two stages, sets of twiddle factors are generated. In the generation of each set of twiddle factors, only one complex twiddle factor need to be generated by a direct function call. The rest in the set are related to it and are computed by using algebraic expressions requiring on the average 2 real operations per real twiddle factor evaluation. This way of generating twiddle factors requires less effort while retaining good accuracy.

14 Performance of the Software Implementation of the 2 x 2 PM Structures For the 3-butterfly implementation of the 2 x 2 PM structures, the number of real multiplications and additions for complex-valued data with various values of N are given in the second and the third columns, respectively, of Table II. The Cooley-Tukey radix-2 structure provides only the counts of multiplications and additions as given in Table I. It is to be noted that the number of complex multiplications in the 2 x 2 PM structures is 25% less than in the Cooley-Tukey radix-2 structure. The number of twiddle factors generated by function calls and those computed using algebraic expressions are shown in columns four and five, respectively, of Table II. The number of real twiddle factors generated by direct function calls is less than N The number of twiddle factors generated through algebraic 6 expression is less than 2N The advantages of no data swapping and less bit-reversals of the 2 x 1 PM structures are also retained. Since the processing of the stages are done in pairs, the array-index updatings is further reduced by a factor of two as compared with that of the 2 x 1 PM structures.

In order to compare the efficiency of the 2 x 2 PM DIT structure with that of the Cooley-Tukey radix-2 structure, a 4-butterfly C language implementation of the latter was coded. The execution times of the two structures were recorded over 100 runs for the sets of input data each of 4096 samples. The software implementation of the 2 x 2 PM DIT structure was found to be 23% faster on a modern workstation.

15 The 2 x 2 PM DIT DFT Computational Structure for Real-Valued Data The 2 x 2 PM DIT DFT structure for real-valued data is similar to the 2 x 1 PM DIT DFT structure with the advantage of further saving of 25% of complex multiplications by merging the multiplication operations of consecutive stages. Fig. 16 shows the 2 x 2 PM DIT DFT structure for real-valued data. Similar to the implementation of the 2 x 2 PM DIT structure for complex-valued data, it is efficient to process two stages at a time.

16 The 2 x 2 PM DIF IDFT Computational Structure for Real-Valued Data The 2 x 2 PM DIF IDFT structure for real-valued data is similar to the 2 x 1 PM DIF IDFT structure with the advantage of further saving of 25% of complex multiplications by merging the multiplication operations of consecutive stages. Fig. 17 shows the 2 x 2 PM DIF IDFT structure for real-valued data. Similar to the implementation of the 2 x 2 PM DIF structure for complex-valued data, it is efficient to process two stages at a time.

17 The 2 x 1 and 2 x 2 PM Structures for 2-D Complex Valued Data The DFT and the IDFT of 2-D discrete signals is usually obtained by computing 1-D transforms of each row of data followed by l-D transforms of each column of the resulting data or vice versa. With this approach, direct application of l-D DFT or IDFT computational structures yields the required transform. Figs. 18, 19, 20, and 21 show, respectively the flowchart of the 2 x 1 DIT, 2 x 1 DIF, 2 x 2 DIT, and 2 x 2 DIS computational structures for computing 2-D DFT of 8 x 8 complex-valued data. In these figures, the twiddle are of the form W55 and are represented by their exponents. In Figs. 18 and 20, the 2-D input data is read row by row. The column transform is carried out in the first two stages. The third stage makes the vectors for the row transform carried out in the last two stages. It is possible to do the row transform first and column transform next by inputting the data column by column. In Figs. i9 and 21, the order of computation of the row and the column transforms is the reverse to that in Fig. 18. Also it is possible to use DIF structure for one transform and DIT structure for the next.

18 Architectural Suggestions As a result of the development of the PM structures, we would like to make some suggestions regarding the desirable features that the processors and programming languages used for the implementation should have in order to reduce the computation time. With the present state of the art of the computing technology, the operations of floating-point addi tion and multiplication take about the same time. In the DFT computation using efficient computational structures, the number of addition/subtraction operations is roughly two times that of the multiplications. However, it should be noted that, in most cases, for each addition operation there is a subtraction operation carried out using the same operands. In modern processors, the operand fetching time generally dominates the computation time of an operation. Further, the results of addition and subtraction of a pair of operands are stored in consecutive memory locations in the PM structures. These characteristics of the radix-2 structures can be exploited to reduce the computation time significantly provided more than one operation could be performed on the fetched operands. Therefore, we suggest that a plus-minus instruction of the type A=BC be provided in high-level languages and in the corresponding assembly languages. This type of instruction will reduce the fetching and decoding of two instructions to one and will also reduce the address calculation and fetching of four input operands to two. If, in addItion, this instruction has the feature of storing the results of the operations at consecutive memory locations, then this instruction wili also reduce the destination address calculation of two values to one, and the storage time will be shorter since the pair of computed quantities can be moved to memory as a block. This feature of a single destination address calculation and the block movement of data for the storage would be specifically beneficial to the PM structures other than the Cooley-Tukey structure. There could be other situations where the plus-minus operations are encountered. However, its immense significance to the DFT computation alone warrants its inclusion in the set of instructions of general-purpose and special-purpose computers and in the set of operators of the high-level languages.

The PM structures have reduced the bit-reversal overhead and eliminated the accompanying independent data swapping operation. The overhead of generating the twiddle factors have also been significantly reduced. However, this later task still forms a significant percentage of the execution time. A single function COS SIN(z) returning the cosine and sine of the argument z is very much desirable. This would provide the advantages of reducing the number of function calls by one-half and the effort required in evaluating the polynomials corresponding to the sine and cosine functions independently.

19 Conclusion In this disclosure, the invention of a large set of computational structures, referred to as the PM computational structures, along with their hardware and software implementations for transformation of real-valued or complex-valued one-dimensional or two-dimensional discrete signals from the time-domain to the frequency-domain and vice versa has been reported. These structures have been derived from the design of a large family of radix-2 decimation-in-time and decimation-in-frequency algorithms, referred to as the PM algorithms, to compute the discrete Fourier transform or the inverse discrete Fourier transform.

A member of the set of PM structures is characterized by two parameters, u and v where u (u = 2r, r = 0, 1,..., (log2 N) - 1) specifies the size of each data vector applied at the two input nodes of a butterfly and r represents the number of consecutive stages of the structure whose multiplication operations are merged partially or fully. The formation of the vectors of the input data allows the computation of u 2-point DFTs or IDFTs by a single butterfly.

Each computational structure essentially consists of two parts. In the first part, element vectors are formed from the N samples of the given discrete signal, where 2 < it < log2 N.

In the second part, each butterfly of an interconnected network of butterflies operating on 2 element input vectors produce 2 element output vectors.

The nature of the problem of computing the discrete Fourier transform is such that a more efficient solution is achieved by the computational structures with u = 2. In this disclosure, a detailed description of the 2 x 1 and 2 x 2 PM computational structures for real and complex-valued one and two-dimensional discrete signals have been presented. The PM structures described in this disclosure provide the advantages of less complex multiplications, less bit-reversals, less array-index updating, and no independent data swapping compared with the Cooley-Tukey radix-2 structure.

From the general principles of the present invention and the description of the two preferred embodiments, the 2 x 1 and 2 x 2 PM structures, presented in this disclosure, those skilled in the art will readily comprehend the various modifications to which the present invention is susceptible. Examples of some of the modifications are listed below: (i) Structures with butterflies that handle vectors of more than one size can be derived. (ii) Structures with varying vector lengths from stage to stage can be realized. (iii) Structures for any positive integer value of the data length N and for vector length u less than N can be designed. (iv) Structures for m-dimensional signals (m > 2) can be designed. (v) Higher-radix structures can be designed. (vi) Structures in which multiplication operations of more than two consecutive stages merged can be derived. For example, when the multiplication operations of three consecutive stages are merged in the 2 x 3 PM structure an additional savings of 81/3% in real multiplication operations over that provided by the 2 x 2 PM structures is obtained.

REFERENCES 1. Cooley, J. W. and Tukey, J. W. "An Algorithm for the Machine Calculation of Com plex Fourier Series ," Math. Computation., vol. 19, Apr. 1965, pp. 297-301.

2. Gentleman, W. M. and Sande, G. "Fast Fourier Transforms-for Fun and Profit,"in Proc. AFIPS, Joint Computer Conf., vol. 29, 1966, pp. 563-578.

3. Special Issue on FFT and Applications, IEEE Trans. Audio Electroacoust., vol. AU 15, June 1967.

4. G. D. Bergland "A Radix-Eight Fast Fourier Transform Subroutine for Real-Valued Series,", IEEE Trans. Audio Electroacoust., vol. AU-17, June 1969, pp. 138-144.

Appendix Software Implementation of 2 x v (v = 1,2) PM DIT DFT Structures for Complex-Valued Data The software implementation of 2 x v PM structures consists of several functions and an header file defining constants and macros. The header file and most of the functions are common to several PM structures. The header file and the common functions are presented in Section A. Specific functions for the 2 xl and 2 x2 PM DIT DFT structures are described in Sections B and C, respectively. Double precision format is used to represent the data values. All the functions are coded in the C language. The implementation can be modified to suit specific applications. For the best execution times of the software implementation of the PM structures, it is recommended to use the optimization option of the compiler.

A The Common Functions and the Header File /* This is the header file called pmSxv.h. The data length N and M = log2N are defined and must be set by the user. Various constants that are related to the length of data, N, are defined. Complex multiplication, addition1 and subtraction operations are defined as macro functions The structure for storing the data is also defined. */ *include < math.h > &num;include < stdio.h > &num;define N 128 /* Number of data points */ define M 7 /* M = logN */ define NB2 N/2 define NB4 N/4 &num;define NB8 N/8 &num;define NB16 N/16 define NB4P1 NB4 + 1 /* Macro defining two complex multiplications */ &num;define M~PLY(p, q, r, s, dl, d2, ci, c2, d3, d4, c3, c4) \ { p = dl * ci + d2 * c2;q = d2 * ci - di * c2; \ r = d3 * c3 + d4 * c4;s = d4 * c3 - d3 * c4; \ /* Macro defining 2 complex addition and 2 complex subtraction operations */ define B~FLY(p, q, r, a, 1, m, n, o, bi, ci, b2, c2, b3, c3, b4, c4) \ { p = bl - ci;q = bi + ci; r = b2 - c2;s = b2 + c2; \ 1 = b3 - c3;m = b3 + c3; n = b4 - c4;o = b4 + c4; \ /* Data structure defined for input/output data */ struct data { double pr; double mr; double pi; double mi; } a(NB2]; /* Data array of records */ /* This is the input function called input. It is assumed that the sequence in which the input data is stored in the input file is the N real-part values followed by N imaginary-part values. Input data is read from the input file into the array of records a. The input file name must be set by the user. in,put() struct data *p, *q, *r; FILE *fp; q = a; r = q + NB2; fp = fopen ("data7", "r"); /* set input file name for (p = q; p < r; p++) fscanf (fp, "%lf", tp- > pr); for (p = q; p < r; p++) fscanf (fp, "%lf", & p- > mr); for (p = q; p < r; p++) fscanf (fp, "%lf", tp- > pi); for (p = q; p < r; p++) fscanf (fp, "%lf", tp- > mi); /* In this function called make~vec, the formation of the input vectors as defined for the 2 x V PM structures is carried out. This function is common to all the structures with input in bit-reversed order. */ make~vec() struct data *p, *r, *u, *ss, *q; int brev, bit; double ti, t2, t3, tA; brev = - NB8; p = q = a; ss = p + NB4; while (p < ss) { if (brev > = NB8) { /* Find bit-reversal */ brev -= NB8; bit = NB16; while (bit < = brev){ brev -= bit; bit = 1; brev += bit; else brev += NB8; if ((w = p) != (r = q + brev)) { /* Svap and add */ if (p > r) { w += NB4Pl;r += NB4P1; } ti = w- > pr; t3 = w- > mr; t2 = w- > pi; t4 = w- > mi; B~FLY(w- > mr, w- > pr, r- > mr, r- > pr, w- > mi, v- > pi, r- > mi, r- > pi, r- > pr, r- > mr, ti, t3, r- > pi, r- > mi, t2, t4) else { r = p + NB4P1; tl = p- > mr; t2 = r- > mr; t3 = p- > mi; t4 = r- > mi; B~FLY(p- > mr, p- > pr, r- > mr, r- > pr, p- > mi, p- > pi, r- > mi, r- > pi, p- > pr, ti, r- > pr, t2, p- > pi, t3, r- > pi, t4) p++; r = q + brev + NB4; ti = p- > pr; t3 = p- > mr; t2 = p- > pi; t4 = p- > mi; B~FLY(p- > mr, p- > pr, r- > mr, r- > pr, p- > mi, p- > pi, r- > mi, r- > pi, r- > pr, r- > mr, t1, t3, r- > pi, r- > mi, t2, t4) p++; /* This function called out~put writes the transform values in the output file as pairs of real and imaginary parts. The output file name must be set by the user. The precision of the data can be changed by modifying the format specifications. */ out~put() { struct data *p, *q, *r; FILE *fp; fp = fopen("out",11w"); /* set output file name q = a; r = q + NB2; for (p = q; p < r; p++) fprintf (fp, "%8.2f%8.2f\n", p- > pr, p- > pi); for (p = q; p < r; p++) fprintf (fp, "%8.2f%8.2f\n", p- > mr, p- > mi); B Software Implementation of the 2x1 PM DIT DFT Struc ture for Complex-Valued Data In this section, we present an implementation of the 2 x 1 PM DIT structure for the computation of the DFT. This implementation generates all the twiddle factors by function calls. It can be easily modified to store the twiddle factors in a table.

/* This function called dit~pm2x1 is a three-butterfly implementation of the 2 x 1 PM DIT algorithm. The first butterfly includes multiplications only by 1 and -j. The second butterfly includes multiplications only by ###-j-## The third butterfly handles other complex multiplications each requiring 4 real multiplications and 2 real additions. In the first and second butterflies four output values are computed. The third butterfly, whic'n also produces four output values, is used in pairs. This minimizes the number of twiddle factors generated. */ dit~pm2x1() struct data *p, *r, *v, *x, *ss, *q; int xl, x2, x4, k; double t, t2, t3, t4, c, s, tf, inc, C; for (C = cos(inc = atan(1.0)), inc *= 2.0, x4 = 2, x2 = 1, xl = 0, q = a, ss = q + NB2; x2 < NB2; xl = x2, x2 = x4, x4 = 1, inc /= 2) { /* Main loop */ for (p = q; p < ss; p += x4) i r = p + x2; /*First butterfly */ tl = r- > pr; t2 = r- > mr; t3 = p- > mi; BFL?(r- > mr, r- > pr, p- > mr, p- > pr, p- > mi, p- > pi, r- > pi, r- > mi, p- > ,rr, r- > mi, p- > pr, tl, p- > pi, r- > pi, t3, t2) if (x1 > 0) { for (p = q + x1; p < ss ; p += x4) { r = p + x2; /*Second butterfly */ tl = C * (r- > pr + r- > pi); t3 = C * (r- > pi - r- > pr); t2 = C * (r- > mi - r- > mr); t4 = C * (-r- > mi - r- > mr); B~FLY(r- > mi, r- > pi, r- > mr, r- > pr, p- > mr, p- > pr, p- > mi, p- > pi, p- > mi, t4, p- > mr, t2, p- > pr, tl, p- > pi, t3) for Ctf = inc, k = 1; k < xl; k++, tf += inc) C r = (p = q + k) + x2; w = ( x = ss - k) - x2; c = cos(tf); s = sin(tf); /* Twiddle factors */ for (; p < ss; p += x4) i /* Third butterfly */ M~PLY(t1, t3, t4, t2, r- > pr, r- > pi, c, s, r- > mr, r- > mi, c, s) B~FLY(r- > pi, r- > mi, r- > mr, r- > pr, p- > mr, p- > pr, p- > mi, p- > pi, p- > mi, t4, p- > mr, t2, p- > pr, tl, p- > pi, t3) M~PLY(tl, t3, t4, t2, x- > pr, x- > pi, s, c, x- > mr, x- > mi, s, c) B~FLY(x- > pi, x- > mi, x- > mr, x- > pr, w- > mr, w- > pr, w- > mi, w- > pi, w- > mi, t4, w- > mr, t2, w- > pr, t1, w- > pi, t3) r += x4; v -= x4; x -= x4; /* This is the main function for the 2 x 1 PM DIT DFT structure. It calls the various functions to compute the DFT of the input data. The header file and the functions are included in the "include statements". These statements are not needed if all the functions are in the same file. */ include "pm2xv .h" &num;include "in~put" &num;include "make~vec" &num;include "dit~pm2x1" &num;include "out~put" main() in~put(); make~vec(); dit~pm2x1(); out~put(); C Software Implementation of the 2x2 PM DIT DFT Struc ture for Complex-Valued Data In this section, we present an implementation of the 2 x 2 PM DIT structure for the computation of the DFT of complex-valued data. This implementation generates some of the twiddle factors by function calls while the remaining ones are computed through algebraic operations using the values returned by function calls.

/* This function called odd~cyc implements the computation of a single stage. This function is called by the dit~pm2 x 2 function when the value of M is even odd~cyc() struct data *p, *q, *r; double tl, t2, t3; for (p = tatO], r = p + 1, q = p + NB2; p < q; p += 2, r += 2) { tl = r- > pr; t2 = r- > pi; t3 = r- > mr; B~FLY(r-gt;mr, r- > pr, p- > mr, p- > pr, r- > pi, r- > mi, p- > mi, p- > pi, p- > mr, r- > mi, p- > pr, t1, p-gt;mi, t3, p- > pi, t2) /* This function called dit~pm2 x 2 is a 3-butterfly software implementation of the 2 x 2 PM DIT DFT structure for complex-valued data. The first butterfly includes multiplications by 1 and -j, and by 1 and j712 . j##. The second butterfly handles multiplications by -### - j## and by # cos 8 # j sin 8. The third butterfly handles other complex multiplications each requiring 4 real multiplications and 2 real additions. Each of the first and second butterflies are used in pairs producing eight output values. The third butterfly is used in quadruple producing sixteen output values. This process minimizes the number of twiddle factors that are to be generated. */ dit~pm2x2() { struct data *p, *q, *r, *s, *w, *x, *y, *z, *ss; double cl, s1, c2, s2, c3, s3, c4, s4, c5, s5; double r1, r2, r3, r4, r5, r6, il, i2, i3, i4; double t1, t2, t3, t4, t5, t6, t7, t8; double pp, iss, C, C1, Sl; int inc, i8, i88, ij; iss = atan(l.0); C = cos(iss); Cl = cos(iss / 2); S1 = sin(iss / 2); i8 = 1; ss = & a[NB2 - 1]; if (M X 2 == 0) { iss /= 2; i8 = 2; oddcycC);</R for (inc = i8 < < 2, i88 = iS > > 1; i8 < NB4; i8 = inc, inc < < = 2, i88 = i8 1, iss /= 4) { /* Main loop */ for (p= & a[0]; p < ss; p += inc) { 8 = (r = (q = p + i8) + i8) + i8; /* First butterfly */ B~FLY(r3, r1, r4, r2, i3, i1, i2, i4, p- > pr, q- > pr, p- > mr, q- > mi, p- > pi, q- > pi, p- > mi, q- > mr) B~FLY(r6, rS, cl, c2, t3, t7, s2, s1, r- > pr, s- > pr, r- > mr, s- > mi, r- > pi, s- > pi, r- > mi, s- > mr) t5 = C * (c2 + s2); t2 = C * (s2 - c2); t1 = C * (c1 + si); t4 = C * (s1 - c1); B~FLY(p- > mr, p- > pr, p- > mi, p- > pi, q- > mr, q- > pr, q- > mi, q- > pi, ri, rS, il, t7, r2, tS, i2, t2) B~FLY(r- > mr, r- > pr, r- > pi, r- > mi, s- > mr, s- > pr, s- > pi, s- > mi, r3, t3, i3, r6, r4, t4, i4, ti) if (i8 != 1) { for (p = & a[i88]; p < ss;p += inc) { s = (r = (q = p + i8) + i8) + i8; /* Second butterfly */ tl = C * (q- > pr + q- > pi); t5 = C * (q- > pi - q- > pr); t6 = - C * (q- > mr + q- > mi); t2 = C * (q- > mi - q- > mr); B~FLY(r3, rl, r4, r2, i3, il, i4, i2, p- > pr, t1, p- > mr, t2, p- > pi, t5, p- > mi, t6) M~PLY(t1, t5, t2, t6, r- > pr, r- > pi, Cl, S1, r- > mr, r- > mi, S1, C1) M~PLY(t3, t7, t4, t8, s- > pr, s- > pi, S1, Cl, s- > mr, s- > mi, Cl, S1) B~FLY(r6, rS, t3, tl, t2, t4, t7, t5, tl, t3, t2, t4, t5, t7, t6, t8) B~FLY(p- > mr, p- > pr, p- > mi, p- > pi, q- > mr, q- > pr, q- > mi, q- > pi, r1, r5, ii, t4, r2, t3, i2, t7) B~FLY(r- > mr, r- > pr, r- > pi, r- > mi, s- > mr, s- > pr, s- > pi, s- > mi, r3, t2, i3, r6, r4, t5, i4, t1) for (pp = iss, ij = 1; ij < i88; ij++, pp += iss) { 5 = (r = (q = (p = & aEij]) + i8) + i8) + i8; w = (x = (y = (z = & a[NB2-ij]) - i8) - i8) - i8; ci = cosCpp); si = sin(pp); /* Twiddle factors */ s2 = (s3 = cl + cl) * sl; c2 = (83 *= cl) - 1.0; c3 = c1 * ((s3 += c2) - 2.0); s3 *= sl; c4 = C * (c1 - sl); s4 = C * (c1 + sl); s5 = C * (c3 - s3); c5 = - C * (c3 + s3); form; p < ss; p += inc) { /* Third butterfly */ M~PLY(t1, t5, t6, t2, q- > pr, q- > pi, c2, s2, q- > mr, q- > mi, c2, s2) B~FLY(r3, rl, r4, r2, i3, il, i2, i4, p- > pr, tl, p- > mr, t2, p- > pi, t5, p- > mi, t6) M~PLY(t1, t5, t2, t6, r- > pr, r- > pi, cl, s1, r- > mr, r- > mi, c4, s4) M~PLY(t3, t7, t4, t8, s- > pr, s- > pi, c3, s3, s- > mr, s- > mi, cS, sS) B~FLY(r6, r5, t1, t3, t2, t4, t5, t7, tl, t3, t2, t4, t5, t7, t6, t8) B~FLY(p- > mr, p- > pr, p- > mi, p- > pi, q- > mr, q- > pr, q- > mi, q- > pi, ri, r5, il, t4, r2, t3, i2, t7) B~FLY(r- > mr, r- > pr, r- > pi, r- > mi, s- > mr, s- > pr, s- > pi, s- > mi, r3, t2, i3, r6, r4, t5, i4, t1) M~PLY(t1, t5, t6, t2, x- > pr, x- > pi, s2, c2, x- > mr, x- > mi, s2, c2) B~FLY(r3, rl, r4, r2, i3, ii, i2, i4, v- > pr, tl, w- > mr, t2, w- > pi, t5, u- > mi, te) M~PLY(t1, t5, t2, t6, y- > pr, y- > pi, s4, c4, y- > mr, y- > mi, sl, cl) M~PLY(t3, t7, t4, t8, z- > pr, z- > pi, s5, c5, z- > mr, z- > mi, s3, c3) B~FLY(r5, r6, t3, t1, t4, t2, t7, t5, tl, t3, t2, t4, t5, t7, t6, t8) B~FLY(w- > mr, u- > pr, u- > mi, u- > pi, x- > mr, x- > pr, x- > mi, x- > pi, r1, r5, il, t4, r2, t3, i2, t7) B~FLY(y- > mr, y- > pr, y- > pi, y- > mi, z- > mr, z- > pr, z- > pi, z- > mi, r3, t2, i3, r6, r4, t5, i4, tl) q += inc; r += inc; s += inc; w -= inc; x -= inc; y -= inc; z -= inc; /* This is the main function for the 2 x 2 PM DIT DFT structure for complex-valued data. It calls the various functions to compute the DFT of the complex-valued input data.

The header file and the functions are included in the include statements". These statements are not needed if all the functions are in the same file. */ &num;include "pm~2xv.h" &num;include "in~put" &num;include "out~put" &num;include "odd~cyc" &num;include "make~vec" &num;include "dit~pm2x2" main() /* main program for 2x2 PM DIT algorithm */ in~put(); make~vec(); dit~pm2x2(); out~put();

Claims

CLAIMS 1. A set of computational structures, designated as the plus-minus (PM) computational structures, comprising the interconnection of basic computing units, designated as butterflies, for hardware and software implementation of the processes to transform the real-valued or complex-valued one-dimensional or two-dimensional discrete signals from the time-domain to the frequency-domain and vice-versa, the said PM structures comprising two parts: in the first part element vectors formed from the N samples of the given discrete signal, where 2 < IL < log2 N; and in the multistage second part each butterfly operating on 2 element input vectors producing 2 element output vectors.
2. The computational structures of claim 1 realizing the discrete Fourier transform, here after referred to as the DFT, and the inverse discrete Fourier transform, hereafter re ferred to as the IDFT, of real-valued or complex-valued input signals, through a family of radix-2 decimation-in-time and decimation-in-frequency algorithms, designated as plus-minus (PM) algorithms.
3. The computational structures of claim 1 comprising the process of forming element input vectors in the said first part of the said structures by computing point DFTs or IDFTs from the N-point input data.
4. The computational structures of claim 1, wherein the number of bit-reversal operations in the said first part or in the last stage of the said second part of the said structure is N to arrange the data in the required sequence at the input or output, respectively.
5. The computational structures of claim 1, wherein for a naturally-ordered input data, an in-place computation produces a naturally-ordered output data without requiring independent data swapping operations.
6. The computational structures of claim 1, wherein the number of array-index updatings is directly proportional to
7. The computational structures of claim 1 wherein the said second part comprising an interconnected network of several stages each built with butterflies whose input-output relationship comprises the computation of it 2-point DFTs or IDFTs.
8. The computational structures of claim 1, wherein the vector samples at the two input nodes of a butterfly are separated by a number of samples that ranges from 1 to from stage to stage of the said second part of the said structures.
9. The computational structures of claim 1 comprising the feature of combining the multiplication operations of two or more consecutive stages.
10. The computational structures of claim 1, wherein the values from an output node of a butterfly resulting from an in-place computation of each 2-point transform are stored at consecutive memory locations.
11. The computational structures of claim 1 comprising Figures 1 through 21 of the dis closure.
12. The computational structures of claim 1 comprising the software implementations described in the appendix section of the disclosure.
i3. Modifications of the computational structures of claim 1 into structures with each butterfly handling vectors of more than one size.
14. Modifications of the computational structures of claim 1 into structures with butter flies handling vectors of different sizes in different stages.
15. Extension of the computational structures of claim 1 to any positive integer value of the data length N and of any vector length it less than N.
16. Extension of the computational structures of claim 1 to higher-radix, mixed-radix, or split-radix structures.
17. Extension of the computational structures of claim 1 to real-valued or complex-valued signals of more than two dimensions.

Amendments to the claims have been filed as follows
18. A radix-2 decimation-in-time fast-Fourier-transform butterfly arithmetic unit operating on four complex input signals designated first, second, third, and fourth complex input signals, comprising: (a) a first complex multiplier circuit, fed by said third complex input signal and a complex twiddle factor signal, for providing a first multiplier outs lot signal indicative of the product of said third complex input signal and said complex twiddle factor signal; (b) a second complex multiplier circuit, fed by said fourth complex input signal and said complex twiddle factor signal, for providing a second multiplier output signal indica tive of the product of said fourth complex input signal and a 90 degrees phase-shifted version of said complex twiddle factor signal; (c) a first pair of plus-minus units, fed by said first complex input sign: and said first multiplier output signal, for providing a first and a second complex butterfly output signal indicative of the sum and difference of said first complex input signal and said first multiplier output signal; and (d) a second pair of plus-minus units, fed by said second complex input signal and said second multiplier output signal, for providing a third and a fourth complex butterfly output signal indicative of the sum and difference of said second complex input signal and said second multiplier output signal.
19. A radix-2 decimation-in-frequency fast-Fourier-transform butterfly arithmetic unit operating on four complex input signals designated first, second, third, and fourth complex input signals, comprising: (a) a first complex multiplier circuit, fed by said second complex input signal and a complex twiddle factor signal, for providing a first multiplier output signal indicative of the product of said second complex input signal and said complex twiddle factor signal; (b) a second complex multiplier circuit, fed by said fourth complex input signal and said complex twiddle factor signal, for providing a second multiplier output signal indica tive of the product of said fourth complex input signal and a 90 degrees phase-shifted version of said complex twiddle factor signal; (c) a first pair of plus-minus units, fed by said first and said third complex input signal, for providing a first and a second complex butterfly output signal indicative of the sum and difference of said first and said third complex input signal; and (d) a second pair of plus-minus units, fed by said first and said second multiplier output signal, for providing a third and a fourth complex butterfly output signal indicative of the sum and difference of said first and said second multiplier output signal.
20. A radix-2 fast-Fourier-transform butterfly arithmetic unit operating on four complex input signals designated first, second, third, and fourth complex input signals, comprising: (a) a first complex multiplier circuit, fed by said first complex input signal and a first complex twiddle factor signal, for providing a first multiplier output signal indicative of the product of said first complex input signal and said first complex twiddle factor signal; (b) a second complex multiplier circuit, fed by said second complex input signal and a second complex twiddle factor signal, for providing a second multiplier output signal indicative of the product of said second complex input signal and said second complex twiddle factor signal; (c) a third complex multiplier circuit, fed by said third complex input signal and a third complex twiddle factor signal, for providing a third multiplier output signal indicative of the product of said third complex input signal and said third complex twiddle factor signal; (d) a fourth complex multiplier circuit, fed by said fourth complex input signal and a fourth complex twiddle factor signal, for providing a fourth multiplier output signal indicative of the product of said fourth complex input signal and said fourth complex twiddle factor signal; (e) a first pair of plus-minus units, fed by said first and said third multiplier output signal, for providing a first and a second complex butterfly output signal indicative of the sum and difference of said first and said third multiplier output signal; and (f) a second pair of plus-minus units, fed by said second and said fourth multiplier output signal, for providing a third and a fourth complex butterfly output signal indicative of the sum and difference of said second and said fourth multiplier output signal.
21. A radix-2 decimation-in-time fast-Fourier-transform butterfly arithmetic unit comprising the device of claim 20 in which the magnitude of the phase difference between said first and said second complex twiddle factor signal is 45 degrees.
22. A radix-2 decimation-in-frequency fast-Fourier-transform butterfly arithmetic unit comprising the device of claim 20 in which the magnitude of the phase difference between said first and said third complex twiddle factor signal is 45 degrees.