GB2276960A

GB2276960A - Processor and method for DFT computation

Info

Publication number: GB2276960A
Application number: GB9307465A
Authority: GB
Inventors: Keith John Jones
Original assignee: GEC Marconi Ltd
Current assignee: BAE Systems Electronics Ltd
Priority date: 1993-04-08
Filing date: 1993-04-08
Publication date: 1994-10-12
Anticipated expiration: 2013-04-08
Also published as: GB2276960A8; GB2276960B; GB9307465D0

Abstract

An N-point DFT is calculated using arrays 2, 3 of processing elements PE wherein outputs and partial outputs are calculated recursively, advantageously using Cordic arithmetic, to permit systolic computation. More generally, M arrays each of M dimensions may be used, each dimension having the Mth root of N processing elements along it. Partial DFT outputs (e.g. row and column outputs) are produced by processing along the dimensions in respective different arrays (e.g. 2 and 3). Each final DFT output or partial DFT output is produced by the addition of the neck individual input data or partial DFT output to the output of the previous recursion and multiplication by the same complex experimental term or, in the case of the last recursion for all arrays except the Mth array, by a different complex exponential term to take account of the required twiddle factor. The invention permits improved utilisation of processing elements and avoids the need for global communication of data across the arrays. <IMAGE>

Description

2276960 - lk 1 PROCESSOR AND METHOD FOR DFT COMPUTATION This invention

relates to a processor and method for computation of a discrete Fourier transform (DFT).

An N-point discrete Fourier transform (DFT) is defined by:

N-1 X[k]=l x[n]WN n=O nk where:

WN= exp {-i21r/N} the N-th complex root of unity.

k = 0,1,... N-1 For example, a four point DFT would be expressed as follows:

3 X [k] = 1 x [n] exp {-i2nn k/4} n=0 The output X [0] for k=0 would simply be:

X [0] = x [0] + x [1] + x [2] + x [3] The output X [1] for k=l would be:

X [1] = x [0] + x [1] exp {-i2n/4} + x [21 exp {-i2n.2/4} + x [3] exp {i2n.3/4} Similarly the outputs X [2] and X [3] each consist of four terms.

..M .. (2) P/8996/MUT 2 The data x could be samples of a time-varying function, and the outputs will represent the frequency content of the function. X[01 then represents the d.c. (zero frequency) term. X[1] represents its first harmonic. When the outputs have been evaluated, the four values would represent the frequency content of the time samples at four-spot values of frequency. Equally, the data x could be frequency values and the output would represent time samples. More generally, the data x could represent any function.

Referring to Figure 1, a nine-point DFT could thus be produced by a linear array of nine processing elements 1-9. The data x [8] to x[O] could be fed along the array, and each processing element could produce one output X[O] to X[81. It will be apparent that all the data samples will need to be fed through each processing element in order for that processing element to produce an output. If the data was fed linearly, eight time steps would elapse before the fast processing element received its first data sample, and sixteen would elapse before it received its last data sample. It would thus produce its last output after seventeen time steps.

In the interests of better utilisation of processing elements, it has been proposed to re-order the one-dimensional input data via a lexicographic mapping into the form of a two-dimensional array of size N by N, where N is a positive integer. For example, referring to Figure 2, the nine-point data of Figure 1 could be fed to a two-dimensional array of processing elements 1 -9. The data x[O], x[l],... x[8] could il P/8996/MUT 3 be mapped into data of the form x [ij], where Q = 0,1,2, as follows:

X[0,0] X[1,0] x[2,0] X[O, 1] X[1,1] x[2,11 x[0,21 x[1,2] x[2,2] Thus, the data of the first row is fed into the first row of processing elements 1-3, at the same time as the data of the second and third rows is fed into the second and third rows of processing elements, respectively.

In general, the original one-dimensional DIFT of equation 1 reduces, via the application of this data re-ordering, to the two-dimensional formulation given by:

N-1 N-1 X[kI,k2] = 1 [ 1 [x [ni,n21 WN 1 WN 1 WN nj =0 n2 =0 MA n2k, nlkl 1 [y [nl k2IWN 1 W4N nj =0 where:

N-1 y[n,,k2] = 7, x[nj,njW-M fl =0 for,k2= 0, 1,..., N-1.

n rlk2 n2k, nlkl .. (3) .. (4) .. (5) 1 P/8996/MUT 4 Thus, the DFT output can be produced by performing a set of row-DFTs (equation (5)), followed by a set of column-DFTs (equation (4)). It will be noted however that before the latter can be performed, a pointwise multiplication of the rowqkl DFT outputs must be performed to take account of the exponential term n. These terms are referred to in the literature as "twiddle factorsu.

Reverting to the three by three processing element array in the example of Figure 2, the three rows of data are fed into the three rows of processing elements. The partial DFT outputs (the row-DFT outputs although clearly the vertical lines of processing elements could be regarded as rows) are produced when the three data samples have passed through the three processing elements 3,6,9 i.e. after five processing steps. Only when these steps have taken place and each processing element 1 -9 holds a partial (row) DFT output can pointwise multiplication by the twiddle factors take place.

Then column-DFTs can be performed. Each processing element of each column requires the three row-DFT outputs as modified by the twiddle factors. For this reason, intermediate memories M1-3 must be provided. The modified partial outputs are transferred in columns to the respective memories, and the computation of the column DFTs can then commence.

These column DFT outputs are the required DFT outputs (no twiddle factors 1 P/8996/MUT being necessary for the columns). However, this is in two-dimensional form and lexicographic mapping is necessary to place the output in one- dimensional form.

The complete process may be summarised as in Figure 3. It is, apart from the data re-ordering, an essentially three-stage process, namely:

a partial-DFT, as given by equation 5 (with r varying for fixed nj, referred to as the "row-DFT" stage, comprising N DFTs each of length ^ followed by (b) a pointwise multiplication of the partial-DFT output of equation 5 to rbkl account for the twiddle factors, WN, followed by a second partial- DFT, as given by equation 4 (with R varying for fixed k2), referred to as the "column-DFT" stage, comprising N DFrs each of length ^ to give the final DFT output.

While there is better utilisation of processing elements than for a linear array of processing elements, the utilisation is still far short of optimum, and there are global communication problems. Thus, in the simple example of Figure 2, five time steps are required for all the row- DFT outputs to be calculated (more generally (2N-1)), and the same for the column-DFT outputs. The processing elements are therefore only utilised roughly 50% of the time.

P/8996/MUT 6 Also, as will be apparent by referring to the output X[1] for the four point DFT referred to above, each processing element 1-9 needs access to a number of complex exponential terms (the number depending on the size of the array), and there is consequently a communication problem. In one prior proposal (L.W. Chang and M.Y. Chan, NA New Systolic Array for Discrete Fourier Transform", IEEE Transactions, ASSP, Vol-36, No-10, October 1988, pages 1665-1666), the complex exponential terms of the Fourier matrix are pipelined through the processor array together with the DFT input data, and buses are required for implementation of the column-DFT process. Because of the size of the processing elements required, the array is only capable of computing small DFTs.

The inventor has devised a novel form of calculating DFTs which permits improved utilisation of processing elements and obviates the need for global communication or transmission of data across the array.

The inventor has appreciated that, for each processing element, the series of complex exponential terms can be replaced by a single exponential term which is applied several times in a recursive manner (e. g. in the equation for X[1], only exp{-i2n/4} needs to be stored). Further, the twiddle factors can be taken account of by using a different exponential term on the last recursion. Thus, each processing element now only needs to store two complex exponential terms for the row-DFTs and a single exponential term for the column-DFTs.

1 P/8996/MUT 7 Thus, each partial-DFT coefficient (for either row-DFT or column-DFT stages), denoted Y[k], can be simply computed as the inner product of the one-dimensional complex data vector, denoted (y[n]j, and the appropriate row of the qN x qN Fourier matrix:

0 k (44-1)k Y1k] = Y[O] W4N + Y11M4N ±--+ y [N-1]W-JN which can in turn be regarded as a N-stage recursive equation:

Y[k](0) = 0 li-llk .. (6) Y[k](j) = Y[k](j) + yg-l]WN j = 1,2,., N... (7) where the operation associated with each recursion is the same, namely that of complex multiplicationlaccumulation.

This expression can be further simplified, however, by noting that all 4N terms of the k-th row (or column) of the Fourier matrix can be simply generated with the single complex exponential term:

k W4N= exp{-i2knlN} .. (8) where the initial term of each row (or column) is given by 1. The significance of this property is evident when the k-th partial-DFT coefficient is recursively computed from the data by Horner's rule [3]:

P/8996/MUT 8 Y[k] = (... ((Y[O1 a + y [N-1]) a + y [N-2]) cc ±--+ y[l]) a .. (9) where a corresponds to the exponential term of equation 6, as it suggests that only one complex exponential term need be pre-computed and stored for each partial-M coefficient.

Note, however, that as ()e = a VN = 1, the term y[O] is fed first, rather than fast, into the above recursive equation, ensuring the operation associated with all N recursions is the same, namely that of complex accumulation/multiplication, with the fixed complex multiplier being the same component of the Fourier matrix for each recursion.

To take an example, X[1] referred to above can be calculated in four recursions X(')[1] -)0[1], where X)[1] is initialised to zero and a is exp{A2n14}, viz., X(O)[1] 0 X(')[1] (X()[1] + X[O]) a X(% 1] = (X(')[ 1] + x[3]) a X(3)p] = (X(2)p] + x[2]) a X(4)p] = (X(3)p] + X[g (X .. (10) The operation performed in each step is the same, namely, addition of the next input P/8996/MUT 9 data sample to the previous recursion output, followed by multiplication of the same complex exponential term. Two points are to be noted. First, x[O] is multiplied by d, which is possible since exp{-i27e 4/4} is unity. Second, the multiplication of the final output for X[1] by the twiddle factor, can be accomplished by multiplying by P rather than a in the last line above.

The invention provides a processor for computing an N-point DFT from input data of length N, comprising a qN by qN array of processing elements, N being a positive integer, for producing row-DFT outputs, modified by respective twiddle factors, from the input data re-ordered to a corresponding qN by qN array, a qN by qN array of processing elements for producing column-DFT outputs from the modified row-DFT outputs, wherein each individual output of the modified row-DFTs and column DFTs is produced recursively by the addition of the next individual re-ordered input data, or modified row-DFT output, respectively, to the output of the previous recursion and multiplication by the same complex exponential term or, in the case of the last recursion for the row-DFT outputs, by a different complex exponential term to produce the modified row-DIFT outputs, the modified row-DFTs and column-DIFTs being performed in different 4N by 4N arrays of processing elements.

The elimination of the multiplication of the row-DFT outputs by the twiddle factors as a separate step applying to the row-DFT outputs only and not to the column-DFT outputs, and the provision of separate arrays for computing the row and P/8996/MUT column-DFT outputs, permits improved flow of data through the processor to be accomplished.

For an efficient implementation of the recursion of equation 10, it should be noted that as the components of the Fourier matrix are complex exponentials, it is possible to carry out the complex multiplications within the inner product operation, via a conventional fast multiplier or as simple phase rotations with bit-recursive CORDIC arithmetic.

The conventional CORDIC technique involves the implementation of a coordinate rotation via a combination of computational and table look-up techniques. The key to the technique is the decomposition of the phase rotation angle into the sum of several smaller rotation angles, each of which can be simply implemented by means of additions and shifts. This is normally followed by the application of a scaling factor to correct for the ensuing magnification.

The invention also provides a method of computing an N-point DFT from input data of length N, using a N by N array of processing elements, N being a positive integer, for producing row-DFT outputs, modified by respective twiddle factors, from the input data re-ordered to a corresponding N by N array, and using a N by N array of processing elements for producing column-DFT outputs from the modified row-DFT outputs, the method comprising producing each individual output of the P/8996/MUT 11 modified row-DFTs and column DFTs recursively by the addition of the next individual re-ordered input data, or modified row-DFT output, respectively, to the output of the previous recursion and multiplying by the same complex exponential term or, in the case of the last recursion for the row-DFT outputs, by a different complex exponential term to produce the modified row-DFT outputs, the modified row-DFTs and columnDFTs being performed in different 4N by 4N arrays of processing elements.

More generally, the invention is applicable to a processor and a method for computing an N-point DFT from input data of length N using aNN by NN by NN array 3 of processing elements, qN being a positive integer, or in the most general terms, a NN by NN by NN M times, where NN is a positive integer. With the M- dimensional solution, M arrays of processing elements are required with the twiddle factors between the (m-1)-th and m-th partial-DFT processors, for m = 2, 3. M, being built into the last recursion of each partial-DFT coefficient of the first M-1 partial-DFT processes.

A processor and method for computation of a DFr constructed in accordance with the invention, will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 4 shows the architecture of the processor in schematic form; P/8996/MUT 12 Figure 5 shows a processing element in schematic form; and Figure 6 shows the detailed implementation of a processing element.

Referring to Figure 4, the processor comprises an input buffer 1, a rowDFT cordic processor array 2, a column-DFT cordic processor array 3, and an output buffer 4.

The input buffer 1 holds one-dimensional input data at length N (e.g. 1024 samples) which has been re-ordered via lexicographic mapping into the form of a twodimensional array of size N by M Each individual item of data, including guard bits and expansion bits, is L bits in length (e.g. 32 bits).

The output buffer 4 holds the DFT output in two-dimensional form which requires inverse mapping to produce a one-dimensional DFT output.

In a first time step, the first column of input data in the input buffer 1 is transferred to the first column of processing elements (PEs) of the row-DFT processor array 2. In a second time step, the first column of input data is transferred to the second column of processing elements, while also being held in the first column of processing elements, and so on.

j P/8996/MUT 13 Consider, by way of example, the case of a 9-point DFr. A 3 x 3 array 2 of PEs is assigned to the row-DFT stage and a 3 x 3 array 3 of PEs to the column-DFT stage. Denoting the PE in the i-th row and j-th column by PE[Q], after 3 time-steps, (where one 'time-step" corresponds to the amount of processing time required to carry out an accumulation/rotation operation) the partial-DFT coefficients computed by PE[1,1], PE[2,1] and PE [3,1] will have been produced and therefore be ready for pipelining into the linear array 3 comprising PE[4,1], PE[5,1] and PE[6,1]. After a total of 6 time-steps the final DFT output of PE[4,1] will be available for pipelining out of the array and into the output data buffer. After a total of 7 time-steps the same will be true for PE[5,1], and after a total of 8 time-steps the same will be true for PE[6,1].

In a similar fashion, the partial-DFT coefficients computed by PE[1,2], PE[2,2] and PE[3,2] will have been produced after a total of 4 time-steps, and those computed by PE[1,3], PE[2,3] and PE[3,3] after a total of 5 time-steps. The outputs of PE[1,2], PE[2,2] and PE[3,2] are pipelined into the linear array comprising PE[4, 2], PE[5,21 and PE[6,2] whilst the outputs of PE[1,3], PE[2,3] and PE[3, 3] are pipelined into the linear array comprising PE[4,3], PE[5,3] and PE[6,3].

After a total of 7 time-steps the final DFT output of PE[4,2] will be available, after a total of 8 time-steps the same will be true for PE[5, 2], and after a total of 9 time-steps the same will be true for PE[6,2]. Finally, after a total of 8 time-steps the output of PE[4,3] will be available for pipelining out of the array and into the output P/8996/MUT 14 data buffer, after a total of 9 time-steps the same will be true for PE[5, 3], and after 10 time-steps the same will be true for PE[6,3].

Therefore, after 10 time-steps the final output for the first 9-point DFT will have been computed.

The input buffer 1 may contain further 3 x 3 arrays of mapped 9-point input data. These may be fed into the row-DFT processor array 2, a column at a time, each time step. Thus, the processor computes successive DFTs, without interruption. That is, after a further 3 time-steps the final output for a second 9-point DFT will have been computed; after a further 3 time-steps the final output for a third 9-point DFT will have been computed, etc.

In general, the 4N-point DFTs of the row-DFT stage are implemented in parallel via qN linear arrays each comprising N PEs, whilst the 4N-point DFTs of the columnDFT stage are implemented in parallel via 4N linear arrays each comprising 4N PEs.

A processor array of N x N PEs is therefore assigned to the computation of the row-DFT stage, and a second processor array of N x N PEs assigned to the computation of the column-DFT stage, as shown in Figure 4. The input to each rowDFT linear array is the appropriate row of data in the input buffer, which is pipelined through all N PEs in the row-DFT linear array. The input to each column-DFT linear i P/8996/MUT array is the partial-DFT output from the appropriate column of PEs in the row-DFT processor array, which is pipelined through all N PEs in the column-DFT linear array.

In accordance with the invention, the twiddle factors, which have to be applied to the row-DFT output before the row-DFT output can be fed into the column-DFT processor array, are built into the row-DFT stage via Horner's rule:

Y[k] = (... ((Y[O1 a + y [N-1]) a + y [N-21) a +... + y[l]) a where:

.. (11) 0 -- O WN(k) with WN(k) the twiddle factor corresponding to the k-th partial-DFT coefficient of the row-DFT being computed, so that the twiddle factor is applied via the last CORDIC rotation of each row-DFT coefficient, at no additional expense in terms of the overall timecomplexity.

For the case of an N-point DFT, a total of 4 N pins are required for the pipelining of data into and out of the processor array, whilst all components of the original N x N Fourier matrix are pre-computed, encoded via CORDIC arithmetic, and appropriately distributed in local memory across the arrays 2, 3. Each row-DFT PE requires two components in local memory, this including one component for application of its corresponding twiddle factor, and each column-DFT PE requires just one component. Only one encoded scale factor need be stored by each PE.

P/8996/MUT 16 All final DFT outputs must therefore reach the output data buffer 4 within a single time-step in order that internal registers of each PE are available for holding the next set of final DFT outputs.

Timing information is needed for synchronising the systolic operation of the array, this being distributed by means of the H-tree scheme shown in Figure 4 which ensures the timing source is equi-distant from all PEs within the array, thus minimising clock skew. Control information, which comprises 1 bit per PE, is also distributed across the array via the same H-tree scheme, this bit proving sufficient for correct sequencing of the PE's internal operations.

Figure 5 shows the high-level description of each processing element (PE). For the case of a row-DFT PE (see Figure 5), the incoming data from the input data buffer is pipelined horizontally (eastwards) into registers of the PE, H-REGR and H-REGI, added into accumulation registers, A-REGR and A-REGI, which then undergo rotation and scaling. When the partial-DFT coefficient is computed it is transferred from the accumulation registers to registers, V REGR and V-REGI, ready for pipelining vertically (southwards) into the column-DFT processor array. The accumulator is then available for the next data set corresponding to the next DFT to be computed.

For the case of a column-DFT PE (see Figure 5), the incoming data from the row-DFT processor array is pipelined vertically (southwards) into registers, V-REGR P/8996/MUT 17 and V_REGI, added into accumulation registers, A-REGR and A_REGI, which then undergo rotation and scaling. When the final DFT coefficient is computed it is transferred from the accumulation registers to registers, H-REGR and H-REGI, ready for pipelining horizontally (westwards) out of the column-DFT processor array and into the output data buffer 4, one bit every micro-cycle (defined hereinafter). The accumulator is then available for the next data set corresponding to the next set of row-DFT outputs.

Figure 6 shows the logic-level description of each processing element (PE). It will be thus apparent that each PE includes two "input" registers, one real and one imaginary, either for receiving input data pipelined horizontally in the case of a rowDFT PE, or for receiving modified row-DFT output data pipelined vertically in the case of a columnDFT PE. These two input registers, denoted as H-REGR and H-REGI, and VREGR and V-REGI in Figure 5, are denoted as XRin and Xlin in Figure 6. Each PE also includes accumulation registers A-REGR and A-REGI for real and imaginary components of the DFT outputs. These are denoted AR Register and AI Register in Figure 6.

Each PE also includes local memory, denoted cordic memory in Figure 5 and not shown in Figure 6, which stores, encoded in Cordic arithmetic, two complex exponential terms for each row PE and one complex exponential term for each column PE. The terms are thus stored as a sequence of 1 s and Os, denoted {d j in Figure P/8996/MUT 18 6. A brief description of Cordic arithmetic is given in OThe CORDIC Trigonometric Computing Technique" byJ.E. Voider, IRETrans. on Electronic Computers, 1959, EC8, (3), pp 330-334.

The requirements of each PE may be illustrated by imagining the computation j of the first harmonicX[1] of the first rowof a 4x4 DFT. In the first time step, the first PE of the first row receives an input data sample x[O] which must be multiplied by a complex exponential term (following equation 10 an angle (x) to produce a product. In the second time step, the first PE of the first row receives a second input data sample x[3]: this must be added to the product, and multiplied by the same complex exponential term. In the third time step, the first PE of the first row receives a third input data sample x[2]: this must be added to the accumulated product and multiplied by the same exponential term. In the fourth time step, the first PE of the first row receives a fourth input data sample x[4]: this again must be added to the accumulated product, but this time multiplied by a different complex exponential term 0.

The input data samples are complex, and are consequently entered in two registers XR, Xl, one for the real part and one for the imaginary part.

The accumulated complex term to which the next incoming data samples are added are stored in registers AR, AI, one for the real component, one for the imaginary component.

P/8996/MUT 19 The real and imaginary parts of the accumulated complex term and the next incoming data samples are added, respectively, in full adders FA1, FA2. Each real and imaginary part could be several e.g. 32 bits long. However, they are added bit by bit as each n-bit word is fed through the respective full adder.

The multiplication by the complex exponential term is carried out by CORDIC arithmetic. Such a multiplication is equivalent to a rotation of the relevant twodimensional vector through a certain angle. The multiplication process can be simplified by notionally breaking it into a series of small-angle rotations which together make up the angle. By choosing a series of appropriate small angles viz tarl' (20), tan-' (2-'), tan-' (2-2) etc, it turns out that each small-angle rotation of an n-bit complex word can be performed by shifting the conjugate of that word by several bits, and then adding it to, or subtracting it from, that n-bit complex word. It follows that phase rotation of the n-bit complex word by a given angle can be encoded into a series of additions or subtractions of the conjugate of the complex data word resulting from the latest smallangle rotation, each time bit-shifted by a further place.

The bit shifting is performed by switches (3) in Figure 6. The series of 1 s and Os corresponding to additions and subtractions is held in the cordic memory as the sequence {d J.

An unwanted side effect of such successive small angle rotations is that the P/8996/MUT resultant vector becomes scaled by some factor. This scaling is cancelled by similar additions to and subtractions from the n-bit word, the appropriate series of successive bit shifted versions, this time not of the conjugate of the word but the word itself.

Each bit by bit addition takes (in the full adders) what may be termed a microcycle. It follows that a time step must exceed N x L microcycles.

The conventional CORDIC algorithm is written as:

xr,., 8i 2 xr, Xii+l Xii .. (12) where (xr,, xi,), a complex number with real component x and imaginary component xi,, is the output of the i-th iteration and (xr,+,, A +,) the output of the (i+l)-th iteration, and where A equals -1 or +1. The number of CORDIC iterations determines the precision of the rotated output. The magnification factor, which is given by:

L-1 H 1 -(2")2... (13) k=0 for the case of L-bit data, is approximately 1.6 per rotation for a wordlength of 12 or more bits.

An additional 2 iterations are included in order that the convergence of the CORDIC algorithm be extended from the conventional range of convergence, namely -1.7433 to +1.7433 radians, to the range -n to +n radians, as required for arbitrary 1 P/8996/MUT 21 phase rotation angles. When these additional iterations are applied, the associated magnification factor increases to a value of approximately 3.2 per rotation.

The CORDIC rotation is carried out, in terms of minimum hardware requirements, via the bit-serial circuitry of Figure 6 using (apart from the registers) 2 single-bit full adders (FA1, FA2),2 logical EXCLUSIVE-OR circuits (X0R1, X0R2) and 2 logical AND circuits (AND1, AND2). In addition, by expressing the scaling factor, Kc, in the form:

L KC = rI (1 - K 2') k=l where P. = 0 or 1, the output of the i-th iteration can be scaled, with scaled output (xr!, A j), and written as:

xr,' xri - P, 2-' xr, xi,l xi, - pi 2 4 xi,... (15) thus enabling the scaling to be interleaved with the iterations of the CORDIC rotation and implemented with the same bit-serial circuitry.

.. (14) For a description of how this circuitry works, suppose firstly that the accumulator, with real and imaginary components represented respectively by the registers AR and AI, already contains accumulated data, denoted ar and ai, respectively, this to be added into, rotated and scaled. When the switches labelled (1) are set in the upward position, the 2-s complement form of the data, xr and xi, to

P/8996/MUT 22 be added into the accumulator, are shifted into their respective registers, XR and Xl, one bit at a time and least significant bit (LSB) first. This corresponds to input data being shifted eastwards or modified row-DFT output data being shifted southwards.

When the most significant bit (MSB) is shifted in, the switches labelled (2) are set in the upward position so that the sign-bits of the accumulator are latched into the sign-bit registers denoted AR-sign and AI-sign. These sign-bits are used to signextend the values of ar and ai when they are bit-shifted.

After the values of xr and A have been shifted into their respective registers, the switches labelled (1) are set to the downward position. Full adder FA1 is then used to add: serial versions of xr and ar toproduce an updated value of ar (accumulation); serial versions of ar and 2-' ai to produce an updated value of ar (rotation); and serial versions of ar and 4 ar to produce an updated version of ar (scaling).

Similarly, full adder FA2 is used to add: serial versions of A and ai to produce an updated version of ai (accumulation); serial versions of ai and -8,2 ar to produce an updated version of ai (rotation); and serial versions of ai and -2'ai to produce an updated version of ai (scaling).

The accumulation operation is first carried out, with the values of xr and xi P/8996/MUT 23 being fed back to the left ends of their registers XR, Xl whilst they are being added into the accumulator AR, AL All four registers, corresponding to both the data and the accumulator, are simultaneously right-shifted (so serial addition, one bit at a time, takes place) with the output of the FAs being fed back to the left ends of the accumulation registers.

To create bit-shifted versions of the ar and ai terms, the bits of lower significance are deleted and replications of the sign-bit are extended to the left of the MSB location to preserve the wordlength. The bits of lower significance are truncated.

Suppose each register is of length L (this corresponding to the length of the data added to the number of guard bits and expansion bits) and that the CORDIC operation is to be carried out via 2K iterations, i.e. K smallangle rotations and K scaling operations producing K-bit precision output. Both the rotation angle and scaling factor are encoded via the sequence of binary signals {dJ, where for the rotation operation:

d j = 0 corresponds to 3, = -1, and d 1 = 1 corresponds to 8, = +1 whilst for the scaling operation:

d j = 0 corresponds to 0, = 0, and d j = 1 corresponds to P j = 1.

(see equation 12) (see equation 15) P/8996/MUT 24 For the iteration corresponding to the small-angle rotation:

ar = ar + 8,2 'a! ai = ai - 8i2-2ar... (16) the switches indicated with a (3) in Figure 6 are closed and remain closed for each of the L micro-cycles of this iteration (where one Omicro- cyclen corresponds to the amount of processing time required to produce one bit of the data pair (arl, aii.1) from (ar,, ai,). All other switches to the right and to the left of the switches labelled (3) which are in parallel with these switches are left open. This permits only one shift- register output to have access to each signal line at any given time.

On the first micro-cycle of this iteration, the bit located 2 places to the left of the LS13 in the AR (AI) register is enabled onto the lower line and sent to the A-input of the FA2 to the right of the AI (AR) register. Before reaching FA2, however, there is a possible negation to take place, depending upon the value of 8 j. If 8 j +1, then 2-2ar (2 2ai) must be subtracted from ai (added to ar). However, if 8i -1, then 2 2 ar (2 -2ai) must be added to ai (subtracted from ar).

If 8 j = -1, then d j= 0 and the XOR gate X0R1 (X0R2) connected to the Ainput of the FA circuit FA1 (FA2) will pass the value of 2"2ar (2 ' ai) unchanged but will take the 1 -s complement of 2 -2 ai (2 -2 ar). In this situation, it is necessary to add 2 ' ar to ai (2 ' ai to ar) and to subtract 2 -2 ai from ar (2 -2 ar from ai). If, however, 8 1 = +1, then d j = 1 and the XOR gate X0R1 (X0R2) will take the 1 -s complement P/8996/MUT of 2-2ar (2 -2ai) but leave the value of 2 ' ai (2 ' ar) unchanged.

The carry-in to the LS13 of the FAs, FA1 and FA2, is 0 if an addition is taking place. In the case of a subtraction, ai (ar) is added to the 1-s complement of 2-2ar (2-2ai) with 1 as a carry-in to the LS13 position. This operation is equivalent to the sum of ai (ar) and the 2-s complement of 2-2ar (2 -2 ai). During the other microcycles of each iteration, the carry-out of the previous micro-cycle is used as the carryin for the current micro-cycle.

Note that the position of the switch labelled (3) in Figure 6 corresponds only to bit shifting by two places (i.e. multiplication by 2-2). For shifting by three places, for example, the switches one place further to the left would be utilised.

The 2 x 2 switch labelled (4) in Figure 2 is used to alternate the function of the circuitry between that of rotation and scaling. The AND gates (AND1 and AND2) to the right of the XOR gates ensure that the output of the XOR gates pass unchanged to the FAs for the case of the small-angle rotations, but that they are zeroised, when pi = 0 (from equation 13), for the case of the scaling operation).

Note that the inverter 5 in Figure 6 ensures that -8, is used for the updating of the imaginary component of the output of each small-angle rotation and that -0, is used for the scaling operation.

1 P/8996/MUT 26 Also, the latch 6 containing the binary 1 is used as input to the AND gates for the rotation iterations, whilst the d, terms are used as input to the AND gates for the scaling iterations, enabling the same circuitry to be used for both the rotation and the scaling iterations. The d, and terms (the inverter output) are also fed as input to the two FAs.

In summary, the novel features of the processor are:

(a) achievement of 0 (N) time complexity (in fact, N outputs every N timesteps) using nearest-neighbour communication only (i.e. no global communication or transmission of data across the array) and 100% utilisation of the PEs; (b) elimination of the twiddle factors between the row-DFT and column- DF17 stages, this achieved by incorporating them into the last recursion of the computation of the row-DFT outputs via Horner's rule; (c) elimination of intermediate memory, the row-DFT output being pipelined directly into the column-DFT processor array; and (d) design of a novel bit-serial CORDIC PE for implementation of both the accumu lation/rotation and scaling operations, making the proposed processing scheme actually realisable, for potentially large DFTs, in very-large-scale-integration (VI-SI) P/8996/MUT 27 technology.

Of course variations may be made without departing from the scope of the invention. Thus, the multiplications need not be carried out using CORDIC arithmetic. More importantly, the invention is not restricted to the processing of an N-point DFT using data re-ordered into a qN by qN array and using a qN by qN array of processing elements for processing the rowDFTs and another for processing the column-DFTs. Thus, the invention is applicable to re-ordering the data into a 34N by34N by 34N array, and processing partial-DFT outputs along orthogonal axes of threeNN by NN by IVN arrays of processing elements. For a 1024 point DFT, thirty-two time steps are required for producing 1024 outputs with a square array, but only around 10 time steps are needed for the same number of outputs with a cubic array. The throughput is increased by a factor of 3 for an increase in hardware of one and a half times. The actual processing for the cubic array would of course be carried out as described above, for the square array. Furthermore the invention is applicable to re-ordering the data in M dimensions of side NN and processing using M such arrays of processing elements. Such a solution would produce N outputs everyNN time- steps at a total hardware cost of MN processing elements.

P/8996/MUT 28

Claims

1. A processor for computing an N-point DFT from input data of length N, comprising a qN by qN array of processing elements, qN being a positive integer, for producing row-DFr outputs, modified by respective twiddle factors, from the input data re-ordered to a corresponding qN by qN array, a qN by qN array of processing elements for producing column-DFT outputs from the modified row-DFT outputs, wherein each individual output of the modified row-DFTs and column DFTs is produced recursively by the addition of the next individual re-ordered input data, or modified row-DFT output, respectively, to the output of the previous recursion and multiplication by the same complex exponential term or, in the case of the last recursion for the row-DIFT outputs, by a different complex exponential term to produce the modified row-DFT outputs, the modified row-DFTs and column-DFTs being performed in different qN by qN arrays of processing elements.

2. A processor as claimed in claim 1, including adders for adding, respectively, the real and imaginary parts of multi-bit words representing the re-ordered input data or modified row-DFT output, and the previous recursion output, one bit at a time as the words are fed serially through the adders.

3. A processor as claimed in claim 1 or claim 2, including registers for accumulating the real and imaginary parts of the recursion outputs.

1 P/8996/MUT 29

4. A processor as claimed in claim 3, including switches to access the registers at a position between its ends for producing bit-shifted versions of the recursion outputs for use in calculating the recursions by Cordic arithmetic.

5. A processor as claimed in claim 1, in which the processing elements are arranged to perform the multiplication using a series of small angle rotations using Cordic arithmetic.

6. A processor for computing an N-point DFT from input data of length N, 3 comprising three NN byNN by3qN arrays of processing elements, N being a positive integer, for producing DIFT outputs from input data re-ordered to a corresponding three-dimensional array of size34N by34N by34N, wherein in use partial DFT outputs are produced by processing along the respective dimensions in the different arrays, the final DFT outputs being produced in the third array, each individual final DIFT output or partial DFT output being produced recursively by the addition of the next individual input data or partial DFr output to the output of the previous recursion and multiplication by the same complex exponential term or, in the case of the last recursion for the first two arrays for producing partial-DFT outputs, by a different complex exponential term to take account of the required twiddle factor.

7. A processor for computing an N-point DIFT from input data of length N, comprising M M-dimensional arrays of processing elements of size NN by NN by NN P/8996/MUT .... M times, NN being a positive integer, for producing DFT outputs from input data re-ordered to a corresponding M-dimensional array, wherein in use partial DFT outputs are produced by processing along the respective dimensions in the M-arrays, the final DIFT outputs being produced in the Mth array, each individual final DFT output or partial DIFT output being produced by the addition of the next individual input data or partial-DFT output to the output of the previous recursion and multiplication by the same complex exponential term or, in the case of the last recursion for all arrays except the Mth array, by a different complex exponential term to take account of the required twiddle factor.

8. A processor for computing an N-point DFT from input data of length N, substantially as herein described with reference to the accompanying drawings.

9. A method of computing an N-point DIFT from input data of length N, using a qN by qN array of processing elements, qN being a positive integer, for producing rowDFT outputs, modified by respective twiddle factors, from the input data re-ordered to a corresponding qN by N array, and using a 4N by 4N array of processing elements for producing columnDFT outputs from the modified row-DFT outputs, the method comprising producing each individual output of the modified row-DFTs.and column DFTs recursively by the addition of the next individual re-ordered input data, or modified row-DFT output, respectively, to the output of the previous recursion and multiplying by the same complex exponential term or, in the case of the last recursion h.

P/8996/MUT 31 for the row-DFT outputs, by a different complex exponential term to produce the modified row-DFT outputs, the modified row-DFTs and column- DFTs being performed in different 4N by 4N arrays of processing elements.

10. A method of computing an N-point DFT from input data of length N substantially as herein described.