CA2428393A1 - Implementation of discrete wavelet transform using lifting steps - Google Patents


Info

Publication number
CA2428393A1
Authority
CA
Canada
Prior art keywords
transform
cascade
digital filters
dimensional
samples
Prior art date
Legal status
Abandoned
Application number
CA 2428393
Other languages
French (fr)
Inventor
Bruce F. Cockburn
Mrinal K. Mandal
Hongyu Liao
Current Assignee
Telecommunications Res Labs
Original Assignee
Telecommunications Res Labs
Priority date
Filing date
Publication date
Application filed by Telecommunications Res Labs filed Critical Telecommunications Res Labs
Priority to CA 2428393 priority Critical patent/CA2428393A1/en
Publication of CA2428393A1 publication Critical patent/CA2428393A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/14Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/148Wavelet transforms

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Compact and efficient hardware architectures for implementing lifting-based DWTs, including 1-D and 2-D versions of recursive and dual scan architectures. The 1-D recursive architecture exploits interdependencies among the wavelet coefficients by interleaving, on alternate clock cycles using the same data path hardware, the calculation of higher order coefficients along with that of the first-stage coefficients. The resulting hardware utilization exceeds 90% in the typical case of a 5-stage 1-D DWT operating on 1024 samples. The 1-D dual scan architecture achieves 100% datapath hardware utilization by processing two independent data streams together using shared functional blocks. The 2-D recursive architecture is roughly 25% faster than conventional implementations, and it requires a buffer that stores only a few rows of the data array instead of a fixed fraction (typically 25% or more) of the entire array. The 2-D dual scan architecture processes the column and row transforms simultaneously, and the memory buffer size is comparable to existing architectures. The recursive and dual scan architectures can be readily extended to the N-D case.

Description

IMPLEMENTATION OF DISCRETE WAVELET TRANSFORM USING LIFTING STEPS
BACKGROUND OF THE INVENTION
01 The advantages of the wavelet transform over conventional transforms, such as the Fourier transform, are now well recognized. In many application areas, the wavelet transform is more efficient at representing signal features that are localized in both time and frequency. Over the past 15 years, wavelet analysis has become a standard technique in such diverse areas as geophysics, meteorology, audio signal processing, and image compression. Significantly, the 2-D biorthogonal discrete wavelet transform (DWT) has been adopted in the recently established JPEG-2000 still image compression standard.
02 The classical DWT can be calculated using an approach known as Mallat's tree algorithm. Here, the lower resolution wavelet coefficients of each DWT stage are calculated recursively according to the following equations:
c_{p,k} = Σ_m c_{p-1,m} · h[m − 2k]    (1)
d_{p,k} = Σ_m c_{p-1,m} · g[m − 2k]    (2)
where c_{p,k} is the k-th lowpass coefficient at the p-th resolution, d_{p,k} is the k-th highpass coefficient at the p-th resolution, h[] is the lowpass wavelet filter corresponding to the mother wavelet, and g[] is the highpass wavelet filter corresponding to the mother wavelet.
03 The corresponding tree structure for a two-level DWT is illustrated in Fig. 1. As shown in Fig. 1, the forward transform is computed using a series of high and low pass filters, denoted by g(-n) and h(-n), respectively, that operate on an input c_j at increasing resolutions along the dimension of the sample index n. The decimated (i.e., down-sampled by a factor of two using decimators ↓2) output of the high pass filters at different stages (d_{j-1}, d_{j-2}, ...) captures the detail information at different resolutions. The decimated output of each low pass filter (e.g. c_{j-1}) is processed recursively by the low and high pass filters of the next stage to obtain c_{j-2} and d_{j-2}. Finally, the decimated output of the low pass filter of the last stage corresponds to the low frequency content of the original signal at the lowest considered resolution. In Mallat's algorithm, the inverse transform is calculated using a reverse tree algorithm that repeatedly filters and interleaves the various streams of transform coefficients back into a single reconstructed data sequence.
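As a purely illustrative software sketch of Equations 1 and 2 (the function name and the two-tap Haar filters below are stand-ins and are not part of the disclosed hardware), one stage of Mallat's algorithm can be written as:

    from math import sqrt

    def dwt_stage(c, h, g):
        # One stage of Mallat's tree algorithm (Equations 1 and 2):
        # c_next[k] = sum_i c[2k+i] * h[i], i.e. filter then keep every second output.
        n = len(c) // 2
        c_next = [sum(c[2 * k + i] * h[i] for i in range(len(h))
                      if 2 * k + i < len(c)) for k in range(n)]
        d_next = [sum(c[2 * k + i] * g[i] for i in range(len(g))
                      if 2 * k + i < len(c)) for k in range(n)]
        return c_next, d_next

    h = [1 / sqrt(2), 1 / sqrt(2)]    # stand-in lowpass filter (Haar)
    g = [1 / sqrt(2), -1 / sqrt(2)]   # stand-in highpass filter (Haar)
    c1, d1 = dwt_stage([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0], h, g)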
04 The structure of the corresponding separable 2-D DWT algorithm is shown in Fig. 2, where G and H represent the lowpass and highpass subband filters, respectively. The input image is first decomposed horizontally; the resulting outputs are then decomposed vertically into four subbands usually denoted by LL, LH, HL, and HH. The LL subband can then be further decomposed in the same way.
05 In 1994, Sweldens proposed a more efficient way of constructing the biorthogonal wavelet bases, called the lifting scheme. Concurrently, similar ideas were also proposed by others. The basic structure of the lifting scheme is shown in Fig. 3. The input signal s_{j,k} is first split into even and odd samples. The detail (i.e., high frequency) coefficients d_{j-1,k} of the signal are then generated by subtracting the output of a prediction function P of the odd samples from the even samples. The smooth coefficients (the low frequency components) are produced by adding the odd samples to the output of an update function U of the details. The computation of either the detail or smooth coefficients is called a lifting step.
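A minimal software sketch of this split/predict/update structure (the Haar-style P and U used here are an illustrative assumption only; the predict and update functions for a particular wavelet differ):

    def lifting_step(signal, predict, update):
        even = signal[0::2]                                    # split
        odd = signal[1::2]
        detail = [e - predict(o) for e, o in zip(even, odd)]   # detail = even - P(odd)
        smooth = [o + update(d) for o, d in zip(odd, detail)]  # smooth = odd + U(detail)
        return smooth, detail

    # Haar-style example: P(odd) = odd, U(detail) = detail / 2
    smooth, detail = lifting_step([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                                  predict=lambda o: o,
                                  update=lambda d: d / 2.0)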
06 Daubechies and Sweldens showed that every FIR wavelet or filter bank can be factored into a cascade of lifting steps, that is, it can be represented as a finite product of upper and lower triangular matrices and a diagonal normalization matrix. The high-pass filter g(z) and low-pass filter h(z) in Equations 1 and 2 can thus be rewritten as:
h(z) = Σ_{i=0}^{J-1} h_i z^{-i}    (3)
g(z) = Σ_{i=0}^{J-1} g_i z^{-i}    (4)
where J is the filter length. We can split the high-pass and low-pass filters into even and odd parts:
h(z) = h_e(z²) + z^{-1} h_o(z²)    (5)
g(z) = g_e(z²) + z^{-1} g_o(z²)    (6)
The filters can also be expressed as a polyphase matrix as follows (the two rows of each matrix are separated by a semicolon):
P(z) = [ h_e(z)  g_e(z) ; h_o(z)  g_o(z) ]    (7)
Using the Euclidean algorithm, which recursively finds the greatest common divisors of the even and odd parts of the original filters, the forward transform polyphase matrix P(z) can be factored into lifting steps as follows:
P(z) = ∏_{i=1}^{m} [ 1  s_i(z) ; 0  1 ] [ 1  0 ; t_i(z)  1 ] · [ K  0 ; 0  1/K ]    (8)
where s_i(z) and t_i(z) are Laurent polynomials corresponding to the update and prediction steps, respectively, and K is a non-zero constant. The inverse DWT is described by the following synthesis polyphase matrix:
P^{-1}(z) = [ 1/K  0 ; 0  K ] · ∏_{i=m}^{1} [ 1  0 ; -t_i(z)  1 ] [ 1  -s_i(z) ; 0  1 ]    (9)
07 As an example, the low-pass and high-pass filters corresponding to the Daubechies 4-tap wavelet can be expressed as:
h(z) = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3}    (10)
g(z) = -h_3 z² + h_2 z - h_1 + h_0 z^{-1}
where
h_0 = (1+√3)/(4√2),  h_1 = (3+√3)/(4√2),  h_2 = (3-√3)/(4√2),  h_3 = (1-√3)/(4√2).
Following the above procedure, we can factor the analysis polyphase matrix of the Daubechies-4 wavelet filter as:

P(z) = [ 1  0 ; √3  1 ] · [ 1  -√3/4 - ((√3-2)/4) z^{-1} ; 0  1 ] · [ 1  0 ; -z  1 ] · [ (√3-1)/√2  0 ; 0  (√3+1)/√2 ]    (11)
The corresponding synthesis polyphase matrix can be factored as:
P^{-1}(z) = [ (√3+1)/√2  0 ; 0  (√3-1)/√2 ] · [ 1  0 ; z  1 ] · [ 1  √3/4 + ((√3-2)/4) z^{-1} ; 0  1 ] · [ 1  0 ; -√3  1 ]    (12)
Similarly, the 9/7 analysis wavelet filter can be factored as:
P(z) = [ 1  α(1+z^{-1}) ; 0  1 ] [ 1  0 ; β(1+z)  1 ] [ 1  γ(1+z^{-1}) ; 0  1 ] [ 1  0 ; δ(1+z)  1 ] [ ζ  0 ; 0  1/ζ ]    (13)
The corresponding synthesis wavelet filter is factored as:
P^{-1}(z) = [ 1/ζ  0 ; 0  ζ ] [ 1  0 ; -δ(1+z)  1 ] [ 1  -γ(1+z^{-1}) ; 0  1 ] [ 1  0 ; -β(1+z)  1 ] [ 1  -α(1+z^{-1}) ; 0  1 ]    (14)
where the values of α, β, γ, δ, and ζ are shown in Fig. 8. The computational cost of calculating two Daub-4 DWT coefficients using Equation 11 is nine operations (five multiplications and four additions). On the other hand, Mallat's algorithm needs fourteen arithmetic operations (eight multiplications and six additions) according to Equation 10. In other words, the lifting steps provide a 56% speed up for the Daub-4 DWT calculation. For longer FIR wavelet filters, the speed up can approach 100%, which is a significant improvement for real-time applications.
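The operation count quoted above can be checked against a small software model. The sketch below follows the well-known Daubechies-Sweldens lifting steps for the Daub-4 wavelet (the exact grouping of constants in Fig. 7 may differ, and periodic extension at the block boundary is an assumption); each output pair costs five multiplications and four additions, compared with eight multiplications and six additions for direct filtering with Equation 10:

    from math import sqrt

    def daub4_lifting(x):
        # Daub-4 forward DWT (one stage) via lifting; x must have even length.
        # Periodic extension is used at the block boundary.
        s3 = sqrt(3.0)
        e = list(x[0::2])                                  # split
        o = list(x[1::2])
        n = len(e)
        s1 = [e[l] + s3 * o[l] for l in range(n)]          # 1 mult + 1 add per pair
        d1 = [o[l] - 0.25 * s3 * s1[l]
              - 0.25 * (s3 - 2.0) * s1[l - 1]
              for l in range(n)]                           # 2 mults + 2 adds per pair
        s2 = [s1[l] - d1[(l + 1) % n] for l in range(n)]   # 1 add per pair
        low = [(s3 - 1.0) / sqrt(2.0) * v for v in s2]     # 1 mult per pair
        high = [(s3 + 1.0) / sqrt(2.0) * v for v in d1]    # 1 mult per pair
        return low, high

    low, high = daub4_lifting([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])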
SUMMARY OF THE INVENTION
08 To calculate a DWT using a lifting algorithm, the input signal has to be first separated into even and odd samples. Each pair of input samples (one even and one odd) is then processed according to the specific analysis polyphase matrix. For many applications, the data can be read no faster than one input sample per clock cycle, so sample pairs are usually processed only at every other clock cycle. This is a limitation on the speed and efficiency of a direct implementation of the lifting scheme. To overcome this bottleneck, there are proposed architectures in which data streams are interleaved within the DWT. Recursive architectures exploit the available idle cycles and re-use the same hardware to recursively interleave the DWT stages, and dual scan architectures achieve an efficiency gain by keeping the datapath hardware busy with two different streams of data.
09 There is therefore provided, in accordance with an aspect of the invention, an apparatus for digital signal processing, the apparatus comprising a cascade of digital filters connected to receive a sampled input signal and having an output, in which the digital filters implement a transform decomposed into lifting steps, the cascade of digital filters operating on pairs of samples from the sampled input signal. A source of a data stream is also provided, where the data stream is also composed of samples. A multiplexer multiplexes the samples of the data stream with the sampled input signal for processing by the cascade of digital filters.
10 In a further aspect of the invention, there is provided a method of transforming a sampled input signal into a transformed output signal, the method comprising the steps of: operating on pairs of samples of the sampled input signal with a cascade of digital filters that implements a transform decomposed into lifting steps to provide an output; and operating on samples from a data stream using the cascade of digital filters, where the samples from the data stream have been multiplexed with the sampled input signal.
11 In further aspects of the invention, the cascade of digital filters implements a one-dimensional discrete wavelet transform, such as a Daubechies-4 wavelet transform or 9/7 wavelet transform. The cascade of digital filters may implement filtering steps corresponding to Laurent polynomials. The cascade of digital filters may implement a two-dimensional transform that is decomposed into a first one-dimensional (row) transform followed by a second one-dimensional (column) transform. A buffer memory may be connected to receive samples from the data stream and output the samples to the cascade of digital filters for processing of the data stream by interleaving of the samples from the data stream with the sampled input signal. The data stream received by the buffer memory may be taken from the output of the cascade of digital filters to provide a recursive architecture.
The cascade of digital filters may implement an N-dimensional transform, where N is greater than 2, and the number of digital filter cascades is N.

BRIEF DESCRIPTION OF DRAWINGS
12 There will now be described preferred embodiments of the invention with reference to the figures by way of illustration, without intending to limit the invention to the precise embodiments disclosed, in which:
Fig. 1 is a block diagram of Mallat's tree algorithm;
Fig. 2 is a block diagram of the 2-D separable DWT;
Fig. 3 depicts a general form of the lifting scheme;
Fig. 4 depicts a MAC for asymmetric wavelet filters;
Fig. 5 depicts a MAC for symmetric wavelet filters;
Fig. 6 depicts circuits for the basic lifting steps;
Fig. 7 is a 1-D recursive architecture for the Daub-4 DWT;
Fig. 7a depicts a controller emitting enabling signals to registers;
Fig. 7b depicts a controller emitting enabling signals to delay stages;
Fig. 7c illustrates how the circuit of Fig. 7 implements Equation 11;
Fig. 8 is a 1-D recursive architecture for the 9/7 DWT;
Fig. 9 is a 1-D DWT coefficient computation order;
Fig. 10 depicts a 1-D dual scan architecture;
Fig. 10a shows detail of the architecture of the block PE in Fig. 10;
Fig. 11 depicts a conventional 2-D lifting architecture;
Fig. 12 shows the calculation sequence for a 2-D recursive architecture;
Fig. 13 depicts a 2-D recursive architecture;
Fig. 14 depicts exchange operations;
Fig. 15 depicts the scan sequence of a 2-D dual scan architecture; and
Fig. 16 depicts a 2-D dual scan architecture.
13 This disclosure ends with Tables 1, 3, 5, 7, 9, 11, 13, 15, 17 and 19 that illustrate the manner of implementation of the recursive and dual scan architectures.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS OF THE INVENTION
14 In the claims, the word "comprising" is used in its inclusive sense and does not exclude other elements being present. The use of the indefinite article "a" does not exclude more than one of the element being present.
15 As a preliminary matter, we consider a signal extension method for use in the proposed hardware architectures. To keep the number of wavelet coefficients the same as the number of data samples in the original signal, an appropriate signal extension method is necessary. Typical signal extension methods are zero padding, periodic extension, and symmetric extension. Zero padding is not normally acceptable for the classical wavelet algorithms due to the extra wavelet coefficients that are introduced. Periodic extension is applicable to all (biorthogonal and orthogonal) wavelet filters, but symmetric extension is suitable only for (symmetric) biorthogonal wavelet filters. Since the lifting scheme applies to the construction of biorthogonal wavelets, symmetric extension can always be used when calculating the lifting scheme. Lifting steps obtained by factoring the finite wavelet filter pairs can be calculated by using simple zero padding extension. After a polyphase matrix representing a wavelet transform with finite filters is factored into lifting steps, each step becomes a Laurent polynomial, namely the s_i(z) or t_i(z) from Equation 8. Since the difference between the degrees of the even and odd parts of a polynomial is never greater than two, we can always find a common divisor of first order or lower for the polynomials. Hence, a classical wavelet filter can always be factored into first-order or lower-order Laurent polynomials (i.e., s_i(z) or t_i(z)). Lifting steps containing these short polynomials correspond to one- to three-tap FIR filters in the hardware implementations. Because signal extension is not necessary for a two-tap wavelet filter, like the Haar wavelet, zero padding can be used in the lifting algorithm.
16 In the preferred embodiment disclosed here, the easily implemented zero extension is used in the proposed architectures. The sample overlap wavelet transform recommended in JPEG-2000 Part II can also be implemented in the proposed 2-D architecture.
17 Because of the down-sampling resulting from the splitting step at each stage of the lifting-based DWT, the number of low frequency coefficients is always half the number of input samples from the preceding stage. Further, because only the low frequency DWT coefficients are further decomposed in the dyadic DWT, the total number of samples to be processed for an L-stage 1-D DWT is:
N(1 + 1/2 + 1/4 + ... + 1/2^{L-1}) = N(2 − 1/2^{L-1}) ≤ 2N
where N is the number of input samples. For a finite-length input signal, the number of input samples is always greater than the total number of intermediate low frequency coefficients to be processed at the second and higher stages. Accordingly, there are time slots available to interleave the calculation of the higher stage DWT coefficients while the first-stage coefficients are being calculated.
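For example, for the 5-stage, 1024-sample case considered later in this disclosure:
1024 (1 + 1/2 + 1/4 + 1/8 + 1/16) = 1024 + 512 + 256 + 128 + 64 = 1984 < 2048 = 2N.
The first stage occupies 512 of the 1024 input clock cycles (one pair every other cycle), which leaves 512 idle cycles into which the 480 coefficient pairs of stages two through five can be interleaved.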
18 The recursive architecture (RA) is a general scheme that can be used to implement any wavelet filter that is decomposable into lifting steps. As 1-D examples, we describe RA implementations of the Daub-4 and 9/7 wavelet filters. The RA can be extended to 2-D wavelet filters, and can be extended to even higher dimensions by using the methods set forward in this disclosure.
19 The RA is a modular scheme made up of basic circuits such as delay units, pipeline registers, multiplier-accumulators (MACs), and multipliers. Since the factored Laurent polynomials s_i(z) and t_i(z) for symmetric (biorthogonal) wavelet filters are themselves symmetric, and those for asymmetric filters are normally asymmetric, we use two kinds of MACs to minimize the computational cost. The MAC for asymmetric filters, shown in Fig. 4, consists of a multiplier A, an adder, and two shifters labelled shift. The symmetric MAC, shown in Fig. 5, also has a multiplier A and two shifters, but it has one more adder than the asymmetric MAC. The shifters are used to scale the partial results so that accuracy can be better preserved.
20 Different kinds of lifting-based DWT architectures can be constructed by combining the four basic lifting step circuits shown in Fig. 6. These circuits are combinations of multipliers a and b, one or more delay elements z^{-1} (where z^{-1} corresponds to a delay of one sampling interval in the time domain), registers R, and adders. From left to right, these circuits compute respectively the following functions: a + bz^{-1}, a + bz, a(1 + z^{-1}), a(1 + z); a software sketch of these four functions is given after the construction steps below. The general construction has the following steps:
Step 1: Decompose the given wavelet filter into lifting steps.
Step 2: Construct the corresponding cascade of lifting step circuits. Replace each delay unit in each circuit with an array of delay units. The number of delay units in the array is the same as the number of wavelet stages.
Step 3: At the beginning of the cascade, construct an array of delay units that will be used to split the inputs for all wavelet stages into even and odd samples. These delay units are also used to temporarily delay the samples so that they can be input into the lifting step cascade in the right time slot. Two multiplexer switches are used to select one even input and one odd input to be passed from the delay units to the first lifting step.
Step 4: Construct a data flow table that expresses how all of the switches are set and how the delay units are enabled in each time slot. There is latency as the initial inputs for the first wavelet stage propagate down through the cascade. A free time slot must then be selected to fix the time when the inputs for the second wavelet stage will be sent into the cascade. All higher order stages must also be scheduled into free time slots in the data flow table.
Step 5: Design the control sequencer to implement the data flow table.
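An illustrative software model of the four basic lifting-step circuits of Fig. 6 referred to above (a sketch only, with arbitrary function names; in the hardware each function is evaluated on one sample per clock cycle, and x[n-1] and x[n+1] stand in for the z^{-1} and z delays):

    def f1(x, n, a, b):
        # y[n] = a*x[n] + b*x[n-1]   (transfer function a + b*z^-1)
        return a * x[n] + b * x[n - 1]

    def f2(x, n, a, b):
        # y[n] = a*x[n] + b*x[n+1]   (transfer function a + b*z)
        return a * x[n] + b * x[n + 1]

    def f3(x, n, a):
        # y[n] = a*(x[n] + x[n-1])   (transfer function a*(1 + z^-1))
        return a * (x[n] + x[n - 1])

    def f4(x, n, a):
        # y[n] = a*(x[n] + x[n+1])   (transfer function a*(1 + z))
        return a * (x[n] + x[n + 1])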
21 The RA in Fig. 7 calculates the Daub-4 DWT, while the RA in Fig. 8 calculates the 9/7 DWT. In both figures, elements labelled R_i or R'_i refer to registers, and D_i refers to delay units, where i represents the stage and is an integer from 1 to L, with L representing the last stage, which would be 4 for a three-stage DWT since the input signal has i=1 and relates to stage 0. S_i refers to the various switches, where i is an integer from 1 to 4 in Fig. 7, and from 1 to 7 in Fig. 8. The enabling signals for the registers EnR_i originate from the controller as shown in Fig. 7a. Similarly, the enabling signals for the delay stages EnD_i originate from the controller as shown in Fig. 7b. The controller is connected to all of the enabling signals as described and to the switches; however, the connections to switches S3 to S7 have been omitted for clarity. The enable signals and switch positions at any particular time can be determined from Tables 3 and 5, respectively, for a three-stage DWT. The values e_i and o_i represent the even and odd values of the input signals, q_i represents the input values from the delay stages into the lifting scheme, and E_i and O_i represent the intermediate values of the CDF in Figs. 7 and 8.

22 In the proposed lifting scheme, a cascade of digital filters CDF (Fig. 7, Fig. 8) is connected to receive a sampled input signal Input and has outputs l, h (Fig. 7) or L, H (Fig. 8, Fig. 10). The cascade of digital filters CDF implements a DWT decomposed into lifting steps of the type shown in Fig. 6 and is dependent upon the DWT being used. A buffer memory formed of memory blocks M1, M2 (Fig. 7, Fig. 8), each of which is made up of a number of registers R_i, is connected to receive samples from a data stream, and the values are assigned according to the enable signals EnR_i. In Fig. 10a, R_1 and R_2 form a memory that, together with switches S1 and S2, provides multiplexing of two data streams together for processing by the cascade of digital filters. For example, in the case of the dual scan architecture of Fig. 10, the sampled input signal and the data stream (a second sampled input signal) may be the left and right channels of a stereo signal, or, in the case of the recursive architecture of Figs. 7 and 8, the data stream may be taken from the output of the cascade of digital filters. The buffer memory M1, M2 is connected to output the received samples into the cascade of digital filters CDF. A multiplexer Mp (Figs. 7, 8), consisting of switches S1, S2 (Figs. 7, 8) and a controller, is provided for multiplexing the sampled input signal Input with the sampled signal stored in the memory buffer as the input to the cascade of digital filters CDF according to the procedures described below. In the case of the dual scan architecture of Fig. 10, the inputs 1 and 2 are multiplexed together by the multiplexer formed by switches S1 and S2 and their associated controller (not shown) for processing by the cascade of digital filters PE.
23 In Fig. 7, the input registers R_i (i = 1, 2, ..., L) and R'_i (i = 3, ..., L) hold the input values for the (i−1)-th DWT stage. Thus the first-stage coefficients can be calculated at every other clock cycle and the data for the other stages can be fed into the lifting step pipeline during the intervening cycles. Using x_{i,j} to denote the j-th coefficient of the i-th stage, the DWT coefficients can be calculated in the order shown in Fig. 9. The CDF is implemented according to Equation 11, using the circuits for the lifting steps shown in Fig. 6 to perform the necessary operations, where the z^{-1} has been replaced with the delay stages D_i.

24 The input registers R_i also synchronize the even and odd samples of each stage. Since the first two stages can be processed immediately when the odd samples are ready, no input register is needed for the odd samples of these two stages. Register D_i is a delay unit for the i-th stage. After splitting the input data into even and odd parts, the Daub-4 DWT is calculated step by step as shown in Table 1. In Table 1, E_n and O_n are the outputs of each lifting step; e_{i,j} and o_{i,j} denote the even and odd intermediate results of each lifting step. Since the architecture is pipelined by each MAC unit, the outputs of each lifting step are synchronized.
As an example, the calculations of the first pair of DWT coefficients are given below:
E1: x_{1,1} = x_{0,1};   O1: x_{1,2} = x_{0,2}
E2: e_{1,1} = x_{1,1};   O2: o_{1,1} = α·x_{1,1} + x_{1,2}
E3: e_{1,1} = β·o_{1,1} + e_{1,1};   O3: o_{1,1} = o_{1,1}
E4: e_{1,1} = γ·o_{1,1} + e_{1,1};   O4: o_{1,1} = o_{1,1}
Low frequency DWT coefficient l: l_{1,1} = δ·e_{1,1}
High frequency DWT coefficient h: h_{1,1} = ω·(ε·e_{1,1} + o_{1,1})
25 Therefore, the DWT coefficients of the first stage are generated five clock cycles after the first input sample is received. The first low frequency DWT coefficient l_{1,1} is also stored in register R_1. After the second low frequency DWT coefficient l_{1,2} is ready, l_{1,1} and l_{1,2} are further processed in the idle cycles, as shown in Table 1. The outputs at various stages of the lifting steps corresponding to the Daub-4 wavelet are shown in Fig. 7c. The analysis polyphase matrix has four factors. [A_p, B_p] represents the output after the p-th matrix multiplication. α, β, γ, δ, ε, ω are as defined in Fig. 7. The relationship between the output at each clock cycle and the output of the various lifting factors is as follows:

Matrix-1, Matrix-(1+2), Matrix-(1+2+3), Matrix-(1+2+3+4) and Matrix-(1+2+3+4+5) denote the successive partial products of the lifting factors; [A_p, B_p] is the corresponding pair of outputs after the p-th partial product has been applied to the even and odd inputs, and is expressed in terms of the constants α, β, γ, δ, ε, ω and the lifting step outputs E_p, O_p. The [E_p, O_p] denotes the output at the p-th clock cycle. The [A_p, B_p] are as defined in Fig. 7c.
26 The control signals for the switches in an RA can also be deduced from the corresponding data flow table (which is Table 1 in this case). The timing for the register enable signals is shown in Table 3. Switches S1, S2 and S3 steer the data flows at each stage. The timing of the switch control signals is shown in Table 5. Output switch S4 feeds back the low frequency DWT coefficients (except for the last stage) to be further decomposed. The switching timing for S4 is the same as for S1.
27 Figure 8 stores the data in registers in a manner similar to that described for Fig. 7. The multiplexer Mp inputs the odd and even values as pairs into the cascade of digital filters. The delay registers D_{i1}, D_{i2}, D_{i3} and D_{i4} ensure that the values are input as pairs into the next set of filters, where i relates to the stage of the DWT. Switch S7 feeds back the low frequency DWT coefficients except during the last stage, when the coefficient is output at L. The intermediate stages are implemented according to Equation 13, using the corresponding circuits shown in Fig. 6 to perform the necessary operations, where the z^{-1} has been replaced with the delay stages D_i.
28 The calculations of the first pair of DWT coefficients for the Daubechies 9/7 wavelet are given below:
E1: e_{1,1} = x_{0,1};   O1: o_{1,1} = α·(x_{0,1} + x_{0,3}) + x_{0,2}
E2: e_{1,1} = β·(o_{1,0} + o_{1,1}) + e_{1,1};   O2: o_{1,1} = o_{1,1}
E3: e_{1,1} = e_{1,1};   O3: o_{1,1} = γ·(e_{1,1} + e_{1,2}) + o_{1,1}
E4: e_{1,1} = δ·(o_{1,0} + o_{1,1}) + e_{1,1};   O4: o_{1,1} = o_{1,1}
Low frequency DWT coefficient L: l_{1,1} = ζ·e_{1,1}
High frequency DWT coefficient H: h_{1,1} = o_{1,1}/ζ
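A software sketch of one 9/7 stage written directly from these lifting steps; the numerical values of α, β, γ, δ and ζ used below are the commonly published CDF 9/7 constants and are an assumption here, since Fig. 8 is not reproduced in this text:

    def cdf97_stage(x):
        # One stage of the 9/7 forward DWT via four lifting steps and a scaling.
        # Wrap-around (periodic) extension is used at the block boundary.
        a = -1.586134342    # alpha (assumed value)
        b = -0.052980118    # beta  (assumed value)
        g = 0.882911076     # gamma (assumed value)
        d = 0.443506852     # delta (assumed value)
        z = 1.149604398     # zeta  (assumed value)
        e = list(x[0::2])
        o = list(x[1::2])
        n = len(e)
        o = [o[i] + a * (e[i] + e[(i + 1) % n]) for i in range(n)]  # predict 1
        e = [e[i] + b * (o[i - 1] + o[i]) for i in range(n)]        # update 1
        o = [o[i] + g * (e[i] + e[(i + 1) % n]) for i in range(n)]  # predict 2
        e = [e[i] + d * (o[i - 1] + o[i]) for i in range(n)]        # update 2
        low = [z * v for v in e]                                    # L = zeta * e
        high = [v / z for v in o]                                   # H = o / zeta
        return low, high

    low, high = cdf97_stage([float(i) for i in range(16)])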
29 The design of the controller is relatively simple, due to the regularity of the control signals for the RA, as shown in Table 3 and Table 5. All control signals are generated by counters and flip-flops controlled by a four-state finite state machine. The counters generate the periodic signals for the longer period (754 clock cycles) control signals, and the flip-flops produce local delays. If externally-generated start and stop signals are provided, the long counter for keeping track of the number of input samples is unnecessary. Compared to other direct implementations of lifting-based DWTs, the overhead for the RA controller is very small. The controller should occupy less than 10% of the total silicon area of the 1-D RA.

30 The remaining elements of the RA include registers and switches (tri-state buffers). Since the area of the switches is negligible compared to the size of the whole architecture, the cost of the registers dominates. For implementing an L-stage DWT, the RA uses (L−1)(M+1) more registers than a conventional lifting-based architecture, where M is the number of delay registers. Considering that a conventional architecture needs an extra memory bank to store at least N/2 intermediate DWT coefficients, the RA architecture is more area-efficient in most applications, where (L−1)(M+1) << N/2. The power consumption of the RA should be lower than that of a conventional architecture because the RA eliminates the memory read/write operations and because all data routing is local. By avoiding the fetching of data from memories and the driving of long wires, the power dissipated by the RA switches is small.
31 In Fig. 9 the coefficient computation order is shown with the input at the bottom and the computed coefficients in the levels above, where * is used to denote an idle clock cycle in which no coefficient is calculated. Since the pipeline delay for calculating an L-stage DWT is L × T_d (where T_d is the latency from input to output) and the sampling interval for each stage computation increases by two cycles for each additional stage, as shown in Fig. 9, the clock cycle count T_P for processing an N-sample DWT can be expressed as:
T_P = N + L × T_d + (1 + 2 + ... + 2^{L-2}) = N + L × T_d + 2^{L-1} − 1.
32 The hardware utilization can be defined as the ratio of the actual computation time to the total processing time, with time expressed in numbers of clock cycles. At each section of the pipeline structure, the actual clock cycle count T_C is the number of sample pairs to be processed:
T_C = (N + N(1 − 2^{1−L}))/2.
Note that N(1 − 2^{1−L}) is the number of samples being processed at the second or higher stages. The busy time T_B of the corresponding section can be expressed as:
T_B = T_P − T_d = N + (L − 1) × T_d + 2^{L−1} − 1.
Consequently, the hardware utilization U of the L-stage RA is:
U = T_C / T_B × 100% = [N + N(1 − 2^{1−L})] / [2(N + (L − 1) × T_d + 2^{L−1} − 1)] × 100%    (15)
Because U is a continuous concave function of the variable L when L ≥ 1, the maximum hardware utilization is achieved when ∂U/∂L = 0. Ignoring the delay T_d, the condition ∂U/∂L = 0 reduces to:
L = (1/2)(log₂ N + log₂(1 − 1/L) + 1).
Assuming L > 1 and N >> 2^L, the utilization reaches a maximum of about 90% when L ≈ 0.5 log₂ N, and gradually reduces to around 50% when L = 1 or L = log₂ N. For a 5-stage DWT operating on 1024 input samples, the utilization approaches 92%. When the number of decomposition stages L increases, the processing time increases significantly and the utilization drops accordingly. As mentioned above, the delay of 2^{L−1} − 1 was due to the increasing separation (2^i clock cycles) of the input values to each stage. If we decrease the sampling interval for each stage as soon as all previous stages have finished, we can speed up the computation. With a little additional controller overhead, the processing time in clock cycles of an L-stage DWT can be reduced to:
N + (L × T_d).
When N → ∞, the hardware utilization of the 1-D RA approaches 100%.
Compared to conventional implementations of the lifting algorithm, the proposed architectures can achieve a speed-up of up to almost 100%, as shown in Table 7.
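A quick numerical check of Equation 15 as written above (a sketch only; the function name and the pipeline latency T_d = 5 clock cycles are assumptions):

    def ra_utilization(n, l, td=5):
        # Hardware utilization of the L-stage 1-D RA per Equation 15:
        # sample pairs actually processed, divided by the busy time.
        t_c = (n + n * (1.0 - 2.0 ** (1 - l))) / 2.0
        t_b = n + (l - 1) * td + 2 ** (l - 1) - 1
        return 100.0 * t_c / t_b

    # Roughly 92-94% for a 5-stage DWT on 1024 samples, depending on the assumed T_d.
    print(ra_utilization(1024, 5))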
33 To achieve higher hardware utilization for special cases, we also propose the dual scan architecture (DSA), which interleaves the processing of two independent signals to increase the hardware utilization. The 1-D DSA is shown in Fig. 10. The input signals Input1 and Input2 are two different data streams. These streams are multiplexed as shown in Fig. 10a into the processing element PE, which is a conventional direct hardware implementation of the lifting scheme constructed from the basic building block circuits discussed previously. The input switches SW2 and SW3 are connected to one of the two input pairs of odd and even components when processing the first stage, and are connected to the memory M when processing the other stages, where the two connections from the memory correspond to the coefficients calculated from Input1 and Input2. Switch SW0 separates the low frequency coefficients of the two input signals. Because the architecture generates one low frequency coefficient at each clock cycle, SW0 is controlled by the system clock. The output switch SW1 is connected to the output L only at the final stage, while coefficients are output to H at every stage. The size of the memory unit is M/2, where M is the maximum number of input samples.
34 The 1-D DSA calculates the DWT as the input samples are being shifted in, and stores the low frequency coefficients in the internal memory. When all input samples have been processed, the stored coefficients are retrieved to start computing the next stage. The input switches SW2, SW3 in Fig. 10 separate the two independent data flows, while the input switches S1, S2 inside the PE split the even and odd samples. The processing of the different transform levels is not interleaved in the dual scan architecture, because there are no idle clock cycles for doing so, which is why there is a buffer for storing the intermediate coefficients. In other words, the sequence of operation of the 1-D DSA is as follows (a software sketch of this sequence is given after the list):
1. In one clock cycle, the first pair of odd and even samples e1, o1 from Input1 comes in, and the DWT calculation for Input1 starts. Switches S1 and S2 select e1 and o1. The even sample is delayed in R1.
2. In the second clock cycle, the first pair of odd and even samples e2, o2 from Input2 comes in, and the DWT calculation for Input2 starts. Switches S1 and S2 select e2 and o2. The even sample is delayed in R2.
3. Steps 1 and 2 are repeated for the other pairs of input samples.
4. Five clock cycles after the PE starts operation, the lowpass and highpass coefficients are produced at the output. The PE implements lifting steps of the type shown in Fig. 6, using adders, delay elements D_i enabled through EnD_i and associated switches S3, S4, S5 and S6, and scalers α, β, γ, δ whose values are indicated in Fig. 10a. The output from the PE is scaled by scalers ζ and 1/ζ.
5. The DWT coefficients for Input1 and Input2 are output at alternate clock cycles. The lowpass coefficients are stored in the memory buffer M.

6. The intermediate coefficients corresponding to Input1 and Input2 are present in different parts of the circuit, and hence do not conflict.
7. After the first level of decomposition is complete for Input1 and Input2, there will not be any new input. Now the first level lowpass DWT coefficients that are stored in the memory buffer M are fed back and used as the input of the PE.
8. Steps 1-4 are repeated with the two streams of inputs corresponding to Input1 and Input2.
9. The second stage DWT coefficients for Input1 and Input2 are output at alternate clock cycles. The lowpass coefficients are again stored in the memory buffer M.
10. After the second level of decomposition is complete for Input1 and Input2, there will not be any new input. Now the second level lowpass DWT coefficients that are stored in the memory buffer are fed back and used as the input of the PE.
11. Further decomposition can be carried out using the same procedure.
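A behavioural sketch of this sequence (software only; the trivial Haar-style predict/update pair below is an assumed stand-in for the PE of Fig. 10, and the nested loop plays the role of the alternating clock cycles):

    def pe(even, odd):
        # Stand-in for the processing element: one input pair -> (low, high).
        detail = odd - even
        smooth = even + detail / 2.0
        return smooth, detail

    def dsa_1d(input1, input2, stages):
        streams = [list(input1), list(input2)]     # two independent data streams
        highs = [[], []]
        for _ in range(stages):
            next_low = [[], []]
            for k in range(len(streams[0]) // 2):
                for s in (0, 1):                   # the streams share the PE on alternate cycles
                    lo, hi = pe(streams[s][2 * k], streams[s][2 * k + 1])
                    next_low[s].append(lo)         # lowpass stored in buffer M
                    highs[s].append(hi)
            streams = next_low                     # buffer M fed back for the next stage
        return streams, highs

    lows, highs = dsa_1d([1.0, 2, 3, 4, 5, 6, 7, 8], [8.0, 7, 6, 5, 4, 3, 2, 1], 2)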
35 As the 1-D DSA performs useful calculations in every clock cycle, the hardware utilization for the PE is 100%. The processing time for the L-stage DWT of two N-sample signals is 2N(1 − 2^{−L}) + L × T_d. Compared to conventional implementations for computing two separate signals, the 1-D DSA requires only half the hardware. Hence, given an even number of equal-length signals to process, the speedup of the 1-D DSA is 100%. A recursive architecture RA will calculate a DWT in about half the time compared to the DSA. It starts calculating higher-level DWT coefficients even before it completes the first level decomposition. On the other hand, a DSA calculates a DWT for two streams stage-by-stage. The total computation time is double that of the RA. However, because it calculates the DWT of two arrays, on average it has a hardware utilization efficiency similar to that of the RA.
36 A conventional implementation of a separable 2-D lifting-based DWT is illustrated in Fig. 11, where separate row and column processors, Rp and Cp respectively, each use a 1-D lifting architecture. The row processor calculates the DWT of each row of the input image, and the resulting decomposed low and high frequency components, L and H respectively, are stored in memory bank M1. Since this bank normally stores all the horizontal DWT coefficients, its size is N² for an N×N image. When the row DWT is completed, the column processor Cp starts calculating the vertical DWT on the coefficients from the horizontally decomposed image. The LH, HL, and HH subbands are final results and can be shifted out; the LL subband is stored in memory bank M2 for further decomposition. The size of memory bank M2 is thus at least N²/4. Such a straightforward implementation of the 2-D DWT is both time and memory-intensive. To increase the computation speed, we propose a 2-D RA and a 2-D DSA for the separable 2-D lifting-based DWT.
37 The basic strategy of the 2-D recursive architecture is the same as that of its 1-D counterpart: the calculations of all DWT stages are interleaved to increase the hardware utilization. Within each DWT stage, we use the processing sequence shown in Fig. 12. The image is scanned into the row processor in raster format, and the first horizontal DWT is immediately started. The resulting high and low frequency DWT coefficients H_i, L_i of the odd lines are collected and pushed into two FIFO (first in, first out) registers or two memory banks. The separate storage of the high and low frequency components H_i, L_i produces a more regular data flow and reduces the required output switch operations, which in turn consumes less power. The DWT coefficients of the even lines are also rearranged into the same sequence, and are directly sent to the column processor together with the outputs of the FIFOs. The column processor starts calculating the vertical DWT in a zigzag format after one row's delay.
38 A schematic for the 2-D RA is shown in Fig. 13. The CDF contains both the row and column processors. The enable signals EnR_i are controlled by the controller as shown in Fig. 7a, registers are denoted using R_i, delay stages are denoted with D_i, and switches S1, S2, S3, S4 and S are controlled by the controller. The same controller has been depicted twice for clarity. Values e_i and o_i, where i denotes the stage, are the values to be input into the row processor Rp inputs E_R and O_R, respectively; values E_i and O_i are to be input into the column processor Cp inputs E_C and O_C, respectively. L_R, H_R and L_C, H_C are the low and high frequency components from the row and column processors, respectively. The exchanges X perform the operations depicted in Fig. 14. The FIFOs are composed of rows labelled 1/2^i Row, where i = 1, 2, ..., L. The multiplexer Mp, composed of the controller and switches S1 and S2, multiplexes the signal to be fed into the row processor Rp. Note that the row DWT is similar to the 1-D DWT, so the datapath of the row processor is the same as for the 1-D RA.
The column processor Cp is implemented by replacing the delay registers and input circuit of the 1-D RA with delay registers D_i, FIFOs, and the multiplexer Mp2 consisting of the controller and switches S3 and S4, as shown in Fig. 13. The interaction between the row and column processors goes as follows. When the row processor Rp is processing the even lines (assuming that it starts with the 0th row), the high and low frequency DWT coefficients are shifted into their corresponding FIFOs. When the row processor Rp is processing the odd lines, the low frequency DWT coefficients of the current lines and of the previous lines stored in the FIFOs are sent to the column processor Cp. Registers D_i are used if the low frequency coefficients are generated before their high frequency counterparts. At the same time, the high frequency DWT coefficients of the current lines are shifted into their corresponding FIFOs, and the outputs of these FIFOs are shifted into the FIFOs corresponding to the low frequency components. The computations are arranged in such a way that the processing of the DWT coefficients for the first and the other stages can be easily interleaved in neighbouring clock cycles. Once the processing of the low frequency components is done, the outputs of both FIFOs are sent to the column processor Cp through the multiplexer Mp2. The function of the exchanges, denoted as X in Fig. 13, is to redirect the data flows between the FIFOs and the input of the column processor Cp. As shown in Fig. 14, the exchange block has two input channels, two output channels, and a control signal. When the control signal SW=0, the data from input channel 1 flows to output channel 1, and the data from input channel 2 flows to output channel 2; when SW=1, one data stream flows from input channel 2 to output channel 1, and the other data stream flows from input channel 1 to output channel 2. At the low frequency output of the column processor Cp, a switch S selects the LL subband and sends it back to the row processor Rp for further decomposition.
39 A portion of the data flow for computing an 8×8 sample 2-D Daub-4 DWT is shown in Table 9. As described before, the first pair e_{1,1,1} and o_{1,1,1} of the first-stage row transform coefficients is generated at the sixth clock cycle. They are immediately shifted into the high and low frequency FIFOs, respectively. The consecutive DWT coefficients of the same row are in turn pushed into their corresponding FIFOs in the subsequent clock cycles until the end of the row (the 12th clock cycle in this case). When the first pair of the row transform coefficients of the second row is ready, the low frequency coefficient o_{1,2,1} is sent to the odd input of the column processor, and the high frequency coefficient e_{1,2,1} is pushed into the corresponding FIFO. The first low frequency coefficient of the first row o_{1,1,1} is also popped out of the FIFO and sent to the even input of the column processor; its high frequency counterpart e_{1,1,1} is pushed into the low frequency FIFO. After 4 clock cycles, the column processor Cp generates the first pair of the 2-D DWT coefficients, of which the low frequency one ll_{1,1,1} is temporarily stored in register R_2. The row processor Rp starts further decomposing the low frequency DWT coefficients after the second low frequency coefficient is generated (at the 21st clock cycle in Table 9).
40 At the end of the row transform of the second row (at the 20th clock cycle in this case), both FIFOs for the first stage contain only the high frequency row transform coefficients of the first two rows, and start sending these coefficients to the column processor Cp after one clock cycle. As shown in Table 9, the calculation of the multiple stage 2-D DWT is continuous and periodic, so that the control signals for the data flow are easy to generate with relatively simple logic circuits.
41 Similar to the 1-D RA case, the control signals for the 2-D RA are deduced from the data flow as shown in Table 9. The timing for the switch signals of the 2-D RA for the lifting-based Daub-4 DWT is shown in Table 11, and the enable signals are fixed delay versions of these switch signals. Also, similar to the delay reduction method used in the 1-D RA, the delay time of the 2-D DWT can be minimized. The timing of the control signals for other wavelets is similar, and can be achieved by changing the delays in Table 11.
42 Since the high-frequency components are processed one row after the low-frequency components, as shown in Fig. 13 and Table 9, the processing delay of the column transform for each stage is roughly one row. Also, because all the stages are interleaved, the total processing time for an L-stage 2-D DWT is:
N × N + N + 2 × L × T_d + 2^L − 1.
Similar to the 1-D implementation, a hardware utilization of about 90% can be achieved when L is close to log₂ N.
43 In a conventional 2-D DWT algorithm, the vertical DWT is carried out only after the horizontal DWT is finished. This delay between the row and column computations limits the processing speed. The 2-D DSA shortens the delay by adopting a new scan sequence. In applications that can read two pixels per clock cycle from a data buffer, the scan sequence of the 2-D DSA shown in Fig. 15 can be used. The row processor Rp scans along two consecutive rows simultaneously, while the column processor Cp also horizontally scans in the row DWT coefficients. In this way, the column processor can start its computation as soon as the first pair of row DWT coefficients is ready. With this improvement, the row and column processors compute the same stage DWT within a few clock cycles of each other.
44 The structure of the 2-D DSA is shown in Fig. 16. The registers R of memory M1 are used to separately hold the even and odd pixels of each row, and to interleave the input pairs of each two consecutive rows, using the multiplexer Mp1 composed of a controller and switches S1 and S2, into the row processor Rp inputs E_R and O_R for the even and odd inputs, respectively. The computation timing of the 2-D DSA is shown in Table 13, where the delay of the row and column processors is assumed to be 1 clock cycle. The row processor of the 2-D DSA is identical to the direct implementation of the 1-D DWT. The column processor is obtained by replacing the 1-pixel delay units in the row processor with 1-row delay units. The low frequency and high frequency coefficients output from the row processor Rp at outputs L_R and H_R are stored in memory M2. The multiplexer Mp2 interleaves the coefficients stored in memory into the column processor Cp. Once the new coefficients have been computed according to the methods in this disclosure for a 1-D DWT, the low frequency output switch of the column processor, which is controlled by the controller, directs the LL subband of each stage DWT to the memory bank M3 through switch S5, controlled by the controller, or, if it is the last stage, switch S5 outputs the coefficients from output L_C as the LL subband. The LL subimage stored in the memory will be returned to the DSA input for further decomposition after the current DWT stage is finished.
The coefficients output from output H_C are either the LH or HH subband.
45 The processing time for the i-th stage is:
0.5 × N²/4^{i−1} + 2T_d.
Because only a quarter of the coefficients are further decomposed, the total processing time for an L-stage 2-D DWT is:
(2/3) N² (1 − 1/4^L) + 2 T_d L.
Compared to a conventional implementation, the DSA uses roughly half of the time to compute the 2-D DWT, and the size of the memory for storing the row transform coefficients is reduced to M rows, where M is the number of delay units in a 1-D filter.
The comparisons of the processing time and memory size are shown in Table 15 and Table 17, respectively. In Table 15, the timing for the RA is based on one input pixel per clock cycle, while the others are based on two input pixels per cycle.
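Plugging representative numbers into the processing-time expressions given above (a sketch only; the function names, the 512×512 / 3-stage case and the pipeline delay T_d are assumptions, and the expressions are those stated in this description):

    def time_2d_ra(n, l, td=5):
        # 2-D recursive architecture (one input pixel per clock cycle).
        return n * n + n + 2 * l * td + 2 ** l - 1

    def time_2d_dsa(n, l, td=5):
        # 2-D dual scan architecture (two input pixels per clock cycle).
        return (2.0 / 3.0) * n * n * (1.0 - 4.0 ** (-l)) + 2.0 * td * l

    for f in (time_2d_ra, time_2d_dsa):
        print(f.__name__, f(512, 3))    # clock cycles for a 512x512 image, 3 stages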
46 As the dynamic range of the DWT coefficients increases with the number of decomposition stages, the number of bits used to represent the coefficients should be large enough to prevent overflow. Bits representing the fractional part can be added to improve the signal to noise ratio (SNR) of the calculated DWT coefficients. In the simulations described below, the filter coefficients and the DWT coefficients are represented in 16 bits (11-bit integer and 5-bit fraction). Therefore, 16-bit multipliers are implemented in our designs, and their results are also rounded to 16 bits. The SNR and PSNR values for the 3-stage forward DWT of the test gray level images are listed in Table 19.
47 The proposed architectures were synthesized and implemented for Xilinx's Virtex II FPGA XC2V250. The 1-D RA implementing the 3-stage 9/7 lifting-based DWT occupies part of the 1536 logic slices available in the FPGA. The 2-D RA implementing the 3-stage Daub-4 DWT uses 879 logic slices, and can compute the DWT of 8-bit gray level images of sizes up to 6000×6000 at 50 MHz using the built-in RAM blocks and multipliers in the FPGA. To estimate the corresponding silicon areas for ASIC designs, we used Synopsys' Design Compiler to synthesize the above architectures with TSMC's 0.18-μm standard cell library aiming for 50 MHz operation. Since the MAC unit is the critical element in the designs, higher operating frequencies can be achieved by implementing faster multipliers or by pipelining the MAC units and minimizing the routing distance of each section of the pipeline. The synthesized designs were then placed and routed by Silicon Ensemble, and the final layouts were generated by using Cadence DFII. The core size of the 1-D RA implementing the 3-stage 9/7 DWT is about 0.177 mm² (90% of which is the datapath, 10% is the controller, and the rest is memory), and the core size of the 2-D RA that calculates the 3-stage Daub-4 DWT of a 256×256 image is about 2.25 mm² (about 15% of which is the datapath, 5% is the controller, and the rest is memory). The core area could be reduced by reimplementing the delay units as register files instead of separate flip-flops, and the performance of the proposed architectures can be further improved by optimizing the circuit designs.
48 We have disclosed two recursive architectures and two dual scan architectures for computing the DWT based on the lifting scheme. Compared to previous implementations of the lifting-based DWT, the disclosed architectures have higher hardware utilization and shorter computation time. In addition, since the recursive architectures can continuously compute the DWT coefficients as soon as the samples become available, the memory size required for storing the intermediate results is minimized. Hence, the sizes and power consumptions of both the 1-D and 2-D recursive architectures are significantly reduced compared to other implementations. In addition, since the designs are modular, they can be easily extended to implement any separable multi-dimensional DWT by cascading N of the basic 1-D DWT processors, where N is the dimension of the DWT, by using the principles set forward in this disclosure. We also believe, on reasonable grounds, that the proposed architectures may be used to implement lifting schemes for multiwavelets.
49 Applications in which wavelet processing, and hence the principles in this disclosure, are potentially useful include, but are not limited to: image processing, compression, texture analysis, and noise suppression; audio processing, compression, and filtering; radar signal processing, seismic data processing, and fluid mechanics; microelectronics manufacturing; inspection of glass, plastic, steel, web and paper products; pharmaceuticals; and food and agriculture.
50 Immaterial modifications may be made to the embodiments disclosed here without departing from the invention.

Table 1. Data Flow for the Three-Stage 1-D Recursive Architecture
x_{i,j} is the input signal; i and j denote the stage and the sequence, respectively; e_{i,j} and o_{i,j} are even and odd intermediate results of each lifting step; l_{i,j} and h_{i,j} are low and high frequency DWT coefficients.
For each clock cycle the table lists the input sample pair, the even/odd outputs E_1; O_1 through E_4; O_4 of the four lifting steps, the resulting coefficient pair l; h, and the stage being computed, showing how the second- and third-stage calculations occupy the idle cycles between the first-stage calculations.
Table 3. Enable Signals for the Input Registers (k is the sample index) of the 1-D RA Implementing the D4 DWT
Time T_en (in clock cycles) | Enable signals
2k | EnR_1, EnD_1
4k + …* | EnR_2, EnD_2**
8k + 9* | EnR_3, EnR'_3, EnD_3**
2^i·k + …* | EnR_i, EnR'_i, EnD_i**
* The actual times are: T_en + 2^{i-1}.
** The actual times are: T_en + 2^{i-1} + latency from S2 to S3.
Table 5. Input Switch Control Timing for the 1-D RA Implementing the D4 DWT
Time T_s (in clock cycles) | S1 | S2 | S3
2k + 1 | e_1 | o_1 | q_1
4k + … | e_2 | o_2 | q_2
8k + … | e_3 | o_3 | q_3
2^i·k + … | e_i | o_i | q_i
* The actual times are: T_s + 2.
Table 7. Computation Time and Hardware Utilization for 1-D Architectures
N: number of input samples; T_d, T_delay: circuit delay; L: number of DWT stages
Architecture | Computation time (clock cycles) | Hardware utilization
RA | N + T_d × L | 50%-90%
Direct implementation | 2N(1 − 1/2^L) + T_delay × L | 50%
Folded | 2N(1 − 1/2^L) + T_delay × L | ≈100%
Table 9. Data Flow for the Three-Stage 2-D Recursive Architecture
x_0,i,j is the input signal; l, i and j denote the stage, the row and the column sequences, respectively; e_l,i,j and o_l,i,j are even and odd intermediate results of each lifting step; l_l,i,j and h_l,i,j are low and high frequency DWT coefficients.
Columns: Clk | Input | Row Processor | FIFOs after Stage 1 | Column Processor | High Frequency Output | Low Frequency Output

Table 11. Switch Control Timing for the 2-D RA Implementing the Daub-4 DWT
Columns: Time T (in clock cycles) | Switch positions S1, S2, S3; successive rows give the times at which each switch selects e1/o1, e2/o2 and e3/o3.
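The row-processor and column-processor pipeline summarized in Table 9 is, at its core, a separable 2-D transform: a 1-D transform is applied along the rows and the result is then transformed along the columns, as in claims 8 through 11. The sketch below shows that decomposition in software with a Haar lifting kernel standing in for the D4 or 9/7 factorizations; it models only the data dependencies, not the FIFOs, line buffers or clocking of the hardware.

import numpy as np

def lift_rows(a):
    # One lifting stage along the last axis (Haar kernel as a stand-in):
    # predict the odd samples from the even ones, then update the evens.
    even, odd = a[..., 0::2].astype(float), a[..., 1::2].astype(float)
    d = odd - even          # predict step -> high-frequency half
    s = even + d / 2.0      # update step  -> low-frequency half
    return s, d

def dwt_2d_one_stage(a):
    # Row transform followed by column transform, giving four subbands.
    lo, hi = lift_rows(a)                      # row transform
    ll, lh = (b.T for b in lift_rows(lo.T))    # column transform of the low half
    hl, hh = (b.T for b in lift_rows(hi.T))    # column transform of the high half
    return ll, lh, hl, hh

img = np.arange(64, dtype=float).reshape(8, 8)
print([band.shape for band in dwt_2d_one_stage(img)])   # four (4, 4) subbands

In the 2-D recursive architecture the column transform starts as soon as enough transformed rows are available, which is why only a few rows of intermediate results need to be buffered.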
Table 13. Data Flow for the 2-D Dual Scan Architecture
x_i,j is the input signal; i and j are the row and column sequences, respectively; e_i,j and o_i,j are even and odd intermediate results of each lifting step; l_i,j and h_i,j are low and high frequency DWT coefficients.
Columns: Clk | Input | Row Processor | Column Processor | Output (ll, lh, hl and hh coefficients)

Table 15. Computation Time and Hardware Utilization for 2-D Architectures (9/7 DWT)
NxN: size of the input image; Td: circuit delay; L: number of DWT stages
RA: N^2 + N + 2L·Td clock cycles; 50%-70% hardware utilization
DSA: 100% hardware utilization
Direct implementation: 50% hardware utilization
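The dual scan data flow of Table 13, and the 100% utilization figure in Table 15, rest on a simple idea: two independent sample streams share one set of lifting functional blocks, so the shared hardware does useful work on every cycle. The toy model below illustrates that sharing with a Haar stand-in kernel and assumed stream contents; it is a behavioural sketch, not the patent's datapath.

def lift_pair(even, odd):
    # One predict/update pair (Haar stand-in kernel).
    d = odd - even
    s = even + d / 2.0
    return s, d

def pairs(samples):
    it = iter(samples)
    return zip(it, it)   # consecutive (even, odd) sample pairs

def dual_scan(stream_a, stream_b):
    # The shared lifting unit alternates between the two streams.
    out_a, out_b = [], []
    for (ea, oa), (eb, ob) in zip(pairs(stream_a), pairs(stream_b)):
        out_a.append(lift_pair(ea, oa))   # "even" cycle: stream A
        out_b.append(lift_pair(eb, ob))   # "odd" cycle:  stream B
    return out_a, out_b

print(dual_scan([1, 3, 5, 7], [10, 14, 20, 26]))
# ([(2.0, 2), (6.0, 2)], [(12.0, 4), (23.0, 6)])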

Table 17. Comparison of Memory Size for 2-D Architectures
NxN: size of the input image; Td: circuit delay; L: number of DWT stages
Columns: Architecture | Memory Size; the architectures compared are the RA for the 9/7 wavelet, the RA for the D4 wavelet, the DSA for the 9/7 wavelet, and a direct implementation.
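The practical point behind Table 17 is the gap between line buffering and frame buffering: the recursive architecture keeps only a few rows of intermediate results, whereas a conventional implementation stores a fixed fraction of the whole array. The sketch below puts assumed numbers on that difference; the buffered row count r and the 25% fraction are illustrative assumptions, not entries from the table.

# Assumed example: an N x N image, r buffered rows for the recursive
# architecture versus a quarter of the array for a conventional design.
N, r = 512, 4

recursive_buffer = r * N          # a few rows of intermediate results
conventional_buffer = N * N // 4  # 25% of the array

print(recursive_buffer, "words versus", conventional_buffer, "words")
# 2048 words versus 65536 words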
Table 19. SNR/PSNR Values for 3-stage forward DWT
Columns: SNR and PSNR (in dB) for the Lena and Barbara test images; rows: Daub-4 and 9/7 wavelets.
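Table 19 reports SNR and PSNR for the reconstructed Lena and Barbara test images. For reference, the definitions assumed here are the usual ones, sketched below; this does not reproduce the table's own figures.

import numpy as np

def snr_db(original, reconstructed):
    # Signal-to-noise ratio in dB: signal energy over error energy.
    x = original.astype(float)
    err = x - reconstructed.astype(float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(err ** 2))

def psnr_db(original, reconstructed, peak=255.0):
    # Peak signal-to-noise ratio in dB for images with the given peak value.
    err = original.astype(float) - reconstructed.astype(float)
    return 10.0 * np.log10(peak ** 2 / np.mean(err ** 2))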

Claims (29)

What is claimed is:
1. Apparatus for digital signal processing, the apparatus comprising:
a cascade of digital filters connected to receive a sampled input signal and samples from a data stream and having an output, in which the digital filters implement a transform decomposed into lifting steps, the cascade of digital filters operating on pairs of samples from the sampled input signal and the data stream; and a multiplexer for multiplexing the samples of the data stream with the sampled input signal for processing by the cascade of digital filters.
2. The apparatus of claim 1 in which the cascade of digital filters provides a dual scan architecture.
3. The apparatus of claim 2 in which the multiplexer is implemented using a switch and a controller for the switch.
4. The apparatus of claim 2 in which the cascade of digital filters implements a one-dimensional discrete wavelet transform.
5. The apparatus of claim 4 in which the one-dimensional discrete wavelet transform is a Daubechies-4 wavelet transform.
6. The apparatus of claim 4 in which the one-dimensional discrete wavelet transform is a 9/7 wavelet transform.
7. The apparatus of claim 1 in which the cascade of digital filters implements filtering steps corresponding to Laurent polynomials.
8. The apparatus of claim 1 in which the cascade of digital filters implements a two-dimensional transform that is decomposed into a first one-dimensional transform followed by a second one-dimensional transform.
9. The apparatus of claim 8 in which the cascade of digital filters comprises a first cascade of digital filters to calculate the first one-dimensional transform and a second cascade of digital filters to calculate the second one-dimensional transform.
10. The apparatus of claim 9 in which the first one-dimensional transform is a row transform and the second one-dimensional transform is a column transform and the sampled input signal is organized as a two-dimensional array with one or more rows and one or more columns.
11. The apparatus of claim 9 in which the first transform is a column transform and the second transform is a row transform and the input data is organized as a two-dimensional array with one or more rows and one or more columns.
12. The apparatus of claim 1 further comprising:
a buffer memory connected to receive samples from the data stream and output the samples to the cascade of digital filters for processing of the data stream that is obtained by interleaving the samples from the data stream with the sampled input signal.
13. The apparatus of claim 12 in which the data stream received by the buffer memory is taken from the output of the cascade of digital filters.
14. The apparatus of claim 13 in which the multiplexer is implemented using a switch and a controller for the switch, and the controller is also connected to the buffer memory to control the loading of the buffer memory.
15. The apparatus of claim 13 in which the cascade of digital filters implements a one-dimensional discrete wavelet transform.
16. The apparatus of claim 15 in which the one-dimensional discrete wavelet transform is a Daubechies-4 wavelet transform.
17. The apparatus of claim 15 in which the one-dimensional discrete wavelet transform is a 9/7 wavelet transform.
18. The apparatus of claim 12 in which the buffer memory comprises single-stage registers.
19. The apparatus of claim 12 in which the cascade of digital filters implements filtering steps corresponding to Laurent polynomials.
20. The apparatus of claim 12 in which the cascade of digital filters implements a two-dimensional transform that is decomposed into a first one-dimensional transform followed by a second one-dimensional transform.
21. The apparatus of claim 20 in which the cascade of digital filters comprises a first cascade of digital filters to calculate the first one-dimensional transform and a second cascade of digital filters to calculate the second one-dimensional transform.
22. The apparatus of claim 20 in which the first one-dimensional transform is a row transform and the second one-dimensional transform is a column transform and the sampled input signal is organized as a two-dimensional array with one or more rows and one or more columns.
23. The apparatus of claim 20 in which the first transform is a column transform and the second transform is a row transform and the input data is organized as a two-dimensional array with one or more rows and one or more columns.
24. The apparatus of claim 1 in which the cascade of digital filters implements an N-dimensional transform, where N is greater than 2, and the number of digital filter cascades is N.
25. A method of transforming a sampled input signal into a transformed output signal, the method comprising the steps of:
operating on pairs of samples of the sampled input signal with a cascade of digital filters that implements a transform decomposed into lifting steps to provide an output; and operating on samples from a data stream using the cascade of digital filters, where the samples from the data stream have been multiplexed with the sampled input signal.
26. The method of claim 25 in which the data stream is taken from the sampled input signal to provide a dual scan architecture.
27. The method of claim 25 in which the data stream is taken from the output of the cascade of digital filters to provide a recursive architecture.
28. The method of claim 25 in which the sampled input signal comprises samples from a two-dimensional image.
29. The method of claim 28 in which the cascade of digital filters implements compression.
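Claims 1, 5 and 16 refer to a transform decomposed into lifting steps, with the Daubechies-4 (D4) wavelet as one embodiment. As a concrete illustration, the sketch below computes one stage of the D4 forward transform using the well-known Daubechies-Sweldens lifting factorization; the coefficients come from that published factorization rather than from this application, and the zero-padded boundary handling is a simplifying assumption.

import math

def d4_forward_stage(x):
    # One stage of the D4 DWT via lifting (Daubechies-Sweldens factorization).
    # x must have even length; out-of-range neighbours are treated as zero.
    r3 = math.sqrt(3.0)
    even, odd = x[0::2], x[1::2]
    n = len(even)

    # Lifting step 1: update the even samples with the odd samples.
    s1 = [even[i] + r3 * odd[i] for i in range(n)]

    # Lifting step 2: predict the odd samples from s1[i] and s1[i - 1].
    d1 = [odd[i] - (r3 / 4.0) * s1[i]
                 - ((r3 - 2.0) / 4.0) * (s1[i - 1] if i > 0 else 0.0)
          for i in range(n)]

    # Lifting step 3: second update of the even samples.
    s2 = [s1[i] - (d1[i + 1] if i + 1 < n else 0.0) for i in range(n)]

    # Scaling step.
    low = [((r3 - 1.0) / math.sqrt(2.0)) * v for v in s2]
    high = [((r3 + 1.0) / math.sqrt(2.0)) * v for v in d1]
    return low, high

low, high = d4_forward_stage([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

Each lifting step is a short multiply-accumulate operation on a pair of samples, which is what lets the claimed cascade of digital filters compute the transform with a small number of shared functional blocks.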
CA 2428393 2003-05-09 2003-05-09 Implementation of discrete wavelet transform using lifting steps Abandoned CA2428393A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA 2428393 CA2428393A1 (en) 2003-05-09 2003-05-09 Implementation of discrete wavelet transform using lifting steps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA 2428393 CA2428393A1 (en) 2003-05-09 2003-05-09 Implementation of discrete wavelet transform using lifting steps

Publications (1)

Publication Number Publication Date
CA2428393A1 true CA2428393A1 (en) 2004-11-09

Family

ID=33426221

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2428393 Abandoned CA2428393A1 (en) 2003-05-09 2003-05-09 Implementation of discrete wavelet transform using lifting steps

Country Status (1)

Country Link
CA (1) CA2428393A1 (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2792300A1 (en) * 2013-04-16 2014-10-22 BIOTRONIK SE & Co. KG Implantable cardiac device adapted to extract a patient's respiratory waveforms from an intrathoracic or intracardiac impedance, pressure and/or accelerometry input stream
US10292599B2 (en) 2013-04-16 2019-05-21 Biotronik Se & Co. Kg Implantable cardiac device adapted to extract a patient's respiratory waveforms from an intrathoracic or intracardiac impedance, pressure and/or accelerometry input stream
CN117171518A (en) * 2023-11-03 2023-12-05 北矿机电科技有限责任公司 On-line filtering method for flotation froth flow velocity signal based on wavelet transformation
CN117171518B (en) * 2023-11-03 2024-02-27 北矿机电科技有限责任公司 On-line filtering method for flotation froth flow velocity signal based on wavelet transformation

Similar Documents

Publication Publication Date Title
Liao et al. Efficient architectures for 1-D and 2-D lifting-based wavelet transforms
Grzeszczak et al. VLSI implementation of discrete wavelet transform
Wu et al. A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec
Xiong et al. Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme
US5838377A (en) Video compressed circuit using recursive wavelet filtering
US5875122A (en) Integrated systolic architecture for decomposition and reconstruction of signals using wavelet transforms
US7480416B2 (en) Implementation of discrete wavelet transform using lifting steps
US6047303A (en) Systolic architecture for computing an inverse discrete wavelet transforms
US5995210A (en) Integrated architecture for computing a forward and inverse discrete wavelet transforms
Das et al. An efficient architecture for 3-D discrete wavelet transform
Huang et al. Efficient VLSI architectures of lifting-based discrete wavelet transform by systematic design method
US5984514A (en) Method and apparatus for using minimal and optimal amount of SRAM delay line storage in the calculation of an X Y separable mallat wavelet transform
US6499045B1 (en) Implementation of a two-dimensional wavelet transform
Marino Two fast architectures for the direct 2-D discrete wavelet transform
JP2005500595A (en) Architecture for discrete wavelet transform
Liu et al. Design and implementation of an RNS-based 2-D DWT processor
US6684235B1 (en) One-dimensional wavelet system and method
US6587589B1 (en) Architecture for performing two-dimensional discrete wavelet transform
CA2428393A1 (en) Implementation of discrete wavelet transform using lifting steps
JP2002158588A (en) Parallel inverse discrete wavelet transform
Bhanu et al. A detailed survey on VLSI architectures for lifting based DWT for efficient hardware implementation
Limqueco et al. A scalable architecture for 2-D discrete wavelet transform
Colom-Palero et al. Flexible architecture for the implementation of the two-dimensional discrete wavelet transform (2D-DWT) oriented to FPGA devices
Nath et al. A high speed, memory efficient line based VLSI architecture for the dual mode inverse discrete wavelet transform of JPEG2000 decoder
Patil et al. Low Power High Speed VLSI Architecture for 1-D Discrete Wavelet Transform

Legal Events

Date Code Title Description
FZDE Dead