AU9030298A

AU9030298A - Variable block size 2-dimensional inverse discrete cosine transform engine

Info

Publication number: AU9030298A
Application number: AU90302/98A
Authority: AU
Inventors: Kenneth D Easton
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 1997-08-25
Filing date: 1998-08-24
Publication date: 1999-03-16
Also published as: WO1999010818A1; KR20010023031A; EP1018082A1; CN1268231A

Description

WO 99/10818 PCT/US98/17423

VARIABLE BLOCK SIZE 2-DIMENSIONAL INVERSE DISCRETE COSINE TRANSFORM ENGINE

BACKGROUND OF THE INVENTION

I. Field of the Invention

The present invention relates to digital signal processing. More particularly, the present invention relates to a novel and improved variable block size 2-dimensional (2-D) inverse discrete cosine transform (IDCT) engine.

II. Description of the Related Art

The 2-dimensional discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) are important signal processing operations in digital image compression. One such digital image compression application is in the area of high definition television (HDTV). In HDTV, the analog video waveform is conditioned and digitized by an analog-to-digital converter (ADC). The resultant sampled data is then digitally processed to minimize the amount of data which must be transmitted and/or stored while retaining high picture quality. Typically, a key element of the compression process is the 2-D discrete cosine transform wherein an NxN block of sampled data, or an image, is transformed from the time domain to the frequency domain. The transformed data can be further processed with block codes, such as Huffman code, run length codes, and/or error correcting codes, such as convolutional codes and Reed-Solomon codes. An exemplary HDTV image compression scheme is disclosed in U.S. Patent No. 5,452,104, U.S. Patent No. 5,107,345, and U.S. Patent No. 5,021,891, all three entitled "ADAPTIVE BLOCK SIZE IMAGE COMPRESSION METHOD AND SYSTEM", and U.S. Patent No. 5,576,767, entitled "INTERFRAME VIDEO ENCODING AND DECODING SYSTEM". All four patents are assigned to the assignee of the present invention and are incorporated by reference herein.

The digitally encoded video waveform is transmitted and/or stored. At the receiver, the reverse of the digital signal processing is performed to reconstruct the pixels of the original image.
The recovered image is provided to a digital-to-analog converter (DAC) which converts the reconstructed image back to an analog video waveform which can be displayed on a monitor or television.

An important element in the decoding process is the inverse discrete cosine transform which transforms the frequency domain data back to the time domain. The IDCT engine is required to run at a high output rate to reconstruct the original image in real time. Furthermore, since the IDCT engine is typically located within a consumer product, cost is a major consideration. The IDCT engine needs to be designed to operate at high speed with minimal complexity.

The digital image compression system typically processes the video signal on a frame-by-frame basis. Each video frame is further partitioned into NxN blocks. In most compression systems, the block size is fixed by the system design to simplify the implementation of the DCT and the IDCT engines.

Permitting variable block sizes can enhance the performance of the compression system under certain conditions to allow for optimal compression of the image and/or to improve the quality of the reconstructed image. Variable block sizes can be used to take advantage of certain characteristics of the image. In the prior art, variable block size DCT and IDCT engines are designed with banks of transform processors of various sizes. Each processor computes a different block size transform on the same data block. The transforms from the various processors are then combined into the desired composite transformed block. This approach can be unwieldy because of the large amount of required hardware and the complexity in coordinating the various hardware blocks.

SUMMARY OF THE INVENTION

The present invention is a novel and improved variable block size 2-dimensional (2-D) inverse discrete cosine transform (IDCT) engine. In accordance with the present invention, the NxN data block is transformed by columns by a first 1-D IDCT processor. The intermediate results from the first IDCT processor are temporarily stored in a transposition memory.
Once all the columns have been processed, the intermediate results are transformed by rows by a second 1-D IDCT processor. The output from the second IDCT processor comprises the transformed output of the IDCT engine.

It is an object of the present invention to provide a 2-D IDCT engine capable of computing any arbitrary mix of transforms within an NxN data block. In the exemplary embodiment, each data block is either a 16x16 transform or a mix of any combination of 8x8 transforms, 4x4 transforms, and/or 2x2 transforms. In the exemplary embodiment, a 21-bit control signal precisely describes the desired partition and informs the IDCT engine to compute the proper combination of transforms. In the present invention, different combinations of transforms can be easily performed by correctly ordering the input data, selectively combining the data before the butterfly stages, and controlling the additions and multiplications at each stage of butterfly. The unnecessary butterflies are placed in the bypass mode.

It is another object of the present invention to simplify the design of the 2-D IDCT engine by providing for serial computations. Serial adders and bit-serial multipliers greatly simplify the design since the computation is performed on only one bit of data at a time. Serial computations also greatly simplify the routing crossbars between successive stages of butterflies. Because of the pipelined structure of the IDCT engine of the present invention, the throughput rate is maintained at the rate of one transformed point or pixel per clock cycle. This is the same throughput rate as with parallel computations. Only the processing delay is increased because of the serial nature of the computations.

It is yet another object of the present invention to minimize the memory requirement.
For a 2-D IDCT, the data block is first transformed by columns by the 1-D IDCT processor and the intermediate results are temporarily stored by columns in a transposition memory. The second 1-D transform is performed only after all the columns have been transformed. Because of the pipelined structure of the IDCT engine, the intermediate results are concurrently written to memory by columns and read from memory by rows. To avoid writing over memory locations containing data which is needed later, the memory is transposed, or alternates between column major and row major, over successive NxN blocks. Using a read-modify-write cycle, the intermediate result is read from a memory location and a new result is written to the same memory location within the same clock cycle. The transposition memory reduces the memory requirement to one bank of memory of the same size as one NxN data block.

BRIEF DESCRIPTION OF THE DRAWINGS

The features, objects, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify correspondingly throughout and wherein:

FIGS. 1A, 1B, and 1C are diagrams of an exemplary NxN image, a partitioned image, and a tree diagram corresponding to the partitioned image of the present invention, respectively;

FIG. 2 is a block diagram of an exemplary variable block-size 2-D IDCT engine of the present invention;

FIGS. 3A-D are exemplary diagrams of a 2-point IDCT trellis, a 4-point IDCT trellis, an 8-point IDCT trellis, and a 16-point IDCT trellis of the present invention, respectively;

FIG. 4 is a block diagram of an exemplary 1-D IDCT processor of the present invention;

FIGS. 5A and 5B are a graphical diagram of a serial butterfly and a block diagram of an exemplary implementation of the serial butterfly of the present invention, respectively;

FIG.
6A and 6B are block diagrams of an exemplary bit-serial multiplier of the present invention in word-wide representation and bit-wide representation, respectively;

FIG. 7 is a block diagram of an exemplary serial adder of the present invention; and

FIG. 8 is a block diagram of an exemplary I/O buffer of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) are important complementary digital signal processing operations. The DCT transforms the sampled data from the time domain into the frequency domain according to the following equation:

$$X(k) = \frac{2C(k)}{N} \sum_{n=0}^{N-1} x(n) \cos\frac{\pi(2n+1)k}{2N} \qquad (1)$$

where N is the dimension of the transform, $C(0) = 1/\sqrt{2}$ and $C(k) = 1$ for k = 1, 2, 3, ..., N-1. The DCT is typically performed on the sampled data as one of a series of digital signal processing operations. Other operations are performed on the transformed data, including quantization, data compression, and error correcting coding. An exemplary digital image compression technique is described in detail in the aforementioned U.S. Patent No. 5,452,104.

The IDCT transforms the data from the frequency domain back into the time domain according to the following equation:

$$x(n) = \sum_{k=0}^{N-1} C(k)\, X(k) \cos\frac{\pi(2n+1)k}{2N} \qquad (2)$$

The DCT and IDCT transforms are separable transforms. This means that a 2-D transform can be broken down into two 1-D transforms. A 2-D IDCT transform can be performed on a data block by first performing a 1-D IDCT transform on the columns of the data block. The intermediate results from the first IDCT transform are temporarily stored in a memory element. A second IDCT transform is then performed on the rows of intermediate results. The output from the second IDCT transform comprises the reconstructed pixels of the original image.

Referring to the figures, FIG.
1A illustrates a diagram of an exemplary data block. Data block 2 is of size NxN where N is a power of two, or $N = 2^x$ where x is a positive integer (1, 2, 3, ...). When this condition is satisfied, equations (1) and (2) can be simplified significantly. In the exemplary embodiment, N is equal to 16, although the present invention can be easily extended to other values of N.

FIG. 2 illustrates an exemplary block diagram of the 2-D IDCT engine 10 of the present invention. In the exemplary embodiment, the input data block, comprising IDCT coefficients, is provided by columns to IDCT processor 20a. IDCT processors 20a and 20b are identical 1-D IDCT processors which perform the IDCT transform on the input data according to equation (2). The intermediate results from IDCT processor 20a are provided to memory element 22 where they are temporarily stored by columns. The intermediate results are then provided by rows to IDCT processor 20b. IDCT processor 20b performs the 1-D IDCT transform and provides the transformed output, or the reconstructed image, to the subsequent digital signal processing block (not shown in FIG. 2). In the exemplary embodiment, the input data block is provided to IDCT processor 20a by columns and the intermediate results are provided to IDCT processor 20b by rows. Alternatively, the data block can be provided to IDCT processor 20a by rows and to IDCT processor 20b by columns. In the exemplary embodiment, IDCT processors 20a and 20b are pipelined such that both IDCT processors 20 are active at the same time.
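
The separability described above can be illustrated with a short numerical sketch. It is only a floating-point reference model, not the serial hardware of the invention, and it assumes that equation (1) reads X(k) = (2C(k)/N) Σ x(n) cos(π(2n+1)k/2N) and equation (2) reads x(n) = Σ C(k) X(k) cos(π(2n+1)k/2N), with C(0) = 1/√2 and C(k) = 1 otherwise. The function names (`scale`, `dct_1d`, `idct_1d`, `apply_2d`) are hypothetical and chosen here for illustration.

```python
import math


def scale(k):
    """C(k) as assumed from the text: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_1d(x):
    """Forward 1-D DCT, following the assumed form of equation (1)."""
    n_points = len(x)
    return [
        2.0 * scale(k) / n_points
        * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
              for n in range(n_points))
        for k in range(n_points)
    ]


def idct_1d(coeffs):
    """Inverse 1-D DCT, following the assumed form of equation (2)."""
    n_points = len(coeffs)
    return [
        sum(scale(k) * coeffs[k]
            * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
            for k in range(n_points))
        for n in range(n_points)
    ]


def transpose(block):
    """Swap rows and columns of a square block stored as a list of rows."""
    return [list(line) for line in zip(*block)]


def apply_2d(transform, block):
    """Apply a 1-D transform to every column, then to every row."""
    # First pass: transform each column, then restore row-major layout.
    by_columns = transpose([transform(col) for col in transpose(block)])
    # Second pass: transform each row of the intermediate results.
    return [transform(row) for row in by_columns]


def idct_2d(coeffs):
    return apply_2d(idct_1d, coeffs)


def dct_2d(block):
    return apply_2d(dct_1d, block)
```

Under these assumptions, `idct_2d(dct_2d(block))` returns the original block to within floating-point rounding, which is the behavior the column pass, intermediate storage, and row pass of FIG. 2 are meant to provide.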

An important property of the IDCT is that a larger transform can be created by arranging the input data points, computing the sum on selective combinations of data points, and performing serial butterfly on the output of two smaller transforms. Serial butterfly is an operation which is described in detail below. Thus, a 16-point IDCT is a butterfly of two 8-point IDCTs, each 8-point IDCT is a butterfly of two 4-point IDCTs, and each 4-point IDCT is a butterfly of two 2-point IDCTs. This property of the IDCT is well known in the art and is best illustrated by a trellis diagram. A trellis diagram of a 2-point IDCT is shown in FIG. 3A, a 4-point IDCT is shown in FIG. 3B, and an 8-point IDCT is shown in FIG. 3C. The 2-point IDCT comprises a single butterfly stage. As shown in FIGS. 3A-3B, the 4-point IDCT comprises a stage of two 2-point IDCTs, a stage of serial add before the 2-point IDCT stage, and a butterfly stage after the 2-point IDCT stage. Similarly, as shown in FIGS. 3B-3C, the 8-point IDCT comprises a stage of two 4-point IDCTs, a stage of serial add before the 4-point IDCT stage, and a butterfly stage after the 4-point IDCT stage.

The diagram of the 16-point IDCT trellis 100 of the present invention is illustrated in FIG. 3D. Trellis 100 is derived by B.G. Lee and described in detail in a book by K.R. Rao, entitled "Discrete Cosine Transform: Algorithms, Advantages, and Applications", Academic Press, 1990. The 16-point IDCT trellis 100 comprises three stages of serial add and four stages of serial butterfly, with each stage of butterfly comprising eight serial butterflies. In the IDCT processors of the prior art, the interconnects between successive stages are fixed, thus limiting the IDCT processor to performing only 16-point IDCT transforms.

In the present invention, the stages are interconnected using reconfigurable trellis cross-connects. As shown in FIG.
3D, 16-point IDCT 110 is a butterfly of two 8-point IDCTs 108, with the data points to the lower 8-point IDCT selectively combined. The two 8-point IDCTs 108 are butterflies of four 4-point IDCTs 106, with the data points to the lower 4-point IDCTs selectively combined. The four 4-point IDCTs 106 are butterflies of eight 2-point IDCTs 104, with the data points to the lower 2-point IDCTs selectively combined. The reconfigurable trellis cross-connects in combination with the bypass mode of the serial butterflies allow the 2-D IDCT engine 10 of the present invention to compute any arbitrary mix of transforms within an NxN block.

By correctly ordering the input data, selectively combining the input data before the butterfly stages, and controlling the additions and multiplications at each stage of the trellis, any combination of 2-point, 4-point, 8-point, and 16-point IDCT transforms can be performed. For example, IDCT processor 20 can perform two 8-point transforms, eight 2-point transforms, one 8-point transform and two 4-point transforms, or one 8-point, one 4-point and two 2-point transforms. In the present invention, there is no need for a combining stage to assemble the different transform outputs into a composite transformed block since this happens automatically when IDCT engine 10 has been configured to do the appropriate mix of transforms. Any time a smaller transform is computed, the serial adders and butterflies not needed to do the higher order transforms revert to delay latches. Thus, the outputs from each IDCT processor 20 are time aligned regardless of the transform mix.

In the exemplary embodiment, each serial butterfly operates on two input bit streams and provides two output bit streams. The serial butterfly comprises one greatly simplified bit-serial multiplier and two serial adders.
The serial structure of IDCT processors 20 permits the routing crossbars between successive stages of serial butterfly to be implemented using only 1-bit-wide data buses.

In the exemplary embodiment, IDCT engine 10 computes the transform of the 16x16 block in 256 clock cycles. For each clock cycle, one IDCT coefficient is supplied to IDCT engine 10 and one output pixel is extracted from IDCT engine 10. IDCT processors 20a and 20b are pipelined such that both processors are active concurrently. Each IDCT processor 20 receives one input data point and provides one transformed data point for each clock cycle.

I. IDCT Processor

An exemplary block diagram of IDCT processor 20 of the present invention is shown in FIG. 4. Every 16 (N=16) clock cycles, the 16 I/O buffers 52 receive 16 input data points, one data point per clock cycle and one data point per I/O buffer 52. The order in which the data points are loaded into I/O buffers 52 depends on the mix of transforms being performed and is controlled by controller 26 through the 4-bit WRITE_ENABLE signal. Each data point comprises q bits which are loaded in parallel into the proper I/O buffer 52 based on WRITE_ENABLE. I/O buffers 52 then serially shift the 16 data points out together, one bit per clock cycle with the LSB first, through routing crossbar 54 to serial adders 56. I/O buffers 52 can be implemented as parallel-to-serial shift registers as described below.
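
The mix of transforms that a single 16-point processor produces can be modeled functionally as follows. This is a behavioral sketch only: it applies the direct summation of equation (2) (assumed to read x(n) = Σ C(k) X(k) cos(π(2n+1)k/2N), with C(0) = 1/√2) to each segment instead of modeling the reconfigurable trellis, and it assumes the 16 input points are already grouped segment by segment. The names `mixed_idct` and `VALID_SIZES`, and the alignment check (each transform starts at a multiple of its own size, as a quadtree partition would imply), are assumptions made for illustration.

```python
import math

VALID_SIZES = (2, 4, 8, 16)


def idct_1d(coeffs):
    """Direct 1-D IDCT per the assumed form of equation (2)."""
    n_points = len(coeffs)
    return [
        sum((1.0 / math.sqrt(2.0) if k == 0 else 1.0) * coeffs[k]
            * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
            for k in range(n_points))
        for n in range(n_points)
    ]


def mixed_idct(points, sizes):
    """Apply a sequence of smaller IDCTs to consecutive segments.

    The outputs are concatenated in place, so no separate combining
    step is needed to form the composite result.
    """
    if sum(sizes) != len(points):
        raise ValueError("transform sizes must cover all data points")
    output = []
    start = 0
    for size in sizes:
        # Assumed constraint: sizes are powers of two and block-aligned.
        if size not in VALID_SIZES or start % size != 0:
            raise ValueError("unsupported or misaligned transform size")
        output.extend(idct_1d(points[start:start + size]))
        start += size
    return output
```

For instance, `mixed_idct(points, [8, 4, 2, 2])` corresponds to the "one 8-point, one 4-point and two 2-point transforms" case mentioned in the text, while `mixed_idct(points, [16])` is the undivided 16-point transform.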

Serial adders 56 receive the data bits and perform serial additions of the bits in the manner described below. Serial adders 56 are enabled by ADD_ENABLE which, in the exemplary embodiment, comprises 7 bits and corresponds to the first three stages of add shown in trellis 100 in FIG. 3D. Each serial add is represented by encircled dots 112 (only one encircled dot is labeled for simplicity). In the first stage, there are seven serial adds 112 which require four bits to enable/disable. In the second stage, there are two sets of three serial adds 112 which require two bits to control. And in the third stage, there are four sets of single serial add 112 which require one bit to control. With the 7-bit ADD_ENABLE signal, serial adders 56 can be controlled to compute the serial adds 112 as required by the first three stages of trellis 100 in FIG. 3D. The bank of 16 serial adders 56 in FIG. 4 symbolically represents the functionality required by the first three stages of trellis 100.

The outputs from serial adders 56 are provided to the first stage 58 of eight serial butterflies which perform the functions shown within eight 2-point IDCTs 104 in FIG. 3D. Each serial butterfly is shown by the block diagram in FIG. 5B. The serial butterfly receives two serial stream inputs, X1 and X2, and produces two serial outputs Z1 = X1 + C·X2 and Z2 = X1 - C·X2, where C is a fixed scalar which is defined according to the position of the butterfly within IDCT trellis 100. An exemplary implementation of the serial butterfly is described in detail below. In the exemplary embodiment, the first stage 58 of serial butterfly is always enabled such that IDCT processor 20 performs at least 2-point transforms. The outputs from the first stage 58 are provided to the second stage 62 of serial butterfly through routing crossbar 60. Routing crossbar 60 interconnects the first two stages as shown by IDCT trellis 100.
In the exemplary embodiment, the second stage 62 of serial butterfly can be selectively enabled to provide 4-point transforms. Since there are four sets of butterfly (see FIG. 3D), a 4-bit control is necessary to individually enable each set. The 4-bit control is part of the control signal labeled as MAP in FIG. 4.

The outputs from the second stage 62 of serial butterfly are provided to the third stage 66 through routing crossbar 64. The third stage 66 comprises two sets of four serial butterflies as shown within two 8-point IDCTs 108 in FIG. 3D. Each set can be individually enabled by a 2-bit control.

The outputs from the third stage 66 are provided to the fourth stage 70 through routing crossbar 68. The fourth stage 70 comprises one set of eight serial butterflies as shown within 16-point IDCT 110 in FIG. 3D. The serial butterflies can be selectively enabled by a 1-bit control. The serial transformed data from the fourth stage 70 comprises the output from IDCT processor 20.

The 1-bit serial transformed data from the fourth stage 70 is routed to a bank of serial-to-parallel output buffers. In the exemplary embodiment, IDCT processor 20 provides the IDCT output in word serial fashion such that one output word is provided for each clock cycle. The output buffers can be combined with the input buffers to form I/O buffer 52 as described in detail below.

II. Controller

Referring to FIG. 2, controller 26 provides the control signals to IDCT processors 20a and 20b and memory element 22. These control signals synchronize IDCT processors 20a and 20b and memory element 22 and determine the reconstructed composite image. Controller 26 receives the ADDRESS input and the PQR input. The ADDRESS input informs controller 26 of the start of the data block. The PQR input comprises the three commands P, Q, and R which inform controller 26 of the desired block partition.
In the exemplary embodiment, R equal to "1" indicates that the 16x16 block is to be divided into smaller 8x8 transform blocks, Q equal to "1" indicates that the 8x8 block is to be divided into smaller 4x4 transform blocks, and P equal to "1" indicates that the 4x4 block is to be divided into smaller 2x2 transform blocks. In the exemplary embodiment, each block can be individually divided without regard to other blocks in the image. Thus, 1-bit control is required for R since there is only one 16x16 transform block within the 16x16 data block, 4-bit control is required for Q since there can be four 8x8 transform blocks within the 16x16 data block, and 16-bit control is required for P since there can be sixteen 4x4 transform blocks within the 16x16 data block. An exemplary partition of a data block 4 is shown in FIG. 1B and an exemplary graphical illustration, e.g. tree diagram 6, of the PQR control corresponding to the image partition is shown in FIG. 1C. The 21-bit control PQR can be provided to controller 26 serially or in parallel.

The PQR input is a 2-D representation of the desired block partition. Controller 26 parses the PQR input into 1-D column and row control signals. These column and row control signals are then used to generate the control signals to command IDCT processors 20a and 20b to perform the proper mix of transforms. For the exemplary partition 4 shown in FIG. 1B, controller 26 commands IDCT processor 20a to perform two 4-point transforms and one 8-point transform for the first four columns of data. For the next two columns of data, controller 26 commands IDCT processor 20a to perform one 4-point transform, two 2-point transforms, and one 8-point transform. The process continues until all columns are processed. The intermediate results from IDCT processor 20a are stored by columns in memory element 22.
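
The parsing of the 2-D PQR partition into a 1-D per-column mix of transforms can be sketched as below. The text does not specify how the Q and P bits are ordered, so this sketch assumes row-major indexing of the four 8x8 quadrants (`q[0..3]`) and of the sixteen 4x4 blocks (`p[0..15]`); the function names `transform_size_map` and `column_mix` are likewise hypothetical. The partition used in the usage note is illustrative and is not claimed to match FIG. 1B.

```python
def transform_size_map(r, q, p, n=16):
    """Return an n x n grid giving the transform size covering each point.

    r: 1 if the 16x16 block is divided into 8x8 blocks.
    q: four bits, one per 8x8 block (assumed row-major order).
    p: sixteen bits, one per 4x4 block (assumed row-major order).
    """
    sizes = [[16] * n for _ in range(n)]
    if not r:
        return sizes  # a single 16x16 transform covers the block
    for row in range(n):
        for col in range(n):
            quadrant = (row // 8) * 2 + (col // 8)
            if not q[quadrant]:
                sizes[row][col] = 8
                continue
            sub_block = (row // 4) * 4 + (col // 4)
            sizes[row][col] = 2 if p[sub_block] else 4
    return sizes


def column_mix(sizes, col):
    """List the 1-D transform sizes met when walking down one column."""
    mix = []
    row = 0
    while row < len(sizes):
        size = sizes[row][col]
        mix.append(size)
        row += size  # an s x s block spans s consecutive rows
    return mix
```

As a usage example under these assumptions, with `r = 1`, `q = [1, 0, 0, 0]` and only `p[4]` set, `column_mix(sizes, 0)` yields `[4, 2, 2, 8]` (one 4-point, two 2-point and one 8-point transform) and `column_mix(sizes, 4)` yields `[4, 4, 8]`. The same function applied to the transposed grid would give the per-row mix for the second processor.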
In a similar manner, controller 26 commands IDCT processor 20b to perform the proper mix of transforms on the rows of intermediate results from memory element 22. After all columns have been processed by IDCT processor 20a, controller 26 commands IDCT processor 20b to perform two 4-point transforms and one 8-point transform for the first four rows of intermediate results. For the next two rows, controller 26 commands IDCT processor 20b to perform one 4-point transform, two 2-point transforms, and one 8-point transform. Again, the process continues until all rows are processed.

Referring to FIG. 4, the control signals generated by controller 26 for IDCT processors 20a and 20b include WRITE_ENABLE, READ_ENABLE, ADD_ENABLE, and MAP. WRITE_ENABLE controls the writing of the input data points into the proper I/O buffers 52 such that the input data points are arranged in the correct order (see FIG. 3D). READ_ENABLE controls the order in which the transformed data is read from IDCT processor 20. In the exemplary embodiment, the transformed data can be sequentially read from IDCT processor 20. ADD_ENABLE controls the first set of serial adders 56 which perform the adds in the first three stages of trellis 100. ADD_ENABLE is dependent on the desired mix of transforms and is generated in accordance with the PQR input. MAP controls the last three stages 62, 66 and 70 of serial butterfly to produce the desired mix of transforms. MAP is also generated in accordance with the PQR input. Four control bits are needed for the second stage 62 to individually enable or disable each of the four sets of butterfly (see FIG. 3D). Similarly, two control bits are needed for the third stage 66 and one control bit is needed for the fourth stage 70. In the exemplary embodiment, no control signal is needed for the first stage 58 since IDCT processor 20 always performs at least a 2-point transform.
However, a control signal can be generated to provide bypass of the first stage 58 if this is desired or necessary. Since the 2-D transform of the present invention is performed serially using two 1-D transforms, controller 26 delays the control signals to IDCT processor 20b, relative to IDCT processor 20a, to synchronize the control signals with the input data.

Controller 26 can be implemented as a combination of combinatorial logic and a state machine. Alternatively, controller 26 can be implemented with a micro-controller or a micro-processor running microcode. Different implementations of controller 26 to perform the functions as described herein are within the scope of the present invention.

III. Transposition Memory

In the exemplary embodiment, memory element 22 can be implemented as a transposition memory. A 2-D transform is achieved by performing a 1-D transform on the columns of the input data block, storing the intermediate results, and performing a 1-D transform on the rows of the intermediate results. The 1-D transform on the rows cannot be performed until all columns have been transformed. In the exemplary embodiment, the two 1-D transforms are pipelined such that both operate concurrently.

Memory element 22 can be implemented as a block of memory as illustrated in FIG. 1A. Assume that the intermediate results from IDCT processor 20a are initially written by columns to memory element 22. IDCT processor 20b cannot operate on the rows of intermediate results until all columns are operated on by IDCT processor 20a. Once the last column of memory element 22 is filled, the intermediate results are provided to IDCT processor 20b in rows. However, because of the pipelined structure, IDCT processor 20a provides one column of data for each row of data retrieved by IDCT processor 20b.
This column of data cannot be written over a previous column since some data points in the previous column are still needed by IDCT processor 20b. To resolve this problem, the new column of intermediate results is written over the row of data just retrieved by IDCT processor 20b. In fact, memory element 22 can be implemented with read-modify-write capability such that the same memory location can be read from and written to on the same clock cycle. Within one clock cycle, one data point can be read from one location of memory element 22 by IDCT processor 20b and that same location can be written to by IDCT processor 20a. Implemented in this manner, memory element 22 is transposed, or alternates between column major and row major, over successive 16x16 blocks. The transposition reduces the memory requirement to only one bank of memory.

The control signals to implement memory element 22 as a transposition memory are provided by controller 26. Controller 26 has the necessary timing information and can synchronize IDCT processors 20a and 20b and memory element 22 with the input data block.
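
The alternating addressing of the transposition memory can be checked with a small behavioral simulation. This is a sketch under simplifying assumptions: each loop iteration stands for one read-modify-write clock cycle, pipeline latency of the IDCT processors is ignored, and the names `address` and `run_transposition_memory` are hypothetical. Each block is supplied in natural row-major form, and the simulation feeds it to the memory column by column, as the first processor would.

```python
def address(step, n, block_index):
    """Memory location touched at a given step of a given block.

    Even-numbered blocks are addressed column by column, odd-numbered
    blocks row by row, so the memory orientation alternates per block.
    """
    major, minor = divmod(step, n)
    return (minor, major) if block_index % 2 == 0 else (major, minor)


def run_transposition_memory(blocks):
    """Stream blocks through one n x n memory; return what is read out.

    Each returned block is rebuilt from the values read in row order,
    which is the order the second 1-D processor needs.
    """
    n = len(blocks[0])
    mem = [[None] * n for _ in range(n)]
    outputs = []
    # One extra pass with no new input drains the final block.
    for index in range(len(blocks) + 1):
        incoming = blocks[index] if index < len(blocks) else None
        read_out = []
        for step in range(n * n):
            row, col = address(step, n, index)
            read_out.append(mem[row][col])      # read half of the cycle
            if incoming is not None:
                # New data arrives column by column from the first pass.
                src_col, src_row = divmod(step, n)
                mem[row][col] = incoming[src_row][src_col]  # write half
        if index > 0:
            outputs.append([read_out[i * n:(i + 1) * n] for i in range(n)])
    return outputs
```

In this model every value of a block is read out row by row before its location is overwritten by the next block, so `run_transposition_memory(blocks)` returns the input blocks unchanged while using only one block-sized memory.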

Memory element 22 can be implemented using a storage element or one of any number of memory devices that are known in the art, such as RAM memory devices, latches, or other types of memory devices.

IV. Serial Butterfly

The serial butterfly is shown in FIGS. 5A and 5B. FIG. 5A is a graphical illustration of the serial butterfly and FIG. 5B is a block diagram of the same serial butterfly. Serial butterfly 140 operates on two inputs X1 and X2. Input X1 is delayed by delay element 148 to align the top and bottom signal paths. Input X2 is scaled by 1/(2C) by bit-serial multiplier 150. C k denotes cos n. The outputs from delay element 148 and multiplier 150 are provided to serial adders 160a and 160b. Serial adder 160a adds the output from multiplier 150 to the output from delay element 148 and serial adder 160b subtracts the output from multiplier 150 from the output from delay element 148. The outputs from the serial adders 160a and 160b comprise the serial butterfly outputs Z1 and Z2, respectively. In the present invention, serial adders 160a and 160b are designed such that the adders can be turned off to allow Y1 and Y2 to pass through as Z1 and Z2, respectively. In the exemplary embodiment, serial butterfly 140 operates on two input bit streams and provides two output bit streams.

An exemplary block diagram of bit-serial multiplier 150 is shown in FIGS. 6A and 6B. FIG. 6A illustrates bit-serial multiplier 150 in a word-wide representation and FIG. 6B illustrates the same multiplier 150 in a bit-wide representation. A bit-serial multiply of X with C is achieved by successively adding C to the intermediate product term and shifting the result by one binary bit position. This is shown by the block diagram in FIG. 6A. Latch 212 is cleared by the LD signal, which is enabled for one cycle out of every 16 clock cycles, to prepare latch 212 for the next multiply.
The LD signal also loads parallel-to-serial shift register 214 with the product term of the just completed multiply from adder 210. The product term is then shifted out serially from register 214 during the next multiply.

In the exemplary embodiment, the precision of the input data X, the constant C, and the product Y is 16 bits. 16 bits of precision results in less arithmetic error than that specified by "IEEE Standard 1180-1990: Specifications for the Implementations of 8x8 Inverse Discrete Cosine Transform." The 16-bit representation can comprise 1 sign bit, 9 magnitude bits, and 6 fractional bits. Other representations of less than 16 bits or more than 16 bits can be contemplated and are within the scope of the present invention.

In the exemplary embodiment, adder 210, latch 212, and register 214 are all implemented with 16 bits. For each clock cycle, one bit of X is shifted into bit-serial multiplier 150, with the LSB first. Depending on the value of the input bit and the LD signal, the constant C is added to the intermediate product term which is stored in latch 212. Within logic circuit 200, AND gate 204 determines whether C is to be added to the intermediate product term based on the input bit and the LD signal. The intermediate product term from adder 210 is then shifted by one bit position and stored back into latch 212, in bit positions D[14..0]. The LSB from adder 210 is discarded and the MSB in latch 212 is sign extended, e.g. D[15] = Co[15] where Co[15] is the carry-out from the MSB of adder 210. As shown in FIG. 6A, bit-serial multiplier 150 can be implemented with the same amount of hardware as an accumulator, which is very compact for IC design.

Bit-serial multiplier 150 is shown in further detail in FIG. 6B. Adder 210, latch 212, and register 214 are shown in bit form.
Depending on the value of input bit X and the LD signal, the constant C can be added to the intermediate product term which is stored in latch 212. Each adder 210 receives a carry-in (Ci) input from adder 210 of the next less significant bit and provides a carry-out (Co) output to adder 210 of the next more significant bit. This is a standard carry chain of an adder.

The simple truncation of the LSBs produces a slight negative bias on the twos complement output product term. This slight negative bias can be offset by adding an LSB on the second to last adder 210a, which produces a half LSB of positive offset in the output product term. The overall offset can be minimized by alternating the truncation and positive offset over successive multipliers 150. The offset is controlled by the ROUND signal which can be hardwired high or low depending on the desired result.

An exemplary block diagram of serial adder 160 is shown in FIG. 7. Serial adder 160 serially receives two inputs Y1 and Y2, with the LSB first. Serial adder 160 can add the two inputs (Y1+Y2), subtract one input from the other input (Y1-Y2), or bypass one input through to the output (Z=Y2). The addition or subtraction depends on the location of serial adder 160 within the IDCT trellis, e.g. whether serial adder 160 is located in the upper leg or the lower leg of the butterfly. The bypass mode allows IDCT processor 20 of the present invention to perform different mixes of transforms.

The inputs Y1 and Y2 are provided serially to AND gate 240 and XOR gate 242, respectively. ADD_EN is also provided to AND gate 240. When ADD_EN is low, the output from AND gate 240 is low and Y1 is not provided to adder 244. When ADD_EN is high, Y1 is provided to adder 244. The INVERT signal is provided to XOR gate 242 and register 246. To perform a subtraction, the input Y2 is converted to a negative number and added to the other operand.
Conversion of a twos complement number to a negative number requires inverting all the bits in the original number and adding a one to the LSB bit position. The inversion of the bits is performed by XOR gate 242 when the INVERT signal is high. The one is added to the LSB bit position of the input number by storing a one in register 246 at the start of the serial add, when the LD signal is enabled and the INVERT signal is high, and adding this value to the carry-in (Ci) of adder 244. For each subsequent clock cycle, the carry-out (Co) from the prior 1-bit add is stored in register 246. The carry-out is added with the next set of bits from the two inputs Y1 and Y2. The sum output S from adder 244 represents the output from serial adder 160. The constant C can be hardwired or mask programmable. Since the first stage 58 of butterfly is always performed in the exemplary embodiment, the constant C for bit-serial multipliers 150 for this stage can be hardwired. However, for the remaining stages 62, 66, and 70 of butterfly, the constant C can be mask programmable to allow multipliers 150 to perform multiplies of the input X2 with either 1/(2C) or 1, when serial butterfly 140 is placed in the bypass mode. Multipliers 150 can also be loaded with other values of C to perform scaling or normalization of the input X2. As represented in FIG. 7, serial adder 160 can perform adds, subtracts, or bypass of the two inputs. Serial adder 160 can be modified to perform the functions as required by serial butterfly 140. For example, referring to FIG. 5B, serial adder 160a only performs adds or bypass. Therefore, serial adder 160 in FIG. 7 can be modified by providing Y1 directly to the B input of adder 244, eliminating XOR gate 242, and providing Y2 to AND gate 240. The INVERT signal can be removed since adder 160a only performs adds. Similarly, serial adder 160b only performs subtracts or bypass. Therefore, the INVERT signal in serial adder 160 can be tied to a high reference.
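The add, subtract, and bypass modes of serial adder 160 can be illustrated with a short behavioral sketch. The helper names `to_bits`, `from_bits`, and `serial_add` are hypothetical, and the model only follows the signal roles named in the text (ADD_EN gating Y1, INVERT complementing Y2 and preloading the carry); it is not a gate-accurate rendering of FIG. 7A.

```python
def to_bits(value, width=16):
    """Twos complement bits of `value`, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Interpret LSB-first bits as a signed twos complement integer."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def serial_add(y1_bits, y2_bits, add_en, invert):
    """One-bit-per-clock add, subtract (Y1 - Y2), or bypass (Z = Y2)."""
    carry = invert  # carry register preloaded at LD: the "+1" of the negation
    out = []
    for a, b in zip(y1_bits, y2_bits):
        a &= add_en             # AND gate: Y1 blocked when ADD_EN is low
        b ^= invert             # XOR gate: Y2 inverted when INVERT is high
        total = a + b + carry   # 1-bit full add
        out.append(total & 1)   # sum output S
        carry = total >> 1      # carry-out stored for the next clock
    return out


if __name__ == "__main__":
    y1, y2 = to_bits(100), to_bits(37)
    print(from_bits(serial_add(y1, y2, add_en=1, invert=0)))  # add
    print(from_bits(serial_add(y1, y2, add_en=1, invert=1)))  # subtract
    print(from_bits(serial_add(y1, y2, add_en=0, invert=0)))  # bypass
```

In this sketch, tying `invert` to 1 corresponds to the subtract-only variant, and dropping the `invert` path corresponds to the add-only variant. Results wrap modulo 2^16, as they would in a fixed-width serial adder.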
Serial adder 160 can also be used to perform serial adds and bypass as required by serial adders 56 in FIG. 4, which implement the serial adds 112 required by the first three stages of trellis 100 as shown in FIG. 3D. Referring to FIG. 5B, delay element 148 can be implemented with a series of latches. The number of latches is selected to match the processing delay of multiplier 150.

V. I/O Buffer

In the exemplary embodiment, within each IDCT processor 20, a bank of 16 I/O buffers 52 receives the input data and provides the transformed data. The input and output from IDCT processor 20 are provided in a word serial manner, or one complete data point per clock cycle. The 16 data points are loaded into the 16 I/O buffers 52 in 16 clock cycles. Once all I/O buffers 52 are loaded, the 16 data points are provided to the IDCT trellis in a bit serial manner, one bit per clock cycle. For each clock cycle, I/O buffers 52 also receive the transformed data bits from the final stage 70 of serial butterfly. The transformed data is provided serially to I/O buffer 52. An exemplary block diagram of one I/O buffer 52 is shown in FIG. 8. I/O buffer 52 comprises 16-bit latch 262, 16-bit parallel-to-serial shift register 264, 16-bit latch 266, and output buffer 268. The IDCT input is provided to all latches 262 within the 16 I/O buffers 52. Each I/O buffer 52 latches the IDCT input when directed by the control signal WR(w). WR(w) is decoded from the WRITE_ENABLE signal which originates from controller 26. Latch 262 within each I/O buffer 52 is enabled for only one out of every sixteen clock cycles. After the 16 data points have been latched by latches 262, the LD signal is enabled and the values stored in latches 262 are provided to registers 264. For each I/O buffer 52, the LSB register 264a serially shifts the data point to routing crossbar 54, one bit per clock cycle with the LSB bit shifted out first. For each clock cycle, one transformed data bit is serially shifted into the MSB register 264q, with the LSB shifted in first. After 16 clock cycles, all 16 data bits are shifted out to routing crossbar 54 and all 16 transformed data bits are shifted into register 264.
Every 16 clock cycles, the LD signal loads register 264 with the next data point and loads latch 266 with the transformed data point. The transformed data is stored in latch 266 until it is read out through output buffer 268. Output buffer 268 is selectively enabled such that the transformed data is provided serially from the 16 I/O buffers 52, one transformed data point per clock cycle. The order of the read is controlled by the RD(w) signal which is decoded from READ_ENABLE. The block diagram in FIG. 8 represents one implementation of I/O buffer 52. Other implementations which perform the same functions as described herein can be contemplated and are within the scope of the present invention.
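The word-serial to bit-serial conversion performed by one I/O buffer 52 can be illustrated with a minimal behavioral sketch. The class name `IOBuffer` and its method names are hypothetical; the three fields follow the roles of latch 262, shift register 264, and latch 266 as described above, and the WR(w) and RD(w) decoding and output buffer 268 are not modelled.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


class IOBuffer:
    """Behavioral sketch of one I/O buffer (not a gate-level model)."""

    def __init__(self):
        self.latch_in = 0   # role of latch 262: holds the next input word
        self.shift_reg = 0  # role of register 264: shifts out and in
        self.latch_out = 0  # role of latch 266: holds the transformed word

    def write(self, word):
        """Word-serial write, as when WR(w) selects this buffer."""
        self.latch_in = word & MASK

    def load(self):
        """LD: capture the shifted-in result, then load the next input."""
        self.latch_out = self.shift_reg
        self.shift_reg = self.latch_in

    def clock(self, bit_in):
        """Shift one bit out at the LSB end and one bit in at the MSB end."""
        bit_out = self.shift_reg & 1
        self.shift_reg = (self.shift_reg >> 1) | ((bit_in & 1) << (WIDTH - 1))
        return bit_out

    def read(self):
        """Word-serial read, as when RD(w) selects this buffer."""
        return self.latch_out


if __name__ == "__main__":
    buf = IOBuffer()
    buf.write(0x1234)
    buf.load()
    sent = 0
    for i in range(WIDTH):
        # 0xBEEF stands in for transformed data returning from the trellis
        sent |= buf.clock((0xBEEF >> i) & 1) << i
    buf.load()
    print(hex(sent), hex(buf.read()))
```

Because one bit leaves at the LSB end while another enters at the MSB end on every clock, a single 16-bit register serves both directions, and after 16 clocks the first bit shifted in has arrived at bit position 0.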

Although the present invention has been described in the context of a 2-D IDCT engine, the inventive concept can be extended to other transforms such as the discrete Fourier transform (DFT), inverse discrete Fourier transform (IDFT), fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), discrete cosine transform (DCT), and Hadamard transform. Therefore, application of the inventive concept as described herein to other transforms is within the scope of the present invention. The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A variable block size IDCT processor comprising: a bank of adders for receiving a plurality of input data points and a first control signal, said first control signal commanding said bank of adders to perform additions on selected combinations of input data points; a plurality of stages of butterfly; and a plurality of routing crossbars, one routing crossbar interposed between said bank of adders and a first stage of butterfly and one routing crossbar interposed between successive stages of butterfly; wherein said plurality of stages of butterfly receive a second control signal commanding said plurality of stages of butterfly to perform butterfly operations on selected inputs to said plurality of stages of butterfly.

2. The IDCT processor of claim 1 wherein said bank of adders and said plurality of stages of butterfly are implemented with serial adders and bit-serial multipliers.

3. The IDCT processor of claim 2 further comprising: a bank of I/O buffers for receiving said plurality of input data points in a word serial format and for providing said bank of adders with said input data points in a bit serial format.

4. The IDCT processor of claim 3 wherein said plurality of stages of butterfly are pipelined such that all stages are active concurrently.

5. The IDCT processor of claim 4 wherein said multiplicands of said bit-serial multipliers can be programmed with a mask.

6. A variable block size 2-dimensional IDCT engine comprising: a first IDCT processor, said first IDCT processor receiving input data points; a memory element connected to said first IDCT processor; a second IDCT processor connected to said memory element; and a controller connected to and providing control signals to said first IDCT processor, said second IDCT processor, and said memory element, said controller receiving input signals and generating control signals in accordance with said input signals.

7. The IDCT engine of claim 6 wherein said IDCT processors comprise: a bank of adders for receiving said plurality of input data points and a first control signal, said first control signal commanding said bank of adders to perform additions on selected combinations of input data points; a plurality of stages of butterfly; and a plurality of routing crossbars, one routing crossbar interposed between said bank of adders and a first stage of butterfly and one routing crossbar interposed between successive stages of butterfly; wherein said plurality of stages of butterfly receive a second control signal commanding said plurality of stages of butterfly to perform butterfly operations on selected inputs to said plurality of stages of butterfly.

8. The IDCT engine of claim 7 wherein said bank of adders and said plurality of stages of butterfly are implemented with serial adders and bit-serial multipliers.

9. The IDCT engine of claim 8 wherein said IDCT processors further comprise: a bank of I/O buffers for receiving said plurality of input data points in a word serial format and for providing said bank of adders with said input data points in a bit serial format.

10. The IDCT engine of claim 9 wherein said IDCT processors are pipelined such that both IDCT processors are active concurrently.

11. The IDCT engine of claim 10 wherein said first stage of butterfly is always enabled.

12. The IDCT engine of claim 11 wherein said multiplicands of said bit-serial multipliers can be programmed with a mask.

13. The IDCT engine of claim 12 wherein said IDCT engine has a throughput rate of one output pixel per clock cycle.

14. The IDCT engine of claim 13 wherein said serial adders and bit-serial multipliers have resolution of greater than 8-bit.

15. The IDCT engine of claim 14 wherein said serial adders and bit-serial multipliers have resolution of 16-bit.

16. The IDCT engine of claim 15 wherein said memory element comprises a transposition memory.

17. An apparatus for performing a variable block size 2-dimensional IDCT transform comprising: first IDCT transform means for performing a 1-dimensional IDCT transform of a plurality of input data points; memory means for storing intermediate results from said first IDCT transform means; second IDCT transform means for performing a 1-dimensional IDCT transform of said intermediate results; and controller means for providing control signals to said first IDCT transform means, said second IDCT transform means, and said memory means, said controller means receiving input signals and generating said control signals in accordance with said input signals.

18. The apparatus of claim 17 wherein said IDCT transform means comprises: a stage of adder means for receiving said plurality of input data points and a first control signal, said first control signal commanding said adder means to perform additions on selected combinations of input data points; a plurality of stages of butterfly means for performing butterfly operations on pairs of input data; routing means for routing signals between said stage of adder means and said plurality of stages of butterfly means; wherein said plurality of stages of butterfly means receive a second control signal commanding said plurality of stages of butterfly means to perform butterfly operations on selected pairs of input to said plurality of stages of butterfly means.

19. A transform engine comprising: a first transform processor, said first transform processor receiving input data points; a memory element connected to said first processor; a second transform processor connected to said memory element; and a controller connected to and providing control signals to said first transform processor, said second transform processor, and said memory element, said controller receiving input signals and generating control signals in accordance with said input signals.

20. The transform engine of claim 19 wherein said transform processors comprise: a plurality of stages of butterfly; and a plurality of routing crossbars, one routing crossbar interposed between successive stages of butterfly; wherein said plurality of stages of butterfly receive a second control signal commanding said plurality of stages of butterfly to perform butterfly operations on selected inputs to said plurality of stages of butterfly.