GB2264609A - Data compression - Google Patents



Publication number
GB2264609A
Authority
GB
United Kingdom
Granted
Application number
GB9304142A
Other versions
GB2264609B (en)
GB9304142D0 (en)
Inventor
Martin P Boliek
James D Allen
Steven Michael Blonstein
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Publication of GB9304142D0
Publication of GB2264609A
Application granted
Publication of GB2264609B
Anticipated expiration
Expired - Fee Related


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/147 Discrete orthonormal transforms, e.g. discrete cosine transform, discrete sine transform, and variations therefrom, e.g. modified discrete cosine transform, integer transforms approximating the discrete cosine transform
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H04N19/126 Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding

Description

AN APPARATUS AND METHOD FOR COMPRESSING STILL IMAGES

The present invention relates to an apparatus and corresponding method for compressing still images which is compatible with the JPEG (Joint Photographic Experts Group) Still Image Compression Standard.
When high-quality images must be compressed to save memory or transmission requirements, it is common practice to first transform the images to another space where the information can be represented more compactly. This is usually done block-by-block with a linear transformation (matrix multiplication); a typical arrangement is to perform 8-point transforms on row segments of 8 pixels and then to perform 8-point transforms on the 8-element column segments of this row-transformed image. Equivalently, a single 64-point transform can be performed on a pixel block of 64 pixels arranged in an 8-by-8 block.
A good choice for a one dimensional transform is the discrete Chebychev transform:

    F(u) = C(u) SUM(i=0..7) f(i) cos(u(2i+1)pi/16)

where

    C(u) = sqrt(2)/8  for u = 0
    C(u) = 2/8        otherwise

There are several advantages of this transform including a) the compression is near-optimal by some measures, b) there are fast computational algorithms for performing this transform and its inverse, and c) deblurring (enhancement of the initial image) can be readily performed in the transform space, given certain assumptions described in Reference [1].
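With the normalization given above (C(0) = sqrt(2)/8, C(u) = 2/8 otherwise) the inverse has a simple closed form. The reconstruction weights in the sketch below are derived from the cosine orthogonality relations, not taken from the patent:

```python
import math
import random

def C(u):
    return math.sqrt(2) / 8 if u == 0 else 2 / 8

def forward(f):
    # F(u) = C(u) * sum_{i=0..7} f(i) cos(u(2i+1)pi/16)
    return [C(u) * sum(f[i] * math.cos(u * (2 * i + 1) * math.pi / 16) for i in range(8))
            for u in range(8)]

def inverse(F):
    # Weights follow from sum_i cos^2(u(2i+1)pi/16) = 4 for u = 1..7 and = 8 for u = 0
    return [F[0] / math.sqrt(2) +
            sum(F[u] * math.cos(u * (2 * i + 1) * math.pi / 16) for u in range(1, 8))
            for i in range(8)]

random.seed(2)
f = [random.uniform(0, 255) for _ in range(8)]
rec = inverse(forward(f))
assert max(abs(x - y) for x, y in zip(f, rec)) < 1e-9
```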
2.0 Objects and Summary of the Invention
It is an object of the present invention to provide an apparatus and corresponding method for compressing still images.
It is a more particular object to provide an apparatus and corresponding method for compressing still images while remaining compatible with a JPEG Transform.
Another object of this invention is the optimization of the use of bits in the quantization and scaling steps of data compression. Another object of this invention is the minimization of mean square error in data compression schemes where quantization and coefficient scaling are combined. A further object of this invention is the utilization of a fixed number of bits in a manner which optimizes the range and resolution of data compression. Another object of this invention is to meet the H.261 specification for resolution for small quantization values. In particular, an object of this invention is to provide a scheme for using a 16 to 1 multiplexer and a 16 bit multiplier to allow quantization pre-scaling with a dynamic range of 28 bits.
Another object of this invention is the utilization of the speed of the Generalized Chen Transform to maximum advantage in a pipelined implementation of the process. A further object of this invention is to minimize the number of gates required to perform a transform. In particular, an object of this invention is to take advantage of the speed of the adder network section of the transform to perform the vertical and horizontal transform additions with the same hardware.
Additional objects, advantages and novel features of the present invention will be set forth in part in the description which follows and in part become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the present invention may be realized and attained by means of the instrumentalities and combinations which are pointed out in the appended claims.
3.0 Brief Description of the Drawings

The accompanying drawings which are incorporated in and form a part of this specification illustrate an embodiment of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1A shows a block diagram of a compressor, and Fig. 1B shows a block diagram of a decompressor, according to the present invention.
Figs. 2A-2C show input pixel ordering, block timing and vector timing of data according to the present invention.

Fig. 3 shows a three-point transform of RGB to XYZ data.

Figs. 4A and 4B show possible VLSI layouts according to the present invention.

Fig. 5 shows a diagram of a shift register utilized with the present invention.

Fig. 6A shows a diagram of a shift array according to the present invention.

Fig. 6B shows an example of the shift array of Fig. 6A.

Fig. 7 shows a diagram of combined data flow.

Figs. 8A and 8B show diagrams of forward add arrays according to the present invention.

Fig. 9 shows a diagram for two dimensional generalized Chen transform according to the present invention.
Fig. 10 shows a block diagram of a preferred embodiment of the present invention.

Fig. 11 shows a hardware implementation of the inverse pre-scaling and quantization with a shift before the multiplication.

Fig. 12 shows a hardware implementation of the inverse pre-scaling and quantization with a shift after the multiplication.

Fig. 13 is a flow chart showing a conventional implementation of a two-dimensional DCT computation.

Fig. 14 is a flow chart depicting an implementation of the two-dimensional Generalized Chen Transform which takes advantage of the speed of the transform of the present invention.
4.0 Theoretical Discussion of the Invention

A complete system for the compression and reconstruction of images may appear as follows in Table 1.

(Table 1)

        64 input pixels
    A)  Discrete Chebychev (or similar) Transform on Rows
    B)  Discrete Chebychev (or similar) Transform on Columns
    C)  Multiply by Rate Scaler  <-- Z) (optional) Classify Difficulty, from neighbors
    D)  Multiply by Psychoadaptive Weights
    E)  Multiply by Deblurring Weights
    F)  Threshold, Quantize, Encode and Transmit
    G)  Receive, Decode, and Interpolate
    H)  Multiply by Inverse Rate Scaler
    I)  Multiply by Inverse Psychoadaptive Weights
    J)  Inverse Discrete Chebychev Transform
    K)  Inverse Discrete Chebychev Transform
    L)  Smooth Pixel Block Boundaries (optional, uses neighbors)
        64 reconstructed pixels

Table 1 above describes the invention and, with optional steps (L,Z) omitted, current technology as well.
The multiplication by the deblurring weights (step E) can also be performed as a decode step (e.g., after step I).
The deblurring is done to compensate for the point-spread function of the input device. It must be tuned to the device or omitted when the input image has already been enhanced. There are other better ways to deblur the image, but the method shown here is computationally cheap and is appropriate in some applications, e.g. a color copier.
It is possible to arrange the computation of the forward transform (A,B) so that much of the computational workload consists of a final multiply stage. By precomputing the products of these multipliers and those in steps (C-E), the compression process can be expedited.
Similarly it is possible to arrange the computation of the inverse transform (J,K) so that much of the computation workload consists of a preliminary multiply stage. Again, by precomputation of the products, the computational effort of steps (H,I) is effectively eliminated.
Furthermore another transform is substituted for the 2-D DCT transform, yielding further computational simplicity.
Furthermore the psychoadaptive weights can be selectively varied to make the combined multipliers for steps (B-D) computationally efficient, e.g. powers of two. Small changes in the psychoadaptive weights of the low-energy output transform elements have little effect on image quality or compression rate.
Finally, attention is called to the steps (L,Z) in Table 1, Classification of Image Difficulty and Smoothing Block Boundaries. Since these are optional and independent of the primary invention, they are given minimal discussion in this paper.
4.1 Chen Algorithm

The one dimensional Chen algorithm [3] states that

    X = (2/N) A_N x    (1)

where x is a data vector, X is a transformed vector, and A_N is the following:

    A_N(j,k) = c(k) cos((2j+1)k pi/2N);  j, k = 0, 1, 2, ..., N-1.

Further, A_N can be decomposed in the following manner:

    A_N = Z | A_(N/2)     0     | B_N    (2)
            |    0     R_(N/2)  |

where R_(N/2) is the following:

    R_(N/2)(j,k) = c(2k+1) cos((2j+1)(2k+1) pi/2N);  j, k = 0, 1, 2, ..., N/2-1.    (3)

Note that the matrix Z is Chen's matrix P-bar. The notation has been changed to avoid confusion with the matrix P in the present application.
4.1.1 Eight Point (N = 8) 1D Chen Transform Example

To do an eight point transform, the Chen algorithm, equation 2, is used twice recursively. The first iteration uses the matrices Z8, R4, and B8. The second iteration solves for A4 and uses the matrices Z4, R2, A2, and B4. These are easily derived from the above equations or the Chen paper [3].
    A8 = Z8 | A4  0  | B8 ,   where A4 = Z4 | A2  0  | B4
            | 0   R4 |                      | 0   R2 |

and where

    Z8 = | 1 0 0 0 0 0 0 0 |      B8 = | 1 0 0 0  0  0  0  1 |
         | 0 0 0 0 1 0 0 0 |           | 0 1 0 0  0  0  1  0 |
         | 0 1 0 0 0 0 0 0 |           | 0 0 1 0  0  1  0  0 |
         | 0 0 0 0 0 1 0 0 |           | 0 0 0 1  1  0  0  0 |
         | 0 0 1 0 0 0 0 0 |           | 0 0 0 1 -1  0  0  0 |
         | 0 0 0 0 0 0 1 0 |           | 0 0 1 0  0 -1  0  0 |
         | 0 0 0 1 0 0 0 0 |           | 0 1 0 0  0  0 -1  0 |
         | 0 0 0 0 0 0 0 1 |           | 1 0 0 0  0  0  0 -1 |

    R4 = | C1          C4(C7+C1)   C3          C4(C5-C3) |
         | C5         -C4(C5+C3)   C7          C4(C7-C1) |
         | C4(C1-C7)   C7         -C4(C3+C5)  -C5        |
         | C4(C3-C5)   C3          C4(C1+C7)  -C1        |
    Z4 = | 1 0 0 0 |      B4 = | 1 0  0  1 |
         | 0 0 1 0 |           | 0 1  1  0 |
         | 0 1 0 0 |           | 0 1 -1  0 |
         | 0 0 0 1 |           | 1 0  0 -1 |

    R2 = | C2  C6 |       A2 = | 1/sqrt(2)   1/sqrt(2) |
         | C6 -C2 |            | 1/sqrt(2)  -1/sqrt(2) |

where, from equation 3, Cn = cos(n pi/16).

4.1.2 Chen-Wu (Modified) or Parametrized Transform

Thus far all that has been done is the Chen transform. One could multiply it out and realize a computational savings over a brute force, multiply intensive DCT implementation. This is not what applicant has provided, however. To reduce the number of multiplications to the bare minimum, the matrices are reparametrized to the following. This is what applicant calls the Chen-Wu (modified), which is applicant's creation.
    R4 = RF4 RA4,  where

    RA4 = | a   r(1+a)   r(a-1)   1 |
          | c   r(1-c)  -r(1+c)  -1 |
          | 1  -r(1+c)   r(c-1)   c |
          | 1   r(1-a)   r(a+1)  -a |

    R2 = 1/sqrt(b^2+1) | b   1 |
                       | 1  -b |

    RF4 = | 1/sqrt(a^2+1)  0              0              0             |
          | 0              1/sqrt(c^2+1)  0              0             |
          | 0              0              1/sqrt(c^2+1)  0             |
          | 0              0              0              1/sqrt(a^2+1) |

and where

    a = C1/C7 = cos(pi/16)/cos(7pi/16) = tan(7pi/16),
    b = C2/C6 = tan(6pi/16),
    c = C3/C5 = tan(5pi/16),
    r = C4 = cos(4pi/16).    (4)

Note that the diagonal matrix, RF4, contains the normalization factors of the unparametrized matrix, RA4. Also note that a diagonal matrix can be made of the constants in R2 and A2.
Upon reconstruction of the A8 matrix, two matrices are kept distinct. The diagonal matrices are kept separate from the main matrix. The main matrix is multiplied by the B terms. After the appropriate reordering and multiplication by the constant term, equation 1 reduces to the following.
    X = Q(a,b,c) P(a,b,c,r) x    (5)

where

    Q(a,b,c) = diag( 1/(2 sqrt(2)),
                     1/(2 sqrt(a^2+1)),
                     1/(2 sqrt(b^2+1)),
                     1/(2 sqrt(c^2+1)),
                     1/(2 sqrt(2)),
                     1/(2 sqrt(c^2+1)),
                     1/(2 sqrt(b^2+1)),
                     1/(2 sqrt(a^2+1)) )

    P(a,b,c,r) =
      | 1    1         1         1    1    1         1         1  |
      | a    r(a+1)    r(a-1)    1   -1    r(1-a)   -r(a+1)   -a  |
      | b    1        -1        -b   -b   -1         1         b  |
      | c    r(1-c)   -r(c+1)   -1    1    r(c+1)    r(c-1)   -c  |
      | 1   -1        -1         1    1   -1        -1         1  |
      | 1   -r(c+1)    r(c-1)    c   -c    r(1-c)    r(c+1)   -1  |
      | 1   -b         b        -1   -1    b        -b         1  |
      | 1    r(1-a)    r(a+1)   -a    a   -r(a+1)    r(a-1)   -1  |    (6)

4.2 The Generalized Transform

The Generalized 8-point DCT transform is determined by four parameters a, b, c, and r and can be written as

    T(a,b,c,r) = P(a,b,c,r) x Q(a,b,c)

where P() and Q() are as above.
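When the four parameters take the true cosine values of equation 4 (a = tan(7pi/16), b = tan(6pi/16), c = tan(5pi/16), r = cos(4pi/16)), the product of Q and P should reproduce the orthonormal 8-point DCT matrix row for row. The sketch below is my own numerical check of that claim, not part of the patent:

```python
import math

pi = math.pi
a, b, c = math.tan(7 * pi / 16), math.tan(6 * pi / 16), math.tan(5 * pi / 16)
r = math.cos(4 * pi / 16)  # = sqrt(1/2)

# P(a,b,c,r) as written in equation 6
P = [
    [1,  1,            1,            1,   1,   1,            1,            1],
    [a,  r * (a + 1),  r * (a - 1),  1,  -1,   r * (1 - a), -r * (a + 1), -a],
    [b,  1,           -1,           -b,  -b,  -1,            1,            b],
    [c,  r * (1 - c), -r * (c + 1), -1,   1,   r * (c + 1),  r * (c - 1), -c],
    [1, -1,           -1,            1,   1,  -1,           -1,            1],
    [1, -r * (c + 1),  r * (c - 1),  c,  -c,   r * (1 - c),  r * (c + 1), -1],
    [1, -b,            b,           -1,  -1,   b,           -b,            1],
    [1,  r * (1 - a),  r * (a + 1), -a,   a,  -r * (a + 1),  r * (a - 1), -1],
]

# Diagonal of Q(a,b,c)
q = [1 / (2 * math.sqrt(2)), 1 / (2 * math.sqrt(a * a + 1)),
     1 / (2 * math.sqrt(b * b + 1)), 1 / (2 * math.sqrt(c * c + 1)),
     1 / (2 * math.sqrt(2)), 1 / (2 * math.sqrt(c * c + 1)),
     1 / (2 * math.sqrt(b * b + 1)), 1 / (2 * math.sqrt(a * a + 1))]

T = [[q[u] * P[u][j] for j in range(8)] for u in range(8)]

# Orthonormal DCT matrix for comparison
D = [[math.sqrt((2.0 if u else 1.0) / 8) * math.cos(u * (2 * j + 1) * pi / 16)
      for j in range(8)] for u in range(8)]

err = max(abs(T[u][j] - D[u][j]) for u in range(8) for j in range(8))
assert err < 1e-12
```

With rational parameters such as a = 5, b = 2.5, c = 1.5 the same construction keeps the sign and symmetry structure of the rows but only approximates the DCT; as section 4.2.1 notes, exact orthogonality additionally requires r = sqrt(1/2).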
The image transformation requires two such transforms T, namely Tv and Th, to transform the image vertically and horizontally respectively. The complete two-dimensional transform is represented by

    [F] = [Tv] [f] [Th]^t

where f is the input image block, F is the output transform coefficients, and the superscript "t" denotes matrix transposition. Here all matrices are 8 by 8. Since a diagonal matrix (such as Q) is its own transpose, and ([A][B])^t = [B]^t [A]^t for all matrices, and

    [Tv] = [Pv][Qv],
    [Th] = [Ph][Qh],

we can write

    [F] = [Qv][Pv] [f] [Ph]^t [Qh]

which reduces to

    F(i,j) = q(i,j) g(i,j)    (7)

where

    [g] = [Pv] [f] [Ph]^t    and    q(i,j) = Qv(i,i) Qh(j,j).
When transforming an image block, we shall solve for [g] using the Chen-Wu algorithm and then multiply by the factors q(i,j). Given

    Pv = P(a, b, c, rv)    and    Ph = P(a, b, c, rh)

the inverse of the above transformation is expressed by

    [f] = [P'v]^t [Qv] [F] [Qh] [P'h]

where

    P'v = P(a, b, c, 1/(2 rv))    and    P'h = P(a, b, c, 1/(2 rh)).

Again solution is via the Chen-Wu algorithm.
4.2.1 Chen's Algorithm

Several methods have been devised to speed up the computation of the 1-D or 2-D Chebychev Transforms and their inverses.
"0 There is a well-known algorithm (Chen) [2,3] which multiplies an arbitrary 8-tuple by the matrix T above using only 16 multiples, 13 additions and 13 subtractions. This algorithm does not rely on any special properties of the parameters a, b, c and r.
2. Cben-Wu Alaorithm Modified[ By factoring [T] = [P] [Q] as above, Chen's algorithm is split into two sages, with 8 multiplies used in the multiplication by [Q], 8 multiplies and the rest of the arithmetic used in the multiplication by [P]o This is a consequence of our choice for [Q]; several elements of [P] become 1 or - ! and a multiplication vanishes.
.DTD:
As indicated above, similar simplifications apply to the inverse transform, the 2-D transform, and the inverse 2-D transform. For 8-by-8 blocks, 128 multiplies are used for either the forward or reverse 2-D transform (excluding the multiplies by [q]). When the internal dataflow of Chen's algorithm is viewed, these multiplies are embedded in a structure of eight add/subtract stages and four multiply stages.
It is important to stress that the Chen algorithm operates regardless of the parameters a, b, c and r. However the 8-point DCT employed in prior art has the parameters of the "true cosine transform":

    a = tan(7pi/16)
    b = tan(6pi/16)
    c = tan(5pi/16)
    r = sqrt(1/2) = 0.70710678...

with the choice of r necessary and sufficient for matrix T to be orthogonal.
4.3 Choice of Parameter Values

The Chen transform works regardless of the values selected for parameters a, b, c, and r. This is because the transform created by QP is orthogonal. It is completely possible to use any numbers and have a transform that will perform the desired decorrelation of the image data necessary for compression. Note that this transform is not a Discrete Cosine Transform nor is it an approximation of a DCT. It is its own transform.
However, for efficient decorrelation of the input image, and for transformation into relatively meaningful spatial frequency coefficients, it is generally agreed that the DCT is very desirable [5]. Thus, to achieve the benefits of the DCT the parameters are set to approximate those of the DCT given in equation 4. The opposing factor is efficiency of computation. Since an add is cheaper than a multiply (in hardware the savings is silicon real estate, in software it is number of cycles) the parameters are chosen to be computationally efficient.
4.4 Alternative Algorithms

Other computational solutions have been devised for the Discrete Chebychev Transform. For example an algorithm due to Lee performs the 8-point 1D and the 64-point 2D transforms with 12 and 144 multiplies respectively [4,5].
However there are several disadvantages of these "faster" algorithms compared with the Chen algorithm:
a) The simplification T = P x Q (and the similar factoring for the reverse transform) is no longer operative. Separating the diagonal matrix Q is essential for the simplifications which follow.
b) These algorithms do not function with arbitrary parameters a, b, c and r. Instead they rely on various trigonometric identities valid specifically for the true cosine parameters.
c) These algorithms have a more difficult structure. This can impede engineering and increase the potential for numeric instability.
4.5 Discussion of the Invention

A] Referring again to Table 1, it will be noted that steps (C,D,E) can be folded into the forward transform post-multipliers derived from [Q]. Similarly the steps (H,I) can be folded into the inverse transform pre-multipliers. This is because the rate scaler operation, psychoadaptive weights operation (commonly known as quantization values), and deblurring weights operation are all point multiply operations. If b, c, d, e are the outputs of steps B, C, D, E respectively, then

    c(i,j) = b(i,j) q(i,j)
    d(i,j) = c(i,j) r(i,j) = b(i,j) q(i,j) r(i,j)
    e(i,j) = d(i,j) u(i,j) = b(i,j) q(i,j) r(i,j) u(i,j)

or

    e(i,j) = b(i,j) all(i,j),  where  all(i,j) = q(i,j) r(i,j) u(i,j)    (8)

and q(i,j) is the rate scaler, r(i,j) are psychoadaptively chosen (or even user chosen) quantization weights, and u(i,j) are deblurring weights.
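Because steps C, D and E are all elementwise (point) multiplies, they collapse into a single precomputed table. A small illustration with hypothetical factor values of my own choosing:

```python
import random

random.seed(3)
n = 8
# Hypothetical per-coefficient factors: rate scaler q, psychoadaptive weights r, deblurring u
q = [[random.uniform(0.5, 2.0) for _ in range(n)] for _ in range(n)]
r = [[random.uniform(0.5, 2.0) for _ in range(n)] for _ in range(n)]
u = [[random.uniform(0.5, 2.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-100, 100) for _ in range(n)] for _ in range(n)]  # step-B output

# Precompute all(i,j) = q(i,j) r(i,j) u(i,j) once
all_ = [[q[i][j] * r[i][j] * u[i][j] for j in range(n)] for i in range(n)]

# Three separate point-multiply passes (steps C, D, E) ...
e_three = [[b[i][j] * q[i][j] * r[i][j] * u[i][j] for j in range(n)] for i in range(n)]
# ... give the same result as one pass with the combined table (equation 8)
e_one = [[b[i][j] * all_[i][j] for j in range(n)] for i in range(n)]

err = max(abs(e_three[i][j] - e_one[i][j]) for i in range(n) for j in range(n))
assert err < 1e-9
```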
Similarly steps H and I can be combined.
This effectively means that the rate scaling, adaptive weighting and deblurring functions are provided with no additional computational overhead. As noted above, this approach is not applicable with the "faster" algorithms like Lee's.
B] Since Chen's algorithm operates with any parameters a, b, c and r, we will choose values which offer quality and compression similar to the DCT but which lead to high-speed multiplication.
The following parameters are reasonably close to those of the DCT but much more efficient computationally:

    a = 5.0
    b = 2.5
    c = 1.5
    r = 0.75

Multiplication is now replaced with much simpler arithmetic. Multiply-by-5 for example becomes copy; shift-left-2; add. Multiply by 1.5 becomes copy; shift-right-1; add.
Alternatively, the inverse numerator of a rational multiplier can be factored into the combined multiplier [q]. Thus the multiply by 2.5 can become multiplies by 5 and 2 for affected and unaffected terms respectively.
With this latter idea, handling of parameter r = 0.75 in the straightforward Chen algorithm requires 96 multiplies by 4 and 32 multiplies by 3. With the Wu-Paolini improvement in a 2D implementation, an entire multiply stage is eliminated and this becomes 36 multiplies by 16, 24 multiplies by 12, and 4 multiplies by 9. (The inverse transform uses 36 multiplies by 9, 24 multiplies by 6, and 4 multiplies by 4.)

For a cost of computational speed, parameter values even closer to the cosine transform can be selected. The substitutions b = 12/5 and/or r = 17/24 are possible. Another interesting alternative is:
    rRow = 0.708333 (17/24)
    rCol = 0.7 (7/10)

Here slightly different transforms (different parameter r) are used for the rows and columns. This is done to simplify the multipliers derived in the Wu-Paolini method. Here that method yields 36 multiplies by 15, 12 multiplies by 85/8, 12 multiplies by 21/2 and 4 multiplies by 119/16. (The inverse transform uses 36 multiplies by 119/16, 12 multiplies by 85/16, 12 multiplies by 21/4, and 4 multiplies by 15/4.)

In the fashion just described all multiplies have been rendered fast and inexpensive except for the combined multiplier [q] in the compressor and the combined multiplier [q] in the decompressor. Each of these requires one multiply per transform element. The latter is simplified in that the majority of transform coefficients will be zero, and most of the non-zero coefficients will be integers quite close to zero which can be specially handled.
C] A further technique is used to reduce the computational cost of the combined [q] multiplier in the compressor. Since the rate scaler is actually an arbitrary value, it will be adjusted point-by-point to give all of the [q] matrix elements computationally simple values, e.g. powers-of-two. These 64 adjustments need to be performed only once (after the rate scaler and deblurring filters are specified).
For example, if an element (C) of the combined multiplier and the corresponding decompression multiplier, element (D), happen to be

    C = 0.002773
    D = 0.009367

the proximity

    C' = 3/1024 = 0.002930

might be found and used to simplify the multiplication. This gives

    C' = 3/1024
    D' = D C / C' = 0.008866

5.0 Detailed Description of the (Primary) Process

Notes:
a) In the quantized transform space, it is convenient and efficacious to take the non-zero steps of the "AC" coefficient quantization to be of constant width (w), and to take the zero step to be of width (w x q).
Moreover, q = 2 is arithmetically convenient and is near-optimal for quality across a broad range of compression factors. In the description we take q = 2 ("double-width zero"), although the invention comprises any possible q.
b) The following algorithm is designed for limited-precision two's-complement binary integer arithmetic except for the intermediate determinations in steps (2), (4) and (8) which are done once with high precision arithmetic.
Furthermore, and with the additional exception of step (9.1), the integer multiplies described here are optimized for cost and speed. For example, consider the multiplies by

    Nrr Nrc = Drr' Drc' = 1.75 x 4.25 = 7.4375

By choosing the identity 7.4375 = (8-1)(1 + 1/16) the multiplication is done efficiently with shifts and adds.
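The identity checks out directly, and the resulting shift-and-add form can be sketched for integers. This is an illustrative sketch of mine, not the patent's datapath; it is exact when the low 4 bits of the x7 intermediate are zero:

```python
assert 1.75 * 4.25 == 7.4375
assert (8 - 1) * (1 + 1 / 16) == 7.4375

def mul_7_4375(x):
    t = (x << 3) - x     # x * (8 - 1)
    return t + (t >> 4)  # t * (1 + 1/16)

assert mul_7_4375(16) == 119      # 7.4375 * 16
assert mul_7_4375(32) == 238      # 7.4375 * 32
```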
c) The deblurring multiplies are shown here in step 8 but they should usually be done in step 4, if at all. In many applications, the decompressor does not "know" how or whether to deblur the image. Note that the best values of Thr() depend on the input device and deblurring method.
A recommended approach is for the value m(i,j) (see step 8) to be calculated at compression time (step 4) and transmitted or stored as part of the compressed image.
d) There are several obvious ways to parallelize, time-sequence or interleave the computations which follow.
The preferred method for a given hardware architecture is straightforward.
5.1 Example Pseudo Code Embodiment

This part of the application is essentially an embodiment of the invention explained in text and pseudocode. There are several sections, including parametrization, calculation of all(i,j) as in equation 8 above, execution of the main body of the forward GCT, calculation of the inverse all(i,j), and execution of the main body of the inverse GCT.
1. The parameters a, b, c, and r are shown above. Notice that there is an r value for both rows and columns. Although the 2D GCT is a separable transform and can be executed in two 1D passes, there is no restriction that requires it to be symmetric. Therefore, the scaling factors can be asymmetric as shown.
The equations for the numerators N and denominators D show possible combinations of numerator and denominator that can equal the above values. The designer of the GCT implementation has leeway in the actual values used in the adder array. The choices of values are corrected for at the final multiplication stage.
Choose

    tan(7pi/16) ~= a    = Na / Da
    tan(6pi/16) ~= b    = Nb / Db
    tan(5pi/16) ~= c    = Nc / Dc
    sqrt(0.5)   ~= rRow = Nrr / Drr
    sqrt(0.5)   ~= rCol = Nrc / Drc
    0.5 / rRow   = rRow' = Nrr' / Drr'
    0.5 / rCol   = rCol' = Nrc' / Drc'

as the parameters of the Generalized Chen Transform as discussed above. The "numerators" N and "denominators" D need not be integral, though they will be chosen for computational convenience. Among several useful possibilities is:

    Na = 5        Da = 1
    Nb = 3        Db = 1.25
    Nc = 1.5      Dc = 1
    Nrr = 1.75    Drr = 2.5
    Nrc = 4.25    Drc = 6
    Nrr' = 1.25   Drr' = 1.75
    Nrc' = 3      Drc' = 4.25

but again, the invention comprises all rational approximations to the above tangents. This calculates the normalization scalers needed.
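The quality of these rational approximations, and the reciprocal relations rRow' = 0.5/rRow and rCol' = 0.5/rCol, can be checked numerically. The tolerances below are my own:

```python
import math

pi = math.pi
Na, Da = 5, 1
Nb, Db = 3, 1.25
Nc, Dc = 1.5, 1
Nrr, Drr = 1.75, 2.5
Nrc, Drc = 4.25, 6
Nrr_, Drr_ = 1.25, 1.75   # Nrr', Drr'
Nrc_, Drc_ = 3, 4.25      # Nrc', Drc'

assert abs(Na / Da - math.tan(7 * pi / 16)) < 0.03    # tan(7pi/16) ~ 5.027
assert abs(Nb / Db - math.tan(6 * pi / 16)) < 0.02    # tan(6pi/16) ~ 2.414
assert abs(Nc / Dc - math.tan(5 * pi / 16)) < 0.01    # tan(5pi/16) ~ 1.497
assert abs(Nrr / Drr - math.sqrt(0.5)) < 0.01         # rRow = 0.7
assert abs(Nrc / Drc - math.sqrt(0.5)) < 0.002        # rCol ~ 0.70833
assert abs(0.5 / (Nrr / Drr) - Nrr_ / Drr_) < 1e-12   # rRow' = 0.5 / rRow
assert abs(0.5 / (Nrc / Drc) - Nrc_ / Drc_) < 1e-12   # rCol' = 0.5 / rCol
```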
2. Also write

    U(0) = U(4) = sqrt(0.5)
    U(1) = U(7) = 1/sqrt(Na Na + Da Da)
    U(2) = U(6) = 1/sqrt(Nb Nb + Db Db)
    U(3) = U(5) = 1/sqrt(Nc Nc + Dc Dc)

3. Let

    i          be an index on {0,1,2,3,4,5,6,7} denoting vertical position (in the image space) or sequence of vertical change (in the transform space),
    j          be an index on {0,1,2,3,4,5,6,7} denoting horizontal position (in the image space) or sequence of horizontal change (in the transform space),
    Debl(i,j)  denote the deblurring factors, or Debl() = 1 when not deblurring,
    Thr(i,j)   denote the inverse psychoadaptive weights, e.g. as recommended by the CCITT,
    M          denote the rate scaler; here M = 1 (approx) for typical compression rates,
    v(i,j)     denote the several luminance values in the image (spatial) space,
    L(i,j)     denote the transformed luminance values in the transform (compressed) space,
    S          be an arbitrary small integer denoting the arithmetic precision used in the reconstruction.
The psychoadaptive weights 1/Thr(i,j) should be reoptimized for each set of parameters of the Generalized Chen Transform.

However the parameters given in step (1) above are sufficiently close to the CCITT parameters that the same matrix Thr() is optimal.
4. Here g(i,j) is equivalent to all(i,j). Iterate through the 64 transform locations (i,j), solving for k(i,j) and s(i,j) to satisfy:

    g(i,j) ~= 2^(-s(i,j)) M U(i) U(j) k(i,j) Zr(i) Zc(j) Thr(i,j)

with the right side as close to g(i,j) as possible, s(i,j) an integer, and where

    g(i,j) = 1.0, k(i,j) in {1,3,5,7,9}   for i+j > 4
    g(i,j) = 0.9, k(i,j) in {1,3,5}       for i+j = 4
    g(i,j) = 0.7, k(i,j) = 1              for i+j < 4

    Zr(i) = 1,       when i = 0, 1, 2 or 3
    Zr(i) = Drr,     when i = 4, 5, 6 or 7
    Zc(j) = 1,       when j = 0, 1, 2 or 3
    Zc(j) = Drc,     when j = 4, 5, 6 or 7
    Zr'(i) = 1,      when i = 0, 1, 2 or 3
    Zr'(i) = Drr',   when i = 4, 5, 6 or 7
    Zc'(j) = 1,      when j = 0, 1, 2 or 3
    Zc'(j) = Drc',   when j = 4, 5, 6 or 7

The factors g(i,j) are intended to make the quantization bias independent of the choice of step size.
Execution of the Forward GCT

Step 5 is the pseudocode execution of the forward transform.

The following steps perform a 2D transform in an interleaved form.

5. Iterate through the image performing the following on each eight-by-eight block of luminance values v(.,.):
5.1 Prepare the values H(i, o) M(i, 1) (i, 2) M(i, 3) E(i, 4) Ms(i) M6(i) = V(i, 0) + V(i,7) = V(i, I) + V(i,6) = V(i, 2) V(i,5) = vCi, 3) + V(i,4) = V(i, 3) - V(i,4) = V(, 2) - V(i, S) = V(i, I) - V(i, 6) M(i, 5) = M6(i) M5(i) M(i, 6) = M6(i) - MS(i) M(i, 7) = v(i, O) - v(i,7) for i = 0, 1,2,...,7 I0 !5 5.2 Prepare he values H(0, j) H(l, j) H(2, j) H(3, j) H(4, j) HS(j) H6Cj) H(S, j) H(6, j) H(7, j) = M[0, j) + M(7,j) = M(l, j) + MC6,j) = MC2, j) + M(5, j) = M(3, j) + M(4, j) = MC3, j) - M(4, j) = MC2, j) -MC5, j) = M(l, j) - M(6, j) = H6(j) + HS(j) = H6(j)= HS(j) = MC0, j) MC7,j) for j = 0,1,2,...,7 5.3 Multiply each H(i,j) by (if i = 0, 2, 3 or 4:) Nrc Drc i (no action) if j = 5 or 6 if j = 4 or 7 if j = 0,; I, 2 or 3 (if i = 4 or 7:) Drr Nrc Drr Drc Drr (if i = 5 or 6:) Nrr Nrc Nrr Drc Nrr if j = 5 or 6 if j = 4 or 7 if j = 0, i, 2 or 3 if j = 5 or 6 if j = 4 or 7 if j = 0,!, 2 or 3 5.4 Prepare the values E(0, j) = H(0,j) + H(3,j) E(I, j) = H(7,j) + H(5,j) .., It It II II II II II II MIMMMM , .oo Nhl 0+1+1+1 1 II II II II II II II II .,...' O II I'- h 0 H..H.H..H.d.H + I "J" " ' I I + OOrl H-H.H.H.,H.H,H II II II II II II It II g.g+,+,+, r I.I......
.DTD:
H..H...
.DTD:
"rl'rt II II II II II II II II H..H'H''.' t-.
.DTD:
o II ri h 0 q -I u) I, I u) 1 4J X u ( KI u 0 4j 3 "" 1.} u) O R O iJ.dI:::: la -I o -I (o 0,,--I Q) o} g h la, [-t h-, I/) u o o 4 I-.,-I 4J U),---I I::: n -t.rt -t N 1 0 "el or'l h I11 0 I'd 4-1 14 A1 = X0 + X7 A2 " X3 + X4 A3 = X2 + X5 A4 = X1 + X6 A5 = X0 - X7 A6 = Xl - X6 A7 = X2 - X5 A8 = X3 - X4 B1 A1 - A2 B2 A1 + A2 B3 = A3 A4 B4 = A4 - A3 B5 = A6 + A7 B6 = A6 - A7 C1 = 1.25 B1 C2 " 3 B1 C3 = 1.25 B4 C4 = 3 B4 C5 = 1.5 A5 C6 = 1.0525 B5 C7 = 1.0625 B6 C8 = 1.5 A8 i0 D1 = C5 + C6 D2 = C5 - C6 D3 = C7 + C8 D4 = C7 - C8 E1 = 2.5 D1 E2 = 1.25 D2 v-3 = 2.5 D3 E4 = 1.5 D4 Y0 " B2 + B3 Y1 = E1 + (0.5 D3) 2 = C2 + C4 Y3 " E2 + D4 Y4 = B2 - B3 Y5 = D2 - E Y6 = C1 - C4 Y7 = (0.5 DI) - E4 Note that all the multiples in these e uuations are executed with shit and add operations.
To relate this to the matrix form of the GCT, the vector point 6 is demonstrated as an example:

Y6 = C1 - C4 = (1.25 B1) - (3 B4)
   = 1.25 (A1 - A2) - 3 (A4 - A3)
   = 1.25 ((X0 + X7) - (X3 + X4)) - 3 ((X1 + X6) - (X2 + X5))
   = 1.25 X0 - 3 X1 + 3 X2 - 1.25 X3 - 1.25 X4 + 3 X5 - 3 X6 + 1.25 X7

Y6 / 1.25 = X0 - 2.4 X1 + 2.4 X2 - X3 - X4 + 2.4 X5 - 2.4 X6 + X7
          = [ 1  -b  b  -1  -1  b  -b  1 ],  where b = 2.4

This is the sixth row of the matrix P in the equation. Note that the division by 1.25 is a scaling factor that is collected in the rate scaler matrix.
The row data of an 8x8 pixel block is passed through this adder array. The resulting one-dimensional frequency components are transposed and passed through the same array again.

6. After step 5.5, in each image subblock and for each of the 64 locations (i,j), using k(i,j) and s(i,j) from step (4), prepare the value

L(i,j) = G(i,j) k(i,j) 2^(-s(i,j))

but if this is negative (or i = j = 0), add 1 to it. This result is the transform coefficient L(i,j).
Comments about Step 6:

The calculations here are simple because

- k(i,j) is always 1, 2, 5, 7 or 9 and is usually 1.

- multiplication by 2^(-s(i,j)) is simply a right-shift (or perhaps a left-shift if M was chosen very large). Arithmetic right-shifts always round downwards. Rounding towards zero is actually desired; hence the clause "if (negative) add 1".
The addition of 1 when i = j = 0 relies on v(i,j) ≥ 0 and is just a device to simplify the statement of step (9.1) below.
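A sketch of step 6 in software (the function name is ours; the rule of adding 1 to a negative shifted result is applied literally):

```python
def quantize_coefficient(G, k, s):
    # Step 6: L = G * k * 2**(-s).  Python's >> on negative integers
    # rounds toward minus infinity, like the arithmetic shifter described
    # here, so the "if negative, add 1" correction nudges results toward
    # zero as desired.
    L = (G * k) >> s
    if L < 0:
        L += 1
    return L
```

For example, a product of -7 shifted right by one becomes -4 and is corrected to -3, the round-toward-zero result.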
7. Encode, store and/or transmit the values L(i,j).
Eventually they will be retrieved and the image reconstructed with the following steps.
8. This is the inverse version of step (4). Iterate through the 64 transform locations (i,j), solving for m(i,j) as the nearest integer to

m(i,j) = U(i) U(j) Zr(i) Zc(j) Deblur(i,j) 2^(S-s(i,j)) / (Zr'(i) Zc'(j) k(i,j))

where s(i,j) and k(i,j) are solved in step (4) above, and where the expressions "Z" are defined in step (4).
Also choose A(i,j) as the nearest integer to

A(0,0) = 2^(S-2) - 0.5 m(0,0) Drc' Drr'
A(i,j) = m(i,j) (25 - i - j) / 64    for i ≠ 0 or j ≠ 0

Comments about Step 8:
The values m(i,j) may have been precalculated above in step (4) and transmitted with the compressed image. This is not needed for A(i,j), which depend only on constants and m(i,j). In applications where the rate scaler and deblurring weights are fixed, the values m(i,j) and A(i,j) will be constant.
The factor 2^S reflects extra bits of precision which will be subsequently removed by arithmetic right-shifts in steps (9.2) and (10).
The adjustment to A(0,0) corrects a rounding bias to allow use of the outputs below without rounding correction. As given here, A(0,0) relies on the addition of 1 to L(0,0) in step (6). The interpolation "(25 - i - j) / 64" is heuristic but is approximately optimal in a mean-squared-error sense.
9. Once again, the 2-D interleaved version.
Iterate through the transformed image, performing the following on each eight-by-eight block of transformed luminance values L(,) as derived in step (6) above:
9.1 Prepare the values

E(i,j) = L(i,j) m(i,j) + A(i,j)    for L(i,j) > 0
E(i,j) = L(i,j) m(i,j) - A(i,j)    for L(i,j) < 0
E(i,j) = 0                         for L(i,j) = 0

for each (i,j), i = 0,1,2,...,7 and j = 0,1,2,...,7.
The expression L(,) refers to the group of luminance transform coefficients from a block and is equivalent to using the notation L(i,j) for all i and all j.
A(0,0) must always be added. The present invention also covers the case where the test L(0,0) > 0 is not made and steps (6) and (8) above are (optionally) simplified.
In practice, small multiplications, e.g. -11 < L(i,j) < 11, should be recognized as special cases to save the computational expense of a multiply.
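Step 9.1 can be sketched as follows (a hypothetical rendering with names of our choosing; in hardware the small-|L| special cases just mentioned would replace the multiply with adds):

```python
def descale_coefficient(L, m, A):
    # Step 9.1: E = L*m + A for positive L, L*m - A for negative L,
    # and 0 when L is 0.  A carries the rounding-bias correction
    # prepared in step 8.
    if L > 0:
        return L * m + A
    if L < 0:
        return L * m - A
    return 0
```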
9.2 (If convenient to reduce the cost of the semiconductor apparatus, right-shift the numbers E(i,j) by an arbitrary number of positions S1. Note that these shifts are "free" in some implementations of the method. In implementations where the shift is not free, we may choose to omit it when E(i,j) is zero. Or we may choose to eliminate all shifts by setting S1 = 0.)

9.3 Once again in the two-dimensional form, prepare the values

F(0, j) = E(4, j) + E(0, j)
F(4, j) = E(0, j) - E(4, j)
F(2, j) = Db E(6, j) + Nb E(2, j)
F(6, j) = Db E(2, j) - Nb E(6, j)
F(1, j) = Da E(7, j) + Na E(1, j)
F(7, j) = Da E(1, j) - Na E(7, j)
F(3, j) = Dc E(5, j) + Nc E(3, j)
F(5, j) = Dc E(3, j) - Nc E(5, j)

H(0, j) = F(0, j) + F(2, j)
H(1, j) = F(4, j) + F(6, j)
H(2, j) = F(4, j) - F(6, j)
H(3, j) = F(0, j) - F(2, j)
H(4, j) = F(7, j) - F(5, j)
H5(j)   = F(7, j) + F(5, j)
H6(j)   = F(1, j) - F(3, j)
H(5, j) = H6(j) - H5(j)
H(6, j) = H6(j) + H5(j)
H(7, j) = F(1, j) + F(3, j)

for j = 0,1,2,...,7

9.4 Prepare the values

G(i, 0) = H(i, 4) + H(i, 0)
G(i, 4) = H(i, 0) - H(i, 4)
G(i, 2) = Db H(i, 6) + Nb H(i, 2)
G(i, 6) = Db H(i, 2) - Nb H(i, 6)
G(i, 1) = Da H(i, 7) + Na H(i, 1)
G(i, 7) = Da H(i, 1) - Na H(i, 7)
G(i, 3) = Dc H(i, 5) + Nc H(i, 3)
G(i, 5) = Dc H(i, 3) - Nc H(i, 5)

M(i, 0) = G(i, 0) + G(i, 2)
M(i, 1) = G(i, 4) + G(i, 6)
M(i, 2) = G(i, 4) - G(i, 6)
M(i, 3) = G(i, 0) - G(i, 2)
M(i, 4) = G(i, 7) - G(i, 5)
M5(i)   = G(i, 7) + G(i, 5)
M6(i)   = G(i, 1) - G(i, 3)
M(i, 5) = M6(i) - M5(i)
M(i, 6) = M6(i) + M5(i)
M(i, 7) = G(i, 1) + G(i, 3)

for i = 0,1,2,...,7

9.5 Multiply each M(i,j) by

(if i = 0, 1, 2 or 3:)
Nrc'            if j = 5 or 6
Drc'            if j = 4 or 7
1 (no action)   if j = 0, 1, 2 or 3

(if i = 4 or 7:)
Drr' Nrc'       if j = 5 or 6
Drr' Drc'       if j = 4 or 7
Drr'            if j = 0, 1, 2 or 3

(if i = 5 or 6:)
Nrr' Nrc'       if j = 5 or 6
Nrr' Drc'       if j = 4 or 7
Nrr'            if j = 0, 1, 2 or 3

9.6.
Prepare the values

Z(i, 0) = M(i, 0) + M(i, 7)
Z(i, 1) = M(i, 1) + M(i, 6)
Z(i, 2) = M(i, 2) + M(i, 5)
Z(i, 3) = M(i, 3) + M(i, 4)
Z(i, 4) = M(i, 3) - M(i, 4)
Z(i, 5) = M(i, 2) - M(i, 5)
Z(i, 6) = M(i, 1) - M(i, 6)
Z(i, 7) = M(i, 0) - M(i, 7)

for i = 0,1,2,...,7

9.7 Prepare the values

Y(0, j) = Z(0, j) + Z(7, j)
Y(1, j) = Z(1, j) + Z(6, j)
Y(2, j) = Z(2, j) + Z(5, j)
Y(3, j) = Z(3, j) + Z(4, j)
Y(4, j) = Z(3, j) - Z(4, j)
Y(5, j) = Z(2, j) - Z(5, j)
Y(6, j) = Z(1, j) - Z(6, j)
Y(7, j) = Z(0, j) - Z(7, j)

for j = 0,1,2,...,7

10. After step 9.7, in each image subblock and for each of the 64 locations (i,j), prepare the value

V(i, j) = Y(i, j) 2^(S1 - S)

where S and S1 are the arbitrary integers defined in steps (8) and (9.2) above. Again, the multiplication is actually a right-shift.

11. Depending on system particulars it may now be necessary to perform range checking. For example, if the admissible range of luminance is 0 ≤ V(i, j) ≤ 255, then values of V(i, j) less than zero or greater than 255 should be replaced with 0 and 255 respectively.
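The range check of step 11 is a plain clamp; a minimal sketch:

```python
def clamp_luminance(v, lo=0, hi=255):
    # Step 11: replace out-of-range reconstructed values with the
    # nearest admissible luminance (0 or 255 for 8-bit data).
    return max(lo, min(hi, v))
```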
The values V(i, j) are now the reconstructed image luminance values.
5.2 Discussion of Secondary Processes

It is usual to supplement the primary process with additional measures to improve compression or image quality.
After step (10), image accuracy can be improved by iterating through all pixel pairs V(8I + 7, j), V(8I + 8, j) and all pixel pairs V(i, 8J + 7), V(i, 8J + 8) (that is to say, adjacent pixels which were separated into separate image blocks) and respectively incrementing and decrementing their values v1, v2 by, for example,

(v2 - v1) / max(2, 11 - sqrt(M))

where M is the rate scaler used in step (4) and where the expression in the denominator is again just a convenient approximation to optimality.
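The pixel-pair adjustment can be sketched as below. The denominator shown, max(2, 11 - sqrt(M)), is our reading of the damaged formula in the source, and the text itself calls it only a convenient approximation to optimality:

```python
import math

def smooth_block_seam(v1, v2, M):
    # v1, v2 are adjacent pixels that fell into different 8x8 blocks;
    # M is the rate scaler of step (4).  Both pixels are nudged toward
    # each other by the same amount d.
    d = (v2 - v1) / max(2, 11 - math.sqrt(M))
    return v1 + d, v2 - d
```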
Before performing step (6), the subjective difficulty of the local image area can be classified, preferentially into one of three types, single-, double-, or quadruple-precision, with the code preface '0', '10', or '11' output respectively. The computation in step (6) is now replaced with

L(i,j) = G(i,j) k(i,j) 2^(P - s(i,j))

where P = 0, 1 or 2 for single-, double-, and quadruple-precision respectively. This is later compensated in step (9.2) where the added precision must be removed with an (increased) right-shift.
Unfortunately no very effective simple classification scheme has been found. We currently use a cumbersome scheme which derives the difficulty measure P from four sources:
a) P_left and P_up, the difficulty measures of neighboring image areas,
b) sum((i+j) G(i,j)^2) / sum(G(i,j)^2), the transform energy skew,
c) -G(0,0), the inverse mean luminance, and
d) max(sum_over_fixed_width(Histogram(v(i,j)))), the uniformity.
In step (7), the transform data L(,) to be stored or transmitted can be further reduced using an entropy encoding method. We use and recommend an elaboration of the CCITT zigzag run-and-template code with several default Huffman tables depending on the bit rate. For definiteness we detail an example of such in the following section.
5.3 Example Compressed File Format

A compressed image is represented by

1) Preface (image width, height, rate scaler M, etc.)
2) Pixel Block 0
   Pixel Block 1
   Pixel Block 2
   ...
   Pixel Block N-1
3) Postface (if any)

where each Pixel Block is represented by

1) Precision Code (as determined by the optional step above)
2) DC Coefficient Delta Code
3) AC Coefficient Code (repeated zero or more times)
4) End-of-Block Code

where each AC Coefficient Code is represented by

1) Nine-Zero Extension (repeated E times, E ≥ 0)
2) Run-template code denoting (R,T)
3) Sign of coefficient value (1 bit)
4) Absolute value of coefficient with MSB deleted (T bits)

where R + 9E is the number of zero-valued coefficients preceding this one in "zigzag" order (a sequence based on the sum i+j) and where T is the bit-position of the most-significant bit (MSB) of the absolute value of the coefficient, for example T = 3 when the coefficient is 11 or -11:

bit position:  876543210
          11 = 000001011 (binary)
                    ^--- most significant bit

We will not detail a choice or encoding of the DC Coefficient Delta, but we do give an example Huffman code useful at higher bitrates for the AC Run-and-Template Codes.
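The template computation can be sketched as follows (the function name is ours; Python's `int.bit_length` stands in for the priority encoder a hardware coder would use):

```python
def template_and_mantissa(coefficient):
    # T is the bit position of the most-significant bit of |coefficient|;
    # the T bits below that MSB are transmitted as the "absolute value
    # with MSB deleted".
    a = abs(coefficient)
    T = a.bit_length() - 1      # e.g. 11 = binary 1011 -> T = 3
    mantissa = a - (1 << T)     # delete the (always-1) MSB
    return T, mantissa
```

This reproduces the worked example above: a coefficient of 11 or -11 yields T = 3, with the three remaining bits 011 sent as the mantissa.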
Code            R       T
0xx             0       w
100x            0       4+w
111110          0       6
1111110{0}n1    0       7+n
1010            1       0
10110           1       1
10111           2       0
1100xx          1+w     max(0,2-w)
11010{0}n1xx    1+w     n+1+max(0,2-w)
11011xx         5       0
111100{0}n1xx   7+w     1+n
1111111         Reserved
1111011         Nine-Zero Extension
1110            End-of-Block Code

where {0}n denotes n consecutive zeros, n = 0,1,2,3,...
      xx   denotes 2 bits interpreted as w = 0, 1, 2, or 3
      x    denotes 1 bit interpreted as w = 0 or 1

5.4 128-Point and 256-Point Transforms

The foregoing method can be used with a larger Generalized Chen Transform, 8-by-16 or 16-by-16. The method for further generalizing Chen's Transform should be clear upon noting that the 1-D 16-point GCT is given (with the rows in "butterfly order" and without the necessary normalizing post-multipliers) by

GCT_16(a, b, c, e, f, g, h, r, s, t) =

| GCT_8(a, b, c, r)          GCT_8(a, b, c, r)        |
| GQ8(e, f, g, h, r, s, t)  -GQ8(e, f, g, h, r, s, t) |

where GCT_8(a, b, c, r) =

| 1     1      1      1     1      1      1      1  |
| 1    -1     -1      1     1     -1     -1      1  |
| b     1     -1     -b    -b     -1      1      b  |
| 1    -b      b     -1    -1      b     -b      1  |
| a    ar+r   ar-r    1    -1     r-ar  -ar-r   -a  |
| c    r-cr  -cr-r   -1     1     cr+r   cr-r   -c  |
| 1   -cr-r   cr-r    c    -c     r-cr   cr+r   -1  |
| 1    r-ar   ar+r   -a     a    -ar-r   ar-r   -1  |

and where GQ8(e, f, g, h, r, s, t) =

| e    es+t   rs(e-1)+rt(e+1)    r(e-1)    r(e+1)    rs(e+1)+rt(1-e)    es-t    1  |
| 1   -s-ht   rs(h-1)+rt(-1-h)   r(h+1)    r(1-h)    rs(-1-h)+rt(1-h)   hs-t    h  |
| g    s-gt   rs(-g-1)+rt(g-1)   r(-1-g)   r(1-g)    rs(g-1)+rt(g+1)    gs+t    1  |
| 1   -fs+t   rs(f+1)+rt(f-1)    r(-f-1)   r(f-1)    rs(f-1)+rt(-1-f)  -s-ft    f  |
| f    s+ft   rs(f-1)+rt(-f-1)   r(1-f)    r(-f-1)   rs(-f-1)+rt(1-f)  -fs+t   -1  |
| 1   -gs-t   rs(g-1)+rt(g+1)    r(g-1)    r(-1-g)   rs(g+1)+rt(1-g)    s-gt   -g  |
| h   -hs+t   rs(-h-1)+rt(1-h)   r(h-1)    r(h+1)    rs(1-h)+rt(h+1)   -s-ht   -1  |
| 1    s-et   rs(e+1)+rt(1-e)    r(1-e)    r(e+1)    rs(1-e)+rt(-1-e)   es+t   -1  |

Here, the "true cosine" parameters are

e = tangent 15pi/32 == 10.1532
a = tangent 14pi/32 == 5.0273
f = tangent 13pi/32 == 3.2966
b = tangent 12pi/32 == 2.4142
g = tangent 11pi/32 == 1.8709
c = tangent 10pi/32 == 1.4966
h = tangent 9pi/32 == 1.2185
r = cosine 8pi/32 == 0.7071
t = cosine 12pi/32 == 0.3827
s = cosine 4pi/32 = t b

The parameters we use are

e = 10
a = 5
f = 3.25
b = 2.4
g = 1.875
c = 1.5
h = 1.25
r = 17/24 == 0.708333
t = 5/13 == 0.384615
s = t b = 12/13

The inverse of GQ8(e, f, g, h, r, s, t) is the transpose of GQ8(e, f, g, h, 1/2r, t', b t') where b = s / t and t' = 1 / (t + t b b).

Example Matrices

Transpose of the Matrix TP

The Cosine Transform (a = 5.02734, b = 2.41421, c = 1.49661, r = 0.70711):
0.1768  0.1768  0.1768  0.1768  0.1768  0.1768  0.1768  0.1768
0.2452  0.2079  0.1389  0.0488 -0.0488 -0.1389 -0.2079 -0.2452
0.2310  0.0957 -0.0957 -0.2310 -0.2310 -0.0957  0.0957  0.2310
0.2079 -0.0488 -0.2452 -0.1389  0.1389  0.2452  0.0488 -0.2079
0.1768 -0.1768 -0.1768  0.1768  0.1768 -0.1768 -0.1768  0.1768
0.1389 -0.2452  0.0488  0.2079 -0.2079 -0.0488  0.2452 -0.1389
0.0957 -0.2310  0.2310 -0.0957 -0.0957  0.2310 -0.2310  0.0957
0.0488 -0.1389  0.2079 -0.2452  0.2452 -0.2079  0.1389 -0.0488

Related Chen Transform (a = 5.0, b = 2.4, c = 1.5, r = 0.70833):

0.1768  0.1768  0.1768  0.1768  0.1768  0.1768  0.1768  0.1768
0.2451  0.2059  0.1373  0.0490 -0.0490 -0.1373 -0.2059 -0.2451
0.2308  0.0962 -0.0962 -0.2308 -0.2308 -0.0962  0.0962  0.2308
0.2080 -0.0485 -0.2427 -0.1387  0.1387  0.2427  0.0485 -0.2080
0.1768 -0.1768 -0.1768  0.1768  0.1768 -0.1768 -0.1768  0.1768
0.1387 -0.2427  0.0485  0.2080 -0.2080 -0.0485  0.2427 -0.1387
0.0962 -0.2308  0.2308 -0.0962 -0.0962  0.2308 -0.2308  0.0962
0.0490 -0.1373  0.2059 -0.2451  0.2451 -0.2059  0.1373 -0.0490

6.0 Description of Apparatus
Now that a detailed discussion of the present invention has been provided, apparatus incorporating aspects of the present invention will now be described.
Through the following, "point" is used to denote a scalar register or data path of arbitrary precision, typically 8 to 12 bits. A method for determining appropriate precision is known [6].
In the software method, transform stages are combined and the Wu-Paolini enhancement was adopted. For the semiconductor apparatus of the preferred embodiment it is more convenient simply to provide two 8-point transform units, one each for the vertical and horizontal senses.
It is necessary to provide a 64-point shift array between the vertical and horizontal transforms, and similar buffering between the transform section and the coding section.
Although the present invention includes a monochromatic apparatus and/or separate apparatuses for compression and decompression, a preferred embodiment (Fig. 7) includes both a compressor (Fig. 1a) and a decompressor (Fig. 1b) which operate on tricolor data.
Data is admitted to the compressor (Fig. 2a) in vectors of 8 pixels which are further arranged in lexicographic order into blocks of 64 pixels. Processing of blocks is pipelined (Fig. 2b). A pixel input to the compressor comprises "R" (red), "G" (green) and "B" (blue) scalars. These are immediately transformed to a luminance-chrominance space. (The reasons for such a transform are well known.) The transform can use arbitrary fixed or programmable coefficients (Fig. 3a) or can be "hard-wired" to simple values in a dedicated application (Fig. 3b). The transform space is denoted here as XYZ, but any linear form of the tricolor input may be used, perhaps the CCITT standard: (Y, R-Y, B-Y). The three values X, Y and Z are then each, in effect, delivered to separate monochrome compressors. The decompressor uses the same or similar circuitry as in Fig. 3, except that now an XYZ vector is transformed to an RGB vector.
The values Y, X and Z are then input to three shift registers (Fig. 5) to await delivery to the first transform unit. The transform unit operates in (2 + 2/3) pixel times, so some of the data must be delayed as shown. The labeling "XYZ" is a bit unfortunate; optimized coding methods require that luminance ("Y") be processed first.
During decompression, the XYZ skew problem is reversed. Note that 5 points of registers are saved in the preferred embodiment by reversing the usage of the Y- and Z- shift registers during decompression.
Referring to Fig. 1a, the major divisions of the compressor include an input section (1, 2) which transforms the input to XYZ space and buffers it for subsequent transfer to the transform unit (3). For each eight pixel times, the transform 1 unit must be cycled three times (once for each of X, Y and Z data). The output of transform 1 is placed into the shift array (4) where it is retained until the 8x8 pixel block has been completely read. The transform 2 unit (5, 6) operates on the previously read pixel block, again cycling three times in each eight pixel times, and provides data to the Coder Input Buffer (7, 8). The Coder (9, 10, 11) is also shared among the three color coordinates, but an entire luminance block will be coded without interruption, followed by each of the chrominance blocks. If the processing of these three blocks cannot be completed within 64 pixel times, timing and control logic will hold up the pixel clock to the external input circuitry.

The storage areas (Input Shift Register [2], Shift Array [4], and Coder Input Buffers [7, 8]) must be triplicated for the three colors, but the computation units [3, 5, 6, 9, 10, 11] are shared (time-multiplexed) among the Y-, X- and Z-data.
The Coder [9, 10, 11], the Coder Input Buffer [7, 8], code programming [12, 13, 14] and timing and control logic [not shown] may follow existing art or practice. Similarly, the method for time-multiplexing three colors through a single circuit is well known. The 3-point transform section [1] (Fig. 3) and Shift Registers [2] (Fig. 5) are also known.
The scaler [6] (Fig. 1) uses the Quantization Multiplier invention described below. This has a straightforward realization. Given the definition of the Generalized Chen Transform and the appropriate parameters, design of the 8-Point Transformer (Fig. 8) is also straightforward.
The Shift Array (Fig. 6A) merits special discussion. Vertical (transformed) vectors from the current input pixel block are assembled while horizontal vectors from the previous pixel block are delivered to the horizontal transformer. Without special design, this would require 128 registers (64 for each of the current and previous block) since the points are used in an order different from the order received. However, this need is eliminated by shifting the data left-to-right during even-numbered pixel blocks and top-to-bottom during odd-numbered pixel blocks. The described shift array is two-directional. A four-directional shift array is preferred in some embodiments.

Fig. 6B shows in more detail the shift array aspect of Fig. 6A. In Fig. 6B, vectors are removed from the shift array at the bottom, one by one, and sent to the DCT8 [5] section of Fig. 1A. In the meantime, vertical vectors from the other DCT8 section are being input to the shift array at the top. Gradually, the old vectors are removed from the shift array and the shift array will completely fill up with vertical vectors from the next pixel block. For the next pixel block, the direction of data flow will be 90 degrees different from the direction of data flow in the previous pixel block. In that way, the horizontal vectors will be removed at the right of the shift array and sent to the DCT8, and new vertical vectors will come in at the left. Going on to block N + 2, another 90 degree rotation reverts back to the original form, and so on.

The decompressor (Fig. 1b) has a structure quite similar to the compressor (Fig. 1a) except that the direction of data flow is reversed. In a preferred embodiment, a single apparatus operates in two modes, either as a compressor or a decompressor.
Possible VLSI layouts (Figs. 4a, 4b) lead to different data flows for compression (Figs. 4c-a,b) and decompression (Figs. 4d-a,b). Other data flows are possible, such as the pipelined implementation described in Section 8.0 below. Note that the operation of the transform and shift array units has the same directional sense for both compression and decompression in one layout (Fig. 4b) but not the other (Fig. 4a). This is seen more clearly when the combined compressor/decompressor data flow (Fig. 7) is considered. When the two transform units are associated (Fig. 4a) with RGB and compressed data respectively, layout difficulties result unless a four-directional shift array is used. Hence, we associate (Fig. 4b) the two transform units respectively with the input and output sections of the shift array.
In one embodiment, the transform unit used in the compressor (Fig. 8A) utilizes 38 adders. Shifting right by one ("R1"), two ("R2") or four ("R4") positions, or left by one ("L1") position is done easily. The circuit depicted uses the parameters (a, b, c, r) = (5, 2.4, 1.5, 17/24). A realization with b = 2.5 would require only 36 adders in another embodiment (Fig. 8B).
A related circuit is required for the inverse transform unit in the decompressor. With careful use of 'output enable' signaling, it is possible to reuse most of the adders in the forward transformer. The realization of this is straightforward for someone skilled in the art.
The scaler uses a programmed RAM or ROM, and a system of implicit shifting, multiplexers and adders. This has a straightforward realization.
The descaler can be realized in various ways, preferably a small hardwired multiplier with RAM, accumulator, timing and control logic and small template cutoff. In a dedicated low-cost application, the descaler can be simplified by noting that deblurring weights are near-optimal across a broad range; hence, simple scaling can be used as in the scaler.
The descaler can be located either between the coder and its output buffer, or between the output buffer and a transformer, as indicated in Figs. 1 and 7.
The coder input buffer can be realized in various ways, including a cycle-sharing register reduction arrangement similar to the shift array. A more straightforward design uses a 384-by-10 bit RAM with a 64-by-7 bit ROM to provide the RAM addresses.

An example of a cycle of operation will now be described in connection with Figs. 1A and 1B.
In Fig. 1A, data enters the compressor as tri-color information: red, green and blue. It is immediately transformed to an alternate space, called XYZ. The three elements, X, Y and Z, each enter their own shift register.
From the shift register (Step 2) they go to an 8-point DCT unit. There could either be one 8-point DCT unit, which is multiplexed among the three colors, X, Y and Z, or they could each have their own individual DCT 8-unit.
Information then enters the 64-point shift array (4). There is an individual shift array for each color. From the shift array, block 4, it goes to another DCT unit, block 5, which is similar to block 3. The information then has to be scaled, which is an additional layer of adders and shifters.
The information is thus transformed both horizontally and vertically. The shift array conceptually rotates the data 90 degrees, so that it can be transformed in the other direction. After the data is scaled, it goes into another buffer, denoted blocks 7 and 8 (Z1 and Z2), to hold the data so that it can eventually be encoded and output from the chip (Z1, Z2 equals zigzag).
Conceptually, this is like the shift array, block 4, except now the data is not being rotated 90 degrees. Instead, it is being transmuted into the zigzag order, which is traditionally used for these things, and is used by the CCITT standard. The information is then presented to the run and template control unit, block 9, which will detect zeros and create runs for the zeros, and detect non-zeros and an estimate of the logarithm of the value, which is called the template. The combination run and template is looked up in a RAM or ROM, called the RT code, and is then output from the chip.
The mantissa, which is the significant bits of the transform coefficient, is also output from the chip. Since the mantissa and run and template code will be arbitrarily long (one bit, two bits, whatever) and the output from the chip will always be 16 bits, or 8 bits, 32 bits, whatever, block 11 (alignment) facilitates that. The other blocks shown in Fig. 1A are (optional) programming blocks 12, 13 and 14, which respectively allow you to set an arbitrary RGB to XYZ transform, arbitrary rate scalers and psychoadaptive weights, and an arbitrary modified Huffman code for the run and template.
Fig. 1B is very similar to 1A. The run and template code now has to be decoded into a run and template combination, and the necessary number of zeros has to be emitted.
In Fig. 1A, the scaler is a simple array of adders and shifters. In Fig. 1B, the descaler is implemented as a very small multiply in hardware.
Figure 9 shows a diagram for a non-pipelined implementation of the two-dimensional Generalized Chen Transform. A pipelined implementation is described in Section 8.0 below. Pixels come in at the top and are typically 8 bits wide. The pixels pass through a wide array of adders in the horizontal transform 10 with a data width of typically 128 bits. The output from the horizontal transform 10 passes through a transposition RAM 12 for rotating the information from horizontal to vertical. The data then passes into the vertical transform 16 which again comprises only adders (typically 128 bits wide). The output coefficients are finally reduced to a width of roughly sixteen bits and pass through a single multiplier 20, which according to the present invention is JPEG compatible.
Figure 10 shows a block diagram for a VLSI implementation according to the present invention. In Fig. 10, data comes in at component 40, is latched into the input latch 42, and passes through multiplexer 44 into the first half of the GCT transform 50 (which is an adder network as shown in Fig. 8). The second half of the adder network 60 is to the right of the midstage latches 54. The output is passed through MUX 62 to the transposition RAM 66 where the horizontal to vertical transformation is made.
The output of the transposition RAM 66 is fed back to the first stage of the GCT 50 in order to form the first half of the vertical transform in a time sharing or time slicing arrangement.
The output of GCT 50 is fed to the input of the second stage of the vertical transform, 60. Finally, the output of GCT 60 is taken through the output latch multiplexer 70 and is passed through Multiplier 74 and Rounder 76 to the zigzag order-arranger 80, the output of which is passed out as a twelve bit coefficient 84.
Still referring to Figure 10, the inverse transformation process according to the present invention will now be described briefly:
In Fig. 10, the 12 bit coefficients are input through Block 84 to the Y input of the Zigzag Order 80. The output of Zigzag Order 80 goes through Multiplier 74 and Rounder 76, which perform an inverse quantization process similar to that performed in the forward process. The output of Multiplier 74 is input to Latch 42, which is the first stage of the inverse transformation process.
From Latch 42, the inverse transformation process follows the same two-stage time multiplexed path that the forward process followed. The outputs appear at the Output Latches 70, the output of which are pixels that are rounded by Rounder 76, whose output is fed to Block 40 for output.
7.0 The Quantization Multiplier Invention

To compress the amount of data to be encoded, the frequency domain coefficients, F(i,j), are divided by the quantization value Q(i,j), a positive integer, and rounded to the nearest integer. (Note that Q(i,j) is used in this section to denote the quantization matrix, in contrast with previous sections.) Conversely, the inverse operation requires a multiply by Q(i,j). Large quantization values provide the most compression but result in the greatest degradation in the quality of an image (as measured by the mean squared error (MSE)). Small quantization values do not provide as much compression but do produce a smaller MSE.
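In outline, the forward and inverse quantization read as follows (a sketch with hypothetical helper names, not the hardware path; rounding here is to the nearest integer with ties away from zero, one reasonable reading of "rounded to the nearest integer"):

```python
def jpeg_quantize(F, Q):
    # Divide a frequency-domain coefficient by the quantization value
    # and round to the nearest integer, using only integer arithmetic.
    if F >= 0:
        return (2 * F + Q) // (2 * Q)
    return -((2 * (-F) + Q) // (2 * Q))

def jpeg_dequantize(L, Q):
    # The inverse operation is simply a multiply by Q.
    return L * Q
```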
The quantization factor Q(i,j) may be combined with the matrices of steps C, D and E of Table 1, denoted here as the forward scaling matrices Sf(i,j). Similarly the inverse of the quantization factor may be combined with the matrices of steps H and I, denoted here as the inverse scaling matrix Si(i,j).
Therefore the forward transform involves the application of Sf/Q (indices deleted for convenience), and the inverse transform involves the application of Si Q.
Since the forward operation is a divide, an inverse relation exists between the magnitude of Q and the mathematical resolution of Sf. For computational efficiency, integer divisions are generally performed by a multiplication and a shift. For instance, in 16 bit arithmetic, division by the integer k would be more expediently performed by a multiplication by 2^16/k = 65536/k, followed by a right shift by 16 bits.
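The reciprocal trick reads, in sketch form (with a truncated reciprocal the result can come out one low when x is an exact multiple of k; a real scaler biases the reciprocal or adds a rounding term):

```python
def divide_by_constant(x, k):
    # Division by constant k as one multiply and one 16-bit right shift:
    # precompute 65536 // k once, then each division costs one multiply.
    recip = 65536 // k
    return (x * recip) >> 16
```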
In the inverse transform, because of the multiplication of Q and Si, an inverse relation exists between the ranges of Q and Si, and therefore an inverse relation exists between the range of Q and the resolution of the product. In the JPEG Baseline System the quantization values are 10 bits unsigned. Thus the largest possible quantization factor is 1023, or about 2^10. If the multiplication is performed in 16 bit arithmetic, Si has a range of 2^4. For small Q values the resolution of Si is the predominant contribution to the MSE.
Most modern computers, microprocessors, and special Digital Signal Processing chips have 32 bit (32b) multiplication, more than enough to solve this problem if used properly.
For high speed dedicated hardware it is desirable to use the same multiplier for both forward and inverse transforms. For "real-time" speed (at or above 30 megacycles for video imagery) a 16b multiplier is about the most resolution feasible. Larger multipliers take more silicon and run slower. Some JPEG Transform chips use a Discrete Cosine Transform instead of the Generalized Chen Transform and do not have a need for scaling and pre-scaling, i.e. Sf and Si. On the other hand, many DCT implementations do require the type of scaling the GCT calls for.
Please note, however, that for reasonable MSE it is necessary to have access to most of the 32b output. For the forward mode, division is achieved by scaling a number down from a large normalized number. It is necessary to take the result from the high order bits of the output. For the inverse, the numbers are multiplied; therefore a small normalized number is desirable. It is necessary to take the result from the low order bits of the output. The combination allows little or no reduction in multiplier hardware, such as trimming unnecessary bits.
Performance suffers greatly if the multiply is restricted to 16 bits, as is the case in the cross-referenced parent application for the Generalized Chen Transform, serial number 07/511,245 (see the Discussion of Performance section below). Specifically, in the inverse transform, the range of Q competes with the range of Si. The resolution of Si is most important when the quantization value is low, since high quantization numbers add so much distortion that the resolution of the multiply is insignificant.
7.2 Description of the Invention

One goal of the present invention is to provide the maximum performance for both the forward and inverse quantization with a 16 bit hardware multiplier, i.e. in 16 bit arithmetic. This requires a balance between range and resolution.
7.3 Forward Scaling and Quantization

In the forward mode, empirical results show a 16 bit hardware multiplier can provide enough resolution. The largest value (Sf x 2^16) can be chosen to be (2^16 - 1). Large Q reduce the range of values of (Sf/Q x 2^16), but the error due to the lack of resolution of this number is small compared to the error introduced by the quantization.
By properly scaling the input as well as Sf/Q, the output appears in the upper N bits of the multiplier output, i.e.

result = (input x Qfactor) >> N;

where ">>" denotes the shift right operation, and Qfactor = Sf/Q x 2^16.
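The forward path above can be modelled in a few lines of software. This is an illustrative sketch only; the function and variable names (`forward_quantize`, `s_f`, `q`) are ours, not from the patent, which fixes N = 16 and stores Sf/Q x 2^16 as Qfactor.

```python
N = 16  # the patent's preferred word width

def forward_quantize(value, s_f, q):
    """Multiply by Sf/Q using an N-bit fixed-point factor.

    Qfactor is Sf/Q prenormalized by 2^N; the low N bits of the
    product are shifted out, so the result appears in the upper bits,
    exactly as described for Fig. 11's forward mode.
    """
    qfactor = int(s_f / q * (1 << N))  # fits in N bits while Sf/Q < 1
    return (value * qfactor) >> N      # take the high-order bits

# Example with an illustrative scale factor 0.7 and quantizer step 16
print(forward_quantize(1000, 0.7, 16))  # prints 43
```

Because only the upper bits are kept, a hardware implementation can trim the gates that produce the lower 16 bits, apart from the carry into bit 16.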
The multiplication of the two N bit factors generally produces a 2N bit product. Note that since the result is taken from the upper 16 bits of the hardware multiplier it is possible to trim gates that deliver the lower 16 bits. All that is needed from the lower N bits is the relevant carry term.
This is illustrated in the two mode hardware implementation depicted in Fig. 11. When performing the forward transform the forward input (/Forward) is low (i.e. zero). Therefore the control multiplexor (MUX) 100 sends the ground (GND) signal of zero to the 16 to 1 MUX 104. This produces no shift (L0) to the input fed to B0-B15 of the multiplier. The zero signal on the /Forward lead to the multiplexor 108 sends the four bits of Qexponent to inputs A0-A3 of the 16 bit by 16 bit signed multiplier 106. In this case Qfactor = (Qmantissa << 4) + Qexponent, and the multiplier 106 produces the product Qfactor x Input. The output Result is equal to (Qfactor x Input) >> 16, since the 16 least significant digits are discarded as an unused word.
The present invention of this continuation-in-part application involves a process which aids the inverse dequantization by arranging a tradeoff between range and resolution which allows the highest accuracy for 16 bit operations. Empirically it was decided that approximately 12 bits of resolution is necessary for the desired MSE. Since 10 bits are needed for the quantization in the JPEG baseline specification, 24 bits of range is needed. This is achieved by using the highest 12 bits of the 16 bit factor as a mantissa and the lowest 4 bits as an exponent term of the base of two. The combination of the 2^4 possible shift values and the (16-4) bit mantissa gives an effective range of

effective range = [(16-4) + 2^4] bits = 28 bits.
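The 12-bit-mantissa/4-bit-exponent split described above can be sketched as follows. The field widths (12 and 4) come from the text; the helper names are ours.

```python
def pack_qfactor(value):
    """Encode an integer as (mantissa, exponent), mantissa < 2^12.

    value ~= mantissa << exponent, with 0 <= exponent < 2^4, giving
    the 28-bit effective range described in the text. Low-order bits
    beyond the 12-bit mantissa are dropped (resolution traded for range).
    """
    exponent = 0
    while value >= (1 << 12):
        value >>= 1
        exponent += 1
    assert exponent < (1 << 4), "value exceeds the 28-bit effective range"
    return value, exponent

def unpack_qfactor(mantissa, exponent):
    return mantissa << exponent

m, e = pack_qfactor(1_000_000)
print(m, e, unpack_qfactor(m, e))  # prints 3906 8 999936 (lossy)
```

Note the round trip is lossy (999936, not 1000000): the format keeps only 12 significant bits, which the text argues is enough resolution for the desired MSE.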
As shown in the two mode hardware implementation of Fig. 11, in the inverse mode when /Forward is high the control input to the 16 to 1 multiplexor 104 is Qexponent. A control input of value i to the multiplexor produces a left shift of i digits (Li) to the input value. The 12 bits of Qmantissa are fed to inputs A4-A15 of the multiplier 106. The high control value from /Forward to the multiplexor 108 sends zeros from ground (GND) to bits A0-A3 of the multiplier 106. Note once again that the output Result resides in the upper 16 bits of the multiplier output and the 16 least significant digits are discarded as an unused word. The result is therefore governed by

result = ((input << Qexponent) x Qscaler) >> 16;

where

Qscaler << Qexponent = Si x Q x 2^16, Qscaler < 2^12, and 0 <= Qexponent <= (2^4 - 1).
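A minimal software model of this inverse path, using the variable names of the text (this is a sketch of the arithmetic, not an actual chip interface):

```python
def inverse_dequantize(coeff, qscaler, qexponent):
    """Fig. 11 inverse mode: result = ((input << Qexponent) * Qscaler) >> 16.

    Qscaler << Qexponent represents Si * Q * 2^16, with Qscaler held
    to 12 bits and Qexponent to 4 bits, per the text.
    """
    assert qscaler < (1 << 12) and 0 <= qexponent < (1 << 4)
    return ((coeff << qexponent) * qscaler) >> 16

# Continuing the packed example: Si * Q * 2^16 ~= 3906 << 8
print(inverse_dequantize(43, 3906, 8))  # prints 656
```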
Since the input value is left shifted, it is necessary that the input be restricted. Otherwise the value will overflow and produce spurious results. This is done implicitly, however, by the fact that these numbers have been quantized by the same factor now used in a multiply. This is the reason the present invention is not generalizable to any multiply.
An enhancement to the invention is shown in Figure 12 where a left shift by multiplexor 110 occurs after the multiplication step by the 16 bit by 16 bit multiplier 112. In the forward transform the control input /Forward is low. The zero signal to the control input of multiplexor 114 sends the four bits of Qexponent to inputs A0-A3 of the multiplier 112. The 16 most significant digits (Q31-Q16) of the 32 bit product of multiplier 112 are selected by the 16 to 1 multiplexor 110 according to the ground signal (GND) from the multiplexor 116.
In the inverse transform the /Forward signal is high. Therefore the ground signal (GND) is sent by multiplexor 114 to inputs A0-A3 of the 16 bit by 16 bit signed multiplier 112. Bits Qi-Qj, where i=32-Qexponent and j=i-15, of the 32 bit output from multiplier 112 are selected as the Result according to the value of Qexponent input to the 16 to 1 multiplexor 110 through multiplexor 116. Since no left shifting of the input value is performed, the input value need not be restricted in range (i.e. preformatted). In this case the operations are mathematically represented by

Result = ((input x Qscaler) << Qexponent) >> 16.
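The Figure 12 variant can be modelled the same way. Shifting the 32-bit product left by Qexponent gives the same integer result as pre-shifting the input, which is why the input no longer needs range restriction (the helper name is ours):

```python
def inverse_dequantize_postshift(coeff, qscaler, qexponent):
    """Fig. 12 inverse mode: shift the product, not the input.

    Selecting a 16-bit window Qexponent bits higher in the 32-bit
    product is equivalent to (product << Qexponent) >> 16.
    """
    product = coeff * qscaler            # full 32-bit product first
    return (product << qexponent) >> 16  # then select the shifted window

print(inverse_dequantize_postshift(43, 3906, 8))  # prints 656
```

For integers the two orderings are exactly equal: ((c << e) * s) >> 16 == ((c * s) << e) >> 16, so the enhancement changes only where overflow can occur, not the result.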
7.5 Discussion of Performance

The table below shows the mean square error (MSE) results of an experiment performed on the image Barbara, a 704x576x8b grey level test image from CCITT. The quantization values are all ones in the first case and are from the suggested luminance quantization table in the JPEG standard in the second case. The image was processed by a software simulation of a chip using a 32b multiplier, a 16b multiplier, and the present invention implemented with a 16b multiplier, using a 12b mantissa and a 4b exponent as described in the previous section. The results are represented in the table below.
Multiplier    MSE (Q=1)    MSE (JPEG Suggested Table of Q values)
Invention     0.121039     34.809839
16b           0.184494     34.822167
32b           0.120354     34.814157

As expected, the major difference between the multipliers occurs when Q is low, i.e. when quality reproduction is desired. With less hardware the present invention offers near 32 bit accuracy; the difference in MSE is not visually significant. However, to meet the CCITT Recommendation H.261 transform mismatch compliance test, the present invention must utilize parameter values more closely approximating a discrete cosine transform.
To implement the 32b multiplier takes about 85% more silicon surface area than the 16b multiplier(1), a major consideration for integrated circuit technology. The invention adds only 30% more to the area. It is interesting to note that the single multiplier uses about 10% of the silicon.
(1) Estimate based on 1.0 um CMOS Standard Cell technology. This is the technology that the HGCT is implemented with.
8.0 Pipelined Implementation of the GCT Transform

8.1 Background

It is desirable to build VLSI chips that perform the international standard image compression system proposed by the JPEG committee of CCITT. Many applications require that the VLSI chip runs at video rates, which implies approximately 8 to 10 million pixels per second (depending on resolution). Each pixel is normally comprised of three colors such as red, green, and blue. Because most VLSI implementations work on one component at a time, the required clock rate is triple the pixel rate. This puts the chip clock rate at approx. 25-30 MHz. This is a high clock speed, even by the standards of 1991. Most conventional implementations of the DCT use a mix of multipliers and adders to perform the transform. Multipliers are usually the bottleneck for most implementations. Other functions such as RAMs and ROMs form secondary bottlenecks. To overcome these bottlenecks, long pipeline architectures are used. A typical pipeline on a DCT chip may be as many as 200 clock cycles, meaning two hundred processes are occurring in parallel inside the chip.

Figure 13 shows a conventional pipeline architecture for a discrete cosine transform. Pixel components arrive at the left hand side of the figure and are latched into the latch device 120 in parallel vectors of size 1 x 8. These 1 x 8 vectors are passed to a 1-dimensional transform 122 to perform the DCT. The 1 x 8 row vectors are then transposed by the transpose device 124 to column vectors of the form 8 x 1. After the transpose, in conventional systems, the transposed vectors are fed to the second DCT unit 126 for transformation. While this second transform is occurring, the first transform unit 122 is kept busy with the next 1 x 8 row vector. Hence the pipeline effect. The final multiply is performed by the multiplier unit 128. Because the DCT is the computational bottleneck for the system, such a structure as above is required to achieve video rates.
Although Figure 13 has been simplified for clarity it is important to understand the constraints on the total transform. Remember that multiplication operations are the bottlenecks in the system. Because the transform units 122 and 126 contain multiplications, they and the final multiplier 128 form roughly equal bottlenecks. Let us assume that it requires x nS to perform a single multiplication. In Figure 13 (where the two DCT transform units 122 and 126 are present) each transform unit 122 and 126 performs computation on 8 components simultaneously. Thus, the transform unit has 8x nS to do the computation. This is currently feasible with today's architectures.
8.2 The Present Invention

The Generalized Chen Transform of the present invention requires no multiplications in the main transform, and only one multiplication per component at the end of the transformation process. The main 1D GCT consists of an array of 38 adders arranged in a maximum of 7 discrete levels (see figures 8 and 9). The adder array includes hardwired shifts and can thereby generate multiplications and divisions by factors of two, as discussed above. By further breaking the 7 levels into 2 separate sections (this split is easy due to the simple structure of the GCT), the maximum number of adder levels is reduced to 4. By making such a split, the transform no longer becomes the bottleneck for data flow. It is the final multiplication and RAM accesses that cause the bottlenecks. This means that maximum performance is controlled by the design of these elements. However, now that the final multiplication has become a bottleneck, it is possible to utilize an additional feature with the transform unit. Figure 14 shows such an arrangement.
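To illustrate how an adder array with hardwired shifts can stand in for a multiplier, here is one way a fixed fractional coefficient can be built from shifts and adds alone. The particular constant and approximation below are our own example, not values taken from figures 8 and 9.

```python
def times_approx_0_414(a):
    """Multiply by ~0.414 (e.g. tan(pi/8)) with no multiplier.

    0.414 is approximated by 1/4 + 1/8 + 1/32 = 0.40625; each term is a
    hardwired right shift, combined by two adder levels. In hardware the
    shifts cost nothing (wiring), so only adders remain on the path.
    """
    return (a >> 2) + (a >> 3) + (a >> 5)

print(times_approx_0_414(1024))  # prints 416 (exact: 0.414 * 1024 ~= 424)
```

This is the sense in which the GCT's main transform is multiplication-free: its rotation coefficients are chosen so that such shift-and-add networks are exact, with the residual scaling folded into the single final multiply per component.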
In Figure 14, following the input latch 130 for 8 x 1 row vectors is a multiplexor 132 that selects one of two inputs to feed to the one-dimensional transform circuit 134. Note that the important difference now is that there is only one transform unit 134. After a preset amount of time the transform unit 134 will have completed the transform on the incoming row vector. After passing through the transposition RAM 136, the transposed row vector is passed back by a second multiplexor 138 to the first multiplexor 132 and on to the one and only transform unit 134. Now the columns are transformed. After the columns have been transformed and transposed, the results are diverted to the multiplier 140. Now it is apparent that on average the transform unit must work in 4x nS. This is where the simple adder GCT network provides a great advantage. Adders are much faster than multipliers and therefore such time division multiplexing is possible.
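The time-shared arrangement of Fig. 14 can be sketched in software: one 1-D transform routine is applied to the rows, the result is transposed, and the very same routine is applied again to the columns. The toy butterfly below is an illustrative stand-in for the GCT stage, not the GCT itself.

```python
def transpose(block):
    # Software analogue of the transposition RAM 136
    return [list(col) for col in zip(*block)]

def two_d_transform(block, transform_1d):
    """Separable 2-D transform reusing a single 1-D unit (Fig. 14 idea)."""
    rows = [transform_1d(r) for r in block]            # first pass: rows
    cols = [transform_1d(r) for r in transpose(rows)]  # same unit, reused
    return transpose(cols)

def butterfly(v):
    # Toy sum/difference 1-D "transform" on length-2 vectors
    return [v[0] + v[1], v[0] - v[1]]

print(two_d_transform([[1, 2], [3, 4]], butterfly))  # prints [[10, -2], [-4, 0]]
```

In hardware the reuse is in time rather than in a loop: the multiplexors 132 and 138 route row vectors and then transposed column vectors through the single transform unit 134.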
The GCT itself is a considerable savings over the DCT. The implementation shown in figure 14 provides another 50% savings by only having to have 1 transform unit instead of two. To put this in perspective, the design of the present invention uses only 1 transform unit 134, and that unit occupies between 40 and 50% of the chip. The other 50% goes to RAMs, latches, the multiplier 140, I/O, and so forth. One can see that a second transform unit would increase silicon area by approximately 50%.
8.3 Summary

The use of an adder-only network with time division multiplexing provides an efficient JPEG implementation providing performance 50% greater than video rates.
9.0 Generalizations

Although the examples in this disclosure are limited to transform based image coding, the multiplier invention might be generalized to any quantization scheme where the output is multiplied by the same number that the input has been divided by.
While generalizable to a certain degree because several algorithms use similar quantization schemes, the multiplier invention is only legitimate in the context of quantization and dequantization.
Although the preferred embodiment utilizes 16 bit arithmetic, in general this invention may be applied to such processes utilizing N bit arithmetic. Also, the present invention is compatible with existing standards, such as JPEG. The preferred embodiment was chosen and described in order to best explain the principles of the invention and its practical applications to thereby enable others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined only by the claims appended hereto.

10.0 References

[1] Acheroy, M., "Use of the DCT for the restoration of an image sequence", SPIE vol. 593, Medical Image Processing (1985).
[2] Cooley, J.W. and Tukey, J.W., "An algorithm for the machine calculation of complex Fourier series", Math. Comput., vol. XIX, no. 90, pp. 297-301, 1965.
[3] Chen, W. et al., "A fast computational algorithm for the DCT", IEEE Trans. Commun., vol. COM-25 (1977).
[4] Wu, H.R. and Paolini, F.J., "A 2D Fast Cosine Transform", IEEE Conf. on Image Processing, vol. 1 (1989).
[5] Lee, B.G., "A Fast Cosine Transform", IEEE ASSP, vol. XXXIII (1985).
[6] Jalali and Rao, "Limited Wordlength and FDCT Processing Accuracy", IEEE ASSP-81, vol. III, pp. 1180-2.
[7] Wu, H.R. and Paolini, F.J., "A Structural Approach to Two Dimensional Direct Fast Discrete Cosine Transform Algorithms", International Symposium on Computer Architecture and Digital Signal Processing, Hong Kong, Oct. 1989.

Claims (1)

  CLAIMS

  1. A data decompression method for inverting a transform which converts an original sequence of values to a sequence of transform domain coefficients, said method comprising the steps of:
  (a) multiplying each of said transform domain coefficients by a Qfactor stored in an N bit storage register in the form of an M bit exponent identified as Qexponent, and an (N-M) bit mantissa identified as Qmantissa, where Qfactor = Qmantissa x 2^Qexponent, so as to provide Qfactor with a range of values greater than 2^N, and
  (b) converting the sequence of multiplied transform domain coefficients to a second sequence of values approximating the original sequence of values.

  2. The data decompression method of claim 1 wherein the multiplying step includes the steps of multiplying by Qmantissa using an integer multiplier unit, and left shifting by Qexponent bits.

  3. The data decompression method of claim 2 wherein the step of left shifting by Qexponent bits follows the step of multiplying by Qmantissa using the integer multiplier unit.

  4. The data decompression method of either claim 2 or 3 wherein the step of multiplying by Qmantissa using the integer multiplier unit follows the step of left shifting by Qexponent bits.

  5. The data decompression method of any one of claims 2, 3 and 4 wherein the Qfactor is equal to the product of a scaling factor and a quantization factor.

  6. The data decompression method of any preceding claim wherein said sequence of transform domain coefficients has a length of L values, and wherein the inverse transform operation is separated into L initial multiplications and a series of additions and shifts in an adder array network.

  7. The data decompression method of any one of claims 1 to 5 including the use of an adder array, wherein the inverse transform operation is separated into the product of a diagonal matrix and a non-diagonal matrix, and the operation by the non-diagonal matrix may be performed by said adder array.

  8. The data decompression method of claim 7 wherein said adder array has less than 7 stages and is composed of less than 39 adder units.

  9. The data decompression method of claim 7 or 8 wherein the transform is a Generalized Chen Transform which approximates a discrete cosine transform.

  10. The data decompression method of claim 5 wherein said scaling factor includes a psychoadaptive weight factor.

  11. The data decompression method of claim 5 or 10 wherein said scaling factor includes a deblurring factor.

  12. The data decompression method of any one of claims 2 to 11 wherein the original sequence of values represents a two-dimensional grid of image pixels.

  13. The data decompression method of any preceding claim wherein N is equal to 16, and M is equal to 4.

  14. The data decompression method of any one of claims 2 to 13 wherein Qfactor is prenormalized by 2^N and the N least significant digits of the multiplication product are discarded.
  15. A data compression method for performing a transform on an original sequence of values to a sequence of transform domain coefficients, said method comprising the steps of:
  (a) converting an original sequence of values to a sequence of transformed values; and
  (b) converting the sequence of transformed values to a sequence of transform domain coefficients by multiplication of each transformed value by a Qfactor and discarding the N least significant digits of the output, where Qfactor is prenormalized by a factor of 2^N and stored in an N bit storage register.

  16. The data compression method of claim 15 including a scaling factor and a quantization factor, wherein Qfactor is equal to 2^N times said scaling factor divided by said quantization factor.

  17. The data compression method of claim 16 wherein the maximum value of all of the Qfactors is (2^N - 1).

  18. The data compression method of any one of claims 15 to 17 wherein said sequence of original values has a length of L values and the transform operation is separated into a series of additions and L final multiplications.

  19. The data compression method of any one of claims 15 to 18 wherein the transform operation is separated into the product of a diagonal matrix and a non-diagonal matrix, and the operation by the non-diagonal matrix may be performed by an adder array.

  20. The data compression method of claim 19 wherein said adder array has less than 7 stages and is composed of less than 39 adder units.

  21. The data compression method of claim 19 or 20 wherein the transform operation is a Generalized Chen Transform which approximates a discrete cosine transform.

  22. The data compression method of any one of claims 16 to 21 wherein said scaling factor includes an inverse psychoadaptive weight factor.

  23. The data compression method of any one of claims 15 to 22 wherein the original sequence of values represents a two-dimensional grid of image pixels.
  24. The data compression method of any one of claims 15 to 23 wherein N is equal to 16.

  25. A two mode apparatus for performing the multiplication of an input integer and a Qvalue to produce a product, comprising:
  (a) a multiplexor for left shifting the input integer by a shift integer number of bits;
  (b) an integer multiplier for multiplying a factor and the shifted input integer;
  (c) a two mode Qvalue processor, when the processor is in the forward mode the shift integer is set to zero and the factor is set to Qvalue, and when the processor is in the inverse mode the shift integer is set to Qexponent and the factor is set equal to Qmantissa, where Qvalue = Qmantissa x 2^Qexponent.

  26. The apparatus of claim 25 including a compression factor, wherein in the forward mode the Qvalue is inversely proportional to said compression factor and in the inverse mode the Qvalue is proportional to said compression factor, whereby the apparatus functions as a data compressor/decompressor apparatus.

  27. The apparatus of claim 26 including a 4 bit storage register, a 12 bit storage register, and a 16 bit storage register, wherein Qexponent is stored in said 4 bit storage register, Qmantissa is stored in said 12 bit storage register, and the input integer is stored in said 16 bit storage register.

  28. The apparatus of any one of claims 25 to 27 wherein Qvalue is premultiplied by 2^N and the N bits are trimmed from the multiplier output.

  29. The apparatus of any one of claims 25 to 28 wherein the input integer is a Generalized Chen Transform coefficient in the inverse mode.

  30. The apparatus of any one of claims 26 to 29 wherein in the forward mode the Qvalue is proportional to a psychoadaptive weight factor and in the inverse mode the Qvalue is proportional to an inverse psychoadaptive weight factor.

  31. The apparatus of any one of claims 26 to 30 wherein in the forward mode the Qvalue is proportional to a deblurring factor.
    31. The apparatus of any one of claims 26 to 30 wherein in the forward mode the Qvalue is proportional to a deblurring factor.
    .CLME:
    i0 32 A two mode apparatus for performing the multiplication of an input integer and a Qvalue to produce a product, comprising:
    .CLME:
    (a) a multiplexor for left shifting the multiplier output by a shift integer number of bits; (b) an integer multiplier unit for performing the multiplication of a factor and the input integer; (c) a two mode Qvalue processor, when the processor is in the forward mode the shift integer is set to zero and the factor is set to Qvalue, and when the processor is in the inverse mode the shift integer is set to Qexponent and the factor is set equal to Qmantissa, where Qvalue = Qmantissa 2Qexnt.
    .CLME:
    33. The apparatus of claim 32 including a compression factor, wherein in the forward mode the the Qvalue is inversely proportional to said compression factor and in the inverse mode the Qvalue is proportional to said compression factor, whereby the apparatus functions as a data compressor/decompressor apparatus.
    .CLME:
    34. The apparatus of claim 33 including a 4 bit storage register, and 12 bit storage register and a 16 bit storage register, wherein Qexponent is stored in said 4 bit storage register, Qmantissa is stored in said 12 bit storage register, and the input integer is stored in said 16 bit storage register.
    .CLME:
    i0 35. The apparatus of claim 33 or 34 premultiplied by 2N and N bits are multiplier output.
    .CLME:
    wherein Qvalue is trimmed from the 36. The apparatus of claim 35 including three 16 bit storage registers, wherein Qvalue, the product and the input integer are stored in said storage registers.
    .CLME:
    37. The apparatus of any one of claims 32 to 36 wherein the input integer is a Generalized Chen Transform coefficient in the inverse mode.
    .CLME:
    38. The apparatus of any one of claims 33 to 37 wherein in the forward mode the Qvalue is proportional to a psychoadaptive weight factor and in the inverse mode the Qvalue is proportional to an inverse psychoadaptive weight factor.
    .CLME:
    39. The apparatus of any one of claims 33 to 38 wherein in the forward mode the Qvalue is proportional to a deblurring factor.
    .CLME:
  40. A method utilizing a pipelined architecture for performing a two-dimensional transform, said method comprising the steps of:
  separating said two-dimensional transform into two consecutive one-dimensional transforms wherein the two one-dimensional transforms are each separated into a fast stage and a slow stage, said fast stage having a computation time faster than that of the slow stage, and
  performing the two fast stages of said two one-dimensional transforms in one section of the pipeline of said pipeline architecture.

  41. The method of claim 40 wherein each fast stage is performed by an adder array network.

  42. The method of claim 40 or 41 wherein the two fast stages are performed consecutively by substantially the same adder array network.

  43. The method of any one of claims 40 to 42 wherein the two slow stages are algebraically combined, and performed as a single stage.

  44. The method of any one of claims 40 to 43 wherein said one-dimensional transforms are Generalized Chen Transforms which approximate discrete cosine transforms.

  45. An apparatus for performing a two-dimensional transform which is separated into two one-dimensional transforms, said apparatus having a pipelined architecture comprising:
  (a) a first processor in the pipeline of said pipelined architecture, each of said one-dimensional transforms being divisible into a first portion and a second portion, said first processor performing said first portion of said one-dimensional transforms;
  (b) a transposer in the pipeline of said pipelined architecture for reordering a first set of vectors to produce a second set of vectors, wherein the nth entry of the mth vector of said first set of vectors becomes the mth entry of the nth vector of said second set of vectors;
  (c) a final processor in the pipeline of said pipelined architecture for performing said second portions of said two one-dimensional transforms; and
  (d) a routing system, said routing system including means for directing a third set of vectors to said first processor to generate said first set of vectors, directing said first set of vectors to said transposer to generate said second set of vectors, directing said second set of vectors to said first processor to generate a fourth set of vectors, and directing said fourth set of vectors to said second processor to generate a set of two-dimensionally transformed vectors.

  46. The apparatus of claim 45 wherein said first, second, third and fourth sets of vectors are Mx1 vectors of cardinality M.

  47. The apparatus of claim 45 or 46 wherein the processing time of said first processor to produce said first set and said fourth set of vectors is not much greater than the processing time of said second processor to produce said set of transformed vectors.

  48. The apparatus of claim 47 wherein said first processor is an adder array network.

  49. The apparatus of claim 47 or 48 wherein said second processor algebraically combines said second portions of said two one-dimensional transforms.

  50. The apparatus of any one of claims 45 to 49 wherein said one-dimensional transforms are Generalized Chen Transforms which approximate discrete cosine transforms.
  51. An image compressor comprising
  transform means for receiving input pixels having a certain bit width and for horizontally or vertically transforming said input pixels using only adder array means in a time sharing arrangement,
  transposition memory means for rotating the horizontally or vertically transformed pixels to vertical or horizontal,
  said transform means including means for receiving the vertical or horizontal pixels and means for vertically or horizontally transforming the vertical or horizontal pixels using said adder array means, and
  single multiplier means for receiving the transformed pixels and for performing a single multiplication function on said transformed pixels to provide compressed pixel data representative of said input pixels.

  52. In an image compressor, the method comprising the steps of,
  receiving input pixels having a certain bit width,
  horizontally or vertically transforming said input pixels in a time sharing arrangement using only adder array means,
  rotating the transformed pixels to vertical or horizontal,
  transforming the vertical or horizontal pixels using only said adder array means, and
  performing a single multiplication function on said transformed pixels to provide compressed pixel data representative of said input pixels.

  53. An image compression system comprising
  means for receiving input image pixel data representative of an image,
  Generalized Chen transform (GCT) means for compressing said image data, said GCT means including GCT adder means for horizontally transforming said image data using only adders,
  transposition memory means for rotating the horizontally transformed pixels to vertical,
  said GCT adder means including means for vertically transforming the vertical pixels using only said adders, and
  multiplier means for performing a multiplication function on said transformed vertical pixels to provide compressed pixel data representative of said input pixels,
  said GCT adder means including a first GCT adder network stage for transforming the first half of said horizontal and vertical transformations and a second GCT adder network stage for transforming the second half of said horizontal and vertical transformations.

  54. An image compression system as in Claim 53 wherein said first and second adder means horizontally and vertically transform said image pixels in a time sharing arrangement.

  55. An image compression system as in Claim 53 or 54 wherein said multiplier means include zigzag order means.

  56. An image compression system as in any one of Claims 53 to 55 wherein said multiplier means include rounder means.

  57. An image compression system as in any one of Claims 53 to 56 wherein said multiplier means include multiplier table means.

  58. A data compression method substantially as hereinbefore described with reference to the accompanying drawings.

  59. A two mode apparatus constructed and arranged substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.

  60. A method utilizing a pipelined architecture substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
  61. Apparatus for performing a two dimensional transform constructed and arranged substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.

  62. An image compressor constructed and arranged substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.

  63. An image compression system constructed and arranged substantially as hereinbefore described with reference to and as illustrated in the accompanying drawings.
GB9304142A 1992-02-28 1993-03-01 An apparatus and method for compressing still images Expired - Fee Related GB2264609B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US84376592A 1992-02-28 1992-02-28

Publications (3)

Publication Number Publication Date
GB9304142D0 GB9304142D0 (en) 1993-04-14
GB2264609A true GB2264609A (en) 1993-09-01
GB2264609B GB2264609B (en) 1996-01-10

Family

ID=25290953

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9304142A Expired - Fee Related GB2264609B (en) 1992-02-28 1993-03-01 An apparatus and method for compressing still images

Country Status (3)

Country Link
JP (1) JP3155383B2 (en)
DE (1) DE4306010C2 (en)
GB (1) GB2264609B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1406179A1 (en) * 2001-07-11 2004-04-07 Techno Mathematical CO., LTD. Dct matrix decomposing method and dct device
WO2010091930A3 (en) * 2009-02-12 2012-03-08 Zoran (France) Frame buffer compression for video processing devices
USRE44138E1 (en) 2001-08-09 2013-04-09 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4507265B2 (en) 2008-06-30 2010-07-21 ルネサスエレクトロニクス株式会社 Image processing circuit, and display panel driver and display device having the same

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5129015A (en) * 1990-04-19 1992-07-07 Ricoh Company Ltd. Apparatus and method for compressing still images without multiplication
GB2253113A (en) * 1991-01-02 1992-08-26 Ricoh Kk Rounding numbers in quantization method used for image compression
GB2259215A (en) * 1991-08-09 1993-03-03 Ricoh Kk An apparatus and method for compressing still images
GB2259824A (en) * 1991-09-19 1993-03-24 Sony Broadcast & Communication Image data compression with dynamically controlled quantiser

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US5021891A (en) * 1990-02-27 1991-06-04 Qualcomm, Inc. Adaptive block size image compression method and system
US5305399A (en) * 1990-04-19 1994-04-19 Ricoh Corporation Two dimensional shift-array for use in image compression VLSI

Cited By (12)

Publication number Priority date Publication date Assignee Title
EP1406179A1 (en) * 2001-07-11 2004-04-07 Techno Mathematical CO., LTD. Dct matrix decomposing method and dct device
EP1406179A4 (en) * 2001-07-11 2013-01-23 Techno Mathematical Co Ltd Dct matrix decomposing method and dct device
USRE44138E1 (en) 2001-08-09 2013-04-09 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE44234E1 (en) 2001-08-09 2013-05-21 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE44285E1 (en) 2001-08-09 2013-06-11 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE44319E1 (en) 2001-08-09 2013-06-25 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE44891E1 (en) 2001-08-09 2014-05-13 Sharp Kabushiki Kaisha Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE46370E1 (en) 2001-08-09 2017-04-18 Dolby Laboratories Licensing Corporation Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE47258E1 (en) 2001-08-09 2019-02-26 Dolby Laboratories Licensing Corporation Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
USRE47277E1 (en) 2001-08-09 2019-03-05 Dolby Laboratories Licensing Corporation Systems and methods for reduced bit-depth processing in video-related data with frequency weighting matrices
WO2010091930A3 (en) * 2009-02-12 2012-03-08 Zoran (France) Frame buffer compression for video processing devices
US9185423B2 (en) 2009-02-12 2015-11-10 Zoran (France) S.A. Frame buffer compression for video processing devices

Also Published As

Publication number Publication date
GB2264609B (en) 1996-01-10
GB9304142D0 (en) 1993-04-14
DE4306010A1 (en) 1993-09-23
JPH0646269A (en) 1994-02-18
JP3155383B2 (en) 2001-04-09
DE4306010C2 (en) 1997-11-06

Similar Documents

Publication Publication Date Title
US5319724A (en) Apparatus and method for compressing still images
US5129015A (en) Apparatus and method for compressing still images without multiplication
US5590067A (en) Method and arrangement for transformation of signals from a frequency to a time domain
JP5623565B2 (en) Apparatus and method for encoding and calculating a discrete cosine transform using a butterfly processor
US7218789B2 (en) Method for lossless encoding of image data by approximating linear transforms and preserving selected properties for image processing
US6792155B2 (en) Reversible DCT for lossless—lossy compression
CN101399988B (en) Method for reduced bit-depth quantization
EP0714212A2 (en) Video decoder using concurrent processing and resource sharing
US6870885B2 (en) Apparatus and method for decoding and computing a discrete cosine transform using a butterfly processor
US6996595B2 (en) Apparatus and method for consolidating output data from a plurality of processors
US7076105B2 (en) Circuit and method for performing a two-dimensional transform during the processing of an image
US5664028A (en) Apparatus and method for compressing still images
GB2264609A (en) Data compression
US5594812A (en) Apparatus and method for compressing still images
CN108184127B (en) Configurable multi-size DCT (discrete cosine transform) transformation hardware multiplexing architecture
US7756351B2 (en) Low power, high performance transform coprocessor for video compression
Struharik et al. FPGA Implementation of the 2D-DCT/IDCT for the Motion Picture Compression

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20080301