GB2339989A - Reduced memory video decoder stores locally compressed decoded pictures - Google Patents


Info

Publication number
GB2339989A
Authority
GB
United Kingdom
Prior art keywords
compression
algorithm
video decoder
lossy
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB9810769A
Other versions
GB9810769D0 (en)
GB2339989B (en)
Inventor
Wolfram Keck
Fabrice Bellard
Adrian Philip Wise
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LSI Corp
Original Assignee
LSI Logic Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LSI Logic Corp filed Critical LSI Logic Corp
Priority to GB9810769A priority Critical patent/GB2339989B/en
Publication of GB9810769D0 publication Critical patent/GB9810769D0/en
Publication of GB2339989A publication Critical patent/GB2339989A/en
Application granted granted Critical
Publication of GB2339989B publication Critical patent/GB2339989B/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42: characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/423: characterised by memory arrangements
    • H04N19/426: using memory downsizing methods
    • H04N19/428: Recompression, e.g. by spatial or temporal decimation
    • H04N19/10: using adaptive coding
    • H04N19/102: characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12: Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N19/124: Quantisation
    • H04N19/134: characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136: Incoming video signal characteristics or properties
    • H04N19/14: Coding unit complexity, e.g. amount of activity or edge presence estimation
    • H04N19/146: Data rate or code amount at the encoder output
    • H04N19/15: by monitoring actual compressed data size at the memory before deciding storage at the transmission buffer
    • H04N19/152: by measuring the fullness of the transmission buffer
    • H04N19/169: characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17: the unit being an image region, e.g. an object
    • H04N19/172: the region being a picture, frame or field
    • H04N19/176: the region being a block, e.g. a macroblock
    • H04N19/60: using transform coding
    • H04N19/61: transform coding in combination with predictive coding

Description

METHOD AND APPARATUS FOR DECODING VIDEO DATA

The present invention relates to a method and apparatus for decoding video bitstreams, particularly although not exclusively bitstreams encoded according to International Standards ISO/IEC 13818-2 and ISO/IEC 11172-2 (commonly referred to as MPEG video).
In accordance with customary terminology in the video art, the term "frame" as used herein consists of two fields, which fields are interlaced together to provide an image, as with conventional analog television. The term "picture" is intended to mean a set of data in a bitstream for representing an image. A video encoder may choose to code a frame as a single frame picture, in which case there is a single picture transmitted consisting of two interlaced fields, or as two separate field pictures for subsequent interlacing, in which case two consecutive pictures are transmitted by the encoder. In a frame picture the two fields are interleaved with one another on a line-by-line basis.
Pels ("Picture Elements") usually consist of an 8 bit (sometimes 10 bit) number representing the intensity of a given component of the image at the specific point in the image where that pel occurs. In a picture (field-picture or frame-picture), the pels are grouped into blocks, each block having 64 pels organised as 8 rows by 8 columns. Six such blocks are grouped together to form a "macroblock". Four of these represent a 16 by 16 area of the luminance signal. The remaining two represent the same physical area of the image but are the two colour difference signals (sampled at half the linear resolution of the luminance). Within a picture the macroblocks are processed in the same order as words are read on the page, i.e. starting at the top-left and progressing left-to-right before going to the next row (of macroblocks) down, which is again processed in left-to-right order. This continues until the bottom-right macroblock in the picture is reached.
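As an illustrative sketch (not part of the patent text), the macroblock organisation and raster-scan processing order described above can be expressed as follows; the picture dimensions are assumed inputs:

```python
def macroblock_layout(width, height):
    """Compute the raster-scan order of 16x16 macroblocks in a picture.

    Each macroblock covers a 16x16 luminance area and carries six 8x8
    blocks: four luminance blocks plus two colour-difference blocks
    sampled at half the linear resolution (4:2:0).
    """
    mb_cols = (width + 15) // 16   # macroblocks per row
    mb_rows = (height + 15) // 16  # rows of macroblocks
    # Top-left first, left-to-right within a row, then the next row down.
    order = [(row, col) for row in range(mb_rows) for col in range(mb_cols)]
    return mb_cols, mb_rows, order

# A 48x32 picture holds 3x2 macroblocks, ending at the bottom-right one.
cols, rows, order = macroblock_layout(48, 32)
```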
MPEG video is composed of a number of different types of pictures denoted as (a) I-pictures (Intra pictures), which are compressed using intra picture coding and do not reference any other pictures in the coded stream; (b) P-pictures (Predicted Pictures), which are coded using motion-compensated prediction from past I-pictures or P-pictures; and (c) B-pictures (Bidirectionally Predicted Pictures), which provide a high degree of compression and are coded using motion-compensated prediction from past and/or future I-pictures or P-pictures.
Various techniques have been used in the past to decode MPEG bitstreams using less memory than a conventional implementation (which requires sufficient storage for three entire frames of video).
These can be summarised as:
1. "Two and a half frame decoders". These require two entire framestores (each framestore contains uncompressed data for both luminance and chrominance, from which predictions are formed) and in addition a further half framestore (or so) in order to deal with the B-pictures. These schemes are only applicable to 625-line television systems (in 525-line systems 3 framestores are still used, but since each requires less memory than in a 625-line system the total memory used is similar to that for 2.5 625-line framestores).
2. "B-frame on the fly decoders". These use only two framestores to store the "anchor frames" from which predictions are formed. B-pictures are not stored in an external memory device, instead they are converted from the inherent block structure (in which they are decoded) to a raster (suitable for display) in an internal memory buffer. This method is the subject of LSI's pending applications GB 9726145.7 and GB 9716293.7.
3. An approach involving compression of stored picture data - see "A new approach for memory efficient ATV decoding", H. Sun et al, Mitsubishi Electric ITA, IEEE Transactions on Consumer Electronics, Vol. 43, No. 3, August 1997, p. 517.
Approaches 1 and 2 are fundamentally limited by the requirement to store two entire framestores to enable predictions to be made. They cannot operate using less memory than required by the two framestores, and in practice require somewhat more than this for other uses such as the elementary stream channel buffer.
Approach 3 uses local compression for the storage of anchor frames. However, the approach taken is to use inherently lossy compression, giving the problem of accumulation of mismatch errors, and frame-oriented partitioning not suitable for interlaced sequences.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a video decoder with reduced memory requirements.
The present invention provides in a first aspect a video decoder for decoding encoded video pictures, the decoder including storage means for storing decoded video pictures, and compression means for selectively locally compressing the decoded video pictures prior to storage, and decompression means for decompressing the stored compressed pictures prior to display or use in decoding, wherein the local video compression means includes means for selecting either a first intra compression algorithm, or a second intra compression algorithm for compressing the video pictures prior to storage.
In a further aspect, the present invention provides a method of decoding encoded video pictures, comprising storing decoded video pictures for decoding other video pictures, and compressing the decoded video pictures prior to storage and decompressing the stored video pictures for display or for use in decoding, wherein the selected video pictures are selectively compressed according to a first intra compression algorithm or a second intra compression algorithm.
In accordance with the invention, depending on the particular circumstances the pictures may be compressed according to a first lossless algorithm or a second lossy algorithm.
The compression algorithms employed are intra schemes, i.e. schemes which do not reference data in adjacent pictures; this is to be distinguished from the compression schemes for transmitting MPEG pictures, which normally compress data representing the difference between adjacent pictures and are therefore inter compression schemes.
For the purpose of the present specification, the term lossless compression means:
Whenever lossless local compression is used, the pixel data after consecutive compression and decompression is identical to the data without any local compression, i.e. they do not differ at all. That means the local compression scheme does not introduce any alterations; pixels remain mathematically identical. Lossless compression includes, in a limiting case, no compression at all.
Lossless local compression faces the constraint that a given compression ratio cannot be guaranteed in all cases, i.e. there are always image regions where lossless compression (to a given amount of data) simply is not possible. Therefore, lossless local compression is only well suited to small compression ratios (typically below 2:1).
For the purposes of the present specification, the term lossy compression means:
Whenever lossy local compression is used, pixel data after consecutive compression and decompression is similar to, but not identical to the pixels without any local compression. Therefore, in this approach pixel values are actually changed by the local compression and decompression scheme. However, the mathematical difference usually is small enough to be not perceptible by a human viewer.
Lossy local compression allows a trade-off between picture quality and the compression ratio achieved. Therefore, in general any given compression target can be met. Compression ratios of up to 4:1 do not normally result in visible degradations (though the pixel values are not identical), even for demanding examples. Only at higher compression ratios do artefacts introduced by the local compression start to become visible.
In general, any type of lossy or lossless algorithm can be used. As preferred, the lossy algorithm comprises a discrete cosine transform (DCT), followed by quantisation of the DCT coefficients; it is at this point that loss is introduced. The quantisation may be subject to adaptive rate control. The quantised coefficients, being in a two-dimensional matrix, are scanned into a run-level code string, followed by Huffman encoding.
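A minimal sketch of such a lossy pipeline is given below. This is my illustration, not the patent's implementation: the uniform quantiser step `q`, the naive DCT, and the scan order are all assumptions, and the final Huffman stage is only indicated.

```python
import math

def dct2d(block):
    """Naive orthonormal 8x8 2-D DCT-II (illustrative, not optimised)."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    for x in range(n) for y in range(n))
            cu = math.sqrt(1 / n) if u == 0 else math.sqrt(2 / n)
            cv = math.sqrt(1 / n) if v == 0 else math.sqrt(2 / n)
            out[u][v] = cu * cv * s
    return out

def quantise(coeffs, q):
    # Loss is introduced here: coefficients are rounded to multiples of q.
    return [[round(c / q) for c in row] for row in coeffs]

def run_level(scanned):
    """Turn a scanned coefficient list into (run-of-zeros, level) pairs."""
    pairs, run = [], 0
    for c in scanned:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs  # Huffman coding of these pairs would follow

# A flat block compresses to a single DC run-level pair.
block = [[128] * 8 for _ in range(8)]
coeffs = quantise(dct2d(block), 16)
flat = [coeffs[i][j] for i in range(8) for j in range(8)]
pairs = run_level(flat)
```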
As preferred, the lossless algorithm comprises predicting the current pixel from adjacent pixels and subtracting this prediction from the true value. The differences are encoded according to a Huffman encoding scheme.
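The lossless scheme can be sketched as below. A left-neighbour-only predictor is assumed here for simplicity (the patent's Figure 4 describes the actual pixel arrangement used for prediction), and the Huffman coding of the differences is omitted:

```python
def predict_residuals(row):
    """Predict each pixel from its left neighbour and keep the difference.

    The first pixel has no neighbour and is stored as-is. The residuals
    would then be Huffman encoded, which is omitted in this sketch.
    """
    residuals = [row[0]]
    for i in range(1, len(row)):
        residuals.append(row[i] - row[i - 1])
    return residuals

def reconstruct(residuals):
    """Invert the prediction: the round trip is exact, hence lossless."""
    row = [residuals[0]]
    for d in residuals[1:]:
        row.append(row[-1] + d)
    return row

pixels = [100, 102, 101, 101, 110]
res = predict_residuals(pixels)  # small values, cheap to entropy-code
assert reconstruct(res) == pixels
```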
The present invention has particular application in two specific areas:
Decoding of HDTV material. This is because the amount of memory required for HDTV is much larger than for conventional standard definition television so that the motivation to reduce the amount of memory becomes stronger.
Use of "embedded DRAM" in a VLSI circuit for video decoding. Conventionally VLSI circuits are used in conjunction with separate memory devices connected externally (to the decoder VLSI). As DRAM technology evolves the amount of RAM required for a video decoder will become just a small fraction of the memory which is provided by a single commercially viable DRAM chip. It will therefore become sensible to embed the DRAM required for decoding onto the same VLSI chip that does the decoding. However, in this case the amount of DRAM required has a large impact on the die-area of the chip and this in turn provides a strong motivation for reducing the amount of DRAM. Of course similar advantages result from the invention for any other type of embedded memory.
BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the invention will now be described with reference to the accompanying drawings, wherein: Figures 1A and 1B show schematically a standard form of MPEG encoding and decoding, and a form of local intra compression.
Figure 2 is a schematic diagram of a video decoder embodying a principle incorporated into the present invention for compression prior to storage, and decompression prior to display and forming predictions; Figure 3 is a schematic block diagram of a preferred embodiment of the compression part of the invention; and Figure 4 is a schematic view of the pixel arrangement for prediction purposes in lossless compression.
DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring to Figure 1A, the MPEG video encoder contains within it a decoder which it uses to (effectively) decode the bitstream which it transmits. It uses these decoded pictures to form predictions, and then transmits the differences between these predictions and the picture it is currently transmitting. The video decoder forms the same predictions and adds them to the decoded image.
In order for the decoder to get pictures which are correct, it is necessary for the predictions formed in both the encoder and the decoder to be identical. So in Figure 1A, the data at points "A" and "B" should be the same.
If, as shown in Figure 1B, the video decoder is modified to incorporate local compression prior to storage and subsequent local decompression, then the data at point "B" will not be the same as at point "A" (unless the compression is lossless).
There will be divergence between the pictures decoded by the decoder and those decoded by the decoder within the encoder. Since the decoded pictures are used to predict subsequent pictures, the errors can accumulate, causing progressively worse pictures to be displayed at the decoder. The success of a decoder employing local compression will depend on the degree of compression desired, the time between data being sent intra-coded (without reference to a prediction from a previous picture) and the local compression used.
Referring now to Figure 2, this shows a block diagram of a video decoder 2 operating similarly to a conventional decoder, wherein at least three frames are stored in an external memory 4. Encoded video data is fed to a packetized elementary stream (PES) parser 6 and then via a memory controller 8 to a channel buffer 10 in memory 4. The buffered data is read by a decoder unit 12 which outputs the decoded picture information for storage via a local compression unit 14. Unit 14 includes a local encoder 16 for compressing data prior to storage in a framestore 18. The framestore to be written into is either the display framestore of a three framestore decoder (as shown) or a separate decoder framestore in the case of a four framestore decoder. Unit 14 also includes decompressing units 18 for selectively decompressing data from previously decoded forward and backward prediction framestores 20, and a decompressor 22 for decompressing the display framestore data 18. Before decoding a new picture the roles of the different framestores may have to be swapped (i.e. the old display framestore becoming the new forward reference framestore and the old forward reference becoming the new backward reference).
As video data is decoded it is compressed at 16 so that the data written to the external memory occupies a smaller amount of memory than the uncompressed video data. The stored video data is read back, both to form predictions for decoding later pictures, and for display. In each case the compressed data read from the framestore is decompressed. The circuitry in the "video decode" and "video display" need not be modified from that used when no local compression is employed.
Figure 2 happens to show three local decoders; one for each of the forward and backward predictions and a further one for the video display. This is purely for illustration purposes. It may be that in a real implementation a lesser number of decoders might be time-multiplexed between the various decoding tasks. (Conversely a greater number of decoders running in parallel might be required to achieve the required performance).
1. Compression Scheme

Ideally the picture data recovered by the local decoder will be identical to the picture data which went into the local encoder. In this case the MPEG portion of the decoder (in "video decode" in Figure 1) operates precisely as it would do in the case that no local compression was used. The resulting image sequence will be identical to that produced by a conventional decoder (having the same numerical IDCT) and will comply with the MPEG standard. A local compression scheme which achieves this is referred to as "lossless".
In practice, it is not possible to use a lossless algorithm in all cases. Instead a lossy compression scheme is employed, in which the data recovered by the local decoder(s) will not be identical to the picture data which went into the local encoder. This inevitably leads to a divergence between the pictures decoded by a decoder employing lossy local compression and one which properly complies with the MPEG standard. Since the prediction data is used to decode a picture which may itself (after a further local compression/decompression) be used to predict a subsequent picture (and so on), it is possible for errors to accumulate. If this occurs, the pictures decoded by the decoder employing lossy local compression will become progressively degraded.
In all current (particularly broadcast) applications of MPEG the accumulation of errors is effectively controlled by the introduction of periodic intra pictures which are not predicted from previously decoded pictures. This is typically done twice per second. If this is done then it is found that in practice the accumulation of errors due to lossy compression is acceptable, in the sense that it leads to few visual artefacts.
In order to achieve the highest possible picture quality a scheme is proposed in accordance with the invention which adaptively selects between lossy and lossless compression.
Referring to Figure 3, there is shown a schematic block diagram of the encoder unit 16 of Figure 2. The output of the video decoder 12 is applied in parallel to a lossy encoding device 40 and a lossless encoding device 41, whose outputs are applied to a multiplexer 42 for selecting an output representing compressed data for storage in memory unit 4.
The lossy encoding device 40 consists of a lossy encoding algorithm unit 43 (doing the actual compression), a lossy local buffer 44 (holding the compressed data) and a lossy buffer fullness control algorithm unit 45. The lossy buffer fullness control algorithm unit is used to adjust the lossy encoding according to the buffer fullness. It is possible either to enforce that the amount of data needed for any compressed picture region is below a given target size, or to implement a true rate control algorithm that achieves the target size only on average.
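One simple way such buffer-fullness control could work is sketched below. The proportional step-up/step-down policy and the quantiser bounds are my assumptions, not the patent's specification:

```python
def adjust_quantiser(q, buffer_bits, target_bits, q_min=1, q_max=31):
    """Buffer-fullness rate control sketch (assumed policy):
    coarsen the quantiser when the local buffer runs over its target,
    refine it when there is headroom, within fixed bounds."""
    if buffer_bits > target_bits:
        q = min(q_max, q + 1)   # too full: compress harder, lose more
    elif buffer_bits < target_bits:
        q = max(q_min, q - 1)   # headroom: improve quality
    return q

# Over-budget regions push the quantiser step up; under-budget pull it down.
q = adjust_quantiser(8, 1200, 1000)
```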
The lossless encoding device 41 consists of a lossless encoding algorithm unit 46 doing the actual compression and a lossless local buffer 47 holding the data resulting from lossless compression.
A lossy/lossless decision unit 48 decides whether lossy or lossless encoding is to be applied for a specific picture region, i.e. which of the alternatives 40 and 41 is to be selected by the multiplexer 42 and written into the memory unit 4. The decision can be based on a comparison of the amount of data produced by lossy and lossless compression for a specific picture region against the target amount of data for any region.
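The decision logic can be sketched as follows; preferring lossless whenever it fits the per-region budget is one plausible policy, assumed here for illustration rather than prescribed by the patent:

```python
def choose_encoding(lossless_size, lossy_size, target_size):
    """Sketch of the lossy/lossless decision for one picture region.

    Lossless is preferred (it keeps pixels mathematically identical)
    whenever its output fits the per-region budget; otherwise the lossy
    result, whose size the rate control keeps near the target, is used.
    """
    if lossless_size <= target_size:
        return "lossless"
    return "lossy"

# A smooth region compresses losslessly within budget; a busy one does not.
assert choose_encoding(900, 1000, 1024) == "lossless"
assert choose_encoding(2048, 1000, 1024) == "lossy"
```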
2. The Amount of Compression

The amount of compression depends on the target size of the framestores required. This value will not be fixed by the implementation, but will be programmable. This will allow a range of possibilities. For example, a PCB might be produced which allows for different amounts of memory to be employed. A manufacturer might use a small amount of memory in products which can accept lower picture quality, while fully populating the board with memory for applications requiring higher picture quality, and accepting higher cost.
The degree of compression may be dynamically alterable (by application software), allowing a trade-off between image quality and the amount of memory available for other tasks.
It should also be pointed out that there is no need to use the same amount of storage for each of the framestores. In particular, the B-pictures may be compressed more heavily because they are not used to form predictions for subsequent pictures, but are simply displayed.
By way of example a number of numerical examples are discussed:
2.1 32 Mbit three framestore ATSC Solution

The ATSC (Advanced Television Systems Committee) standard uses a maximum image resolution of 1920 pels by 1088 lines. Allowing for both chrominance and luminance information, this requires 25 067 520 bits (23.91 Mbit) to represent an uncompressed single framestore.
The ATSC standard also requires a video channel buffer of about 8.71 Mbit (for operation at up to 19.4 Mbit/s). Since this cannot be reduced by compression, it must be allocated in full. Additionally, a practical decoder will require extra memory for other uses such as audio channel buffers, SI tables, application CPU data storage and On-Screen Graphics. Allowing a little over 5 Mbit for these other functions, something of the order of 14 Mbit will be required for non-framestore usage, leaving about 18 Mbit for the framestores.
If this is divided equally between the framestores in a three framestore decoder, there is about 6 Mbit for each one, requiring nearly 4:1 compression.
At this compression ratio it is clear that lossy compression will be used most of the time, so there will be some significant degradation of picture quality. However, this is found to be acceptable, particularly where the decoded pictures are down-sampled for display at standard resolution.
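The framestore arithmetic above can be checked with a short sketch (the 4:2:0 sampling assumption follows the block structure described earlier; the 6 Mbit per-store budget is from this subsection):

```python
def framestore_bits(width, height, bits_per_pel=8):
    """Uncompressed framestore size with 4:2:0 sampling: the two
    colour-difference components add half again to the luminance data."""
    luma = width * height * bits_per_pel
    return luma * 3 // 2

MBIT = 1 << 20  # 1 Mbit = 2**20 bits, matching the 23.91 Mbit figure

atsc = framestore_bits(1920, 1088)   # 25 067 520 bits, about 23.91 Mbit
budget_per_store = 6 * MBIT          # ~18 Mbit shared over three stores
ratio = atsc / budget_per_store      # close to the 4:1 quoted in the text
```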
2.2 64 Mbit three framestore ATSC Solution

With 64 Mbit there are approximately 50 Mbit left for the framestores, requiring a compression ratio of less than 3:2 in the case of a three framestore decoder.
At this level, lossless compression will be selected much of the time and there is much less picture quality degradation. The decoding and display of full high definition ATSC pictures gives very acceptable picture quality with 64 Mbit.
2.3 72 Mbit RDRAM three framestore ATSC Solution

RDRAM may be available with 72 Mbit in a single package (this is 64 Mbit technology level, but with 9 bit rather than 8 bit bytes).
Since both the video channel buffer data, and the locally compressed framestore information can easily use the ninth bit of the bytes (since each is effectively a bitstream) it becomes possible to utilise the greater memory capacity.
In the interests of a fair comparison it is assumed that the CPU-related information does not utilise the ninth bit (except perhaps for memory error detection and correction), so the 5.29 Mbit allowed in the sections above is scaled up by 9/8 to 5.95 Mbit. This leaves 19.11 Mbit for each framestore, requiring compression to just 80% of the non-compressed size.
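The 72 Mbit budget can be checked as follows. This sketch is illustrative only and follows the assumptions stated in the text: the channel buffer and framestore bitstreams use all nine bits of each byte, while the 5.29 Mbit of CPU data occupies only eight bits of each 9-bit byte and so takes up 9/8 of its nominal size.

```python
MBIT = 2**20

total = 72 * MBIT                # one 72 Mbit RDRAM package
channel_buffer = 8.71 * MBIT     # bitstream data uses the ninth bit directly
cpu_data = 5.29 * MBIT * 9 / 8   # 8-bit CPU data stored in 9-bit bytes
framestores = total - channel_buffer - cpu_data

print(round(cpu_data / MBIT, 2))          # ~5.95 Mbit
print(round(framestores / 3 / MBIT, 2))   # ~19.11 Mbit per framestore
print(round((framestores / 3) / (23.91 * MBIT), 2))  # ~0.8 of uncompressed size
```

The result agrees with the 19.11 Mbit per framestore and the 80% compression target stated above.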
At this level the lossless compression will be selected almost all of the time. The lossy compression is selected very rarely, for the unusual regions which do not compress with the lossless scheme and which will occasionally be encountered in real video material.
This will give excellent results which are visually indistinguishable from the results of decoding without local compression and which are often mathematically identical to the results expected by the MPEG standards.
2.4 64 Mbit four framestore ATSC Solution As in Section 2.2, "64 Mbit three framestore ATSC Solution", there are approximately 50 Mbit left for all framestores together. This amount of data now has to be shared between four framestores, leading to about 12 Mbit per framestore and requiring about 2:1 compression.
Now lossless compression will be selected less frequently than in scenario 2.2, "64 Mbit three framestore ATSC solution", but still much more often than in scenario 2.1, "32 Mbit three framestore ATSC solution". Even full high definition display gives very acceptable results, almost as good as in case 2.2, "64 Mbit three framestore ATSC solution".
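The required compression ratios of the scenarios above can be summarised in one short calculation. The figures (framestore budgets in Mbit) are taken directly from the text; the dictionary keys are illustrative labels only.

```python
uncompressed = 23.91  # Mbit per uncompressed ATSC framestore (1920x1088, 4:2:0)

scenarios = {
    "2.1 (32 Mbit, 3 framestores)": 18 / 3,  # ~6 Mbit each
    "2.2 (64 Mbit, 3 framestores)": 50 / 3,
    "2.4 (64 Mbit, 4 framestores)": 50 / 4,  # ~12.5 Mbit each
}
ratios = {name: uncompressed / budget for name, budget in scenarios.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1 compression required")
# 2.1 -> ~3.99:1 (nearly 4:1), 2.2 -> ~1.43:1 (< 3:2), 2.4 -> ~1.91:1 (~2:1)
```

This reproduces the "nearly 4:1", "less than 3:2" and "about 2:1" figures quoted in Sections 2.1, 2.2 and 2.4 respectively.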
3. Unencoded Region Size For efficient local intra compression several pixels have to be grouped together and thus are only accessible as units. Since the locally compressed data is used to form predictions it is necessary to be able to decode just a small region of the entire reference picture independently of the remainder of the image.
There is a trade-off to be made between having a relatively large region size (which gives better compression, i.e. better picture quality for a given amount of data) and a small region (to minimise the amount of data which must be read in order to make a prediction).
Clearly, all pixels to be compressed together should stem from just one macroblock to simplify the write back to external memory during MPEG decoding. If an MPEG sequence does not consist only of progressive pictures (i.e. is not a progressive sequence) the partitioning used should be identical for frame and field pictures (which may refer to each other within a sequence in any order conforming to MPEG).
Therefore, for interlaced picture material the maximum possible region size is half a macroblock, consisting of 16 pels by 8 lines of luminance (Y) along with the corresponding 8 pel by 4 line areas of the two subsampled colour difference signals (Cr, Cb). All pixels in such a 16x8 region always contain data from just one of the two interlaced fields.
In the case of a frame picture all MPEG decoded macroblocks then have to be split up into two 16x8 regions to be locally compressed, each containing only lines from one of the two fields (i.e. only even or odd lines of a frame macroblock). In a field picture each macroblock still produces two 16x8 regions, this time one above the other in one of the two fields. Therefore, the partitioning of a locally compressed picture is identical in both cases, thus enabling the use of the stored 16x8 regions for predictions in later frame or field pictures (independently of the present picture structure).
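The frame-picture case above can be sketched in a few lines. This is an illustrative model only (lists of rows stand in for pixel data): a 16x16 frame macroblock is split into two 16x8 regions, one holding the even (top-field) lines and one the odd (bottom-field) lines.

```python
def split_frame_macroblock(mb_luma):
    """Split a 16x16 frame-picture luminance macroblock into two 16x8
    regions, one per interlaced field (even lines / odd lines)."""
    top_field = [row for i, row in enumerate(mb_luma) if i % 2 == 0]
    bottom_field = [row for i, row in enumerate(mb_luma) if i % 2 == 1]
    return top_field, bottom_field

mb = [[y] * 16 for y in range(16)]  # dummy macroblock: row value = line number
top, bottom = split_frame_macroblock(mb)
print(len(top), len(bottom))        # 8 8 -> two 16x8 regions
print([r[0] for r in top])          # [0, 2, 4, 6, 8, 10, 12, 14]
```

For a field picture the two 16x8 regions are instead taken one above the other from adjacent lines, so the stored partitioning is identical in both cases.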
The 16x8 region size represents the maximum common arrangement for frame and field macroblocks. However it is possible to further divide the common 16x8 region, i.e. to use smaller sub-regions stemming from a single field. Tests show that regions consisting of only 16 pels by 4 lines of luminance and the corresponding 8 pel by 2 line areas of Cr/Cb (i.e. vertically splitting each 16x8 region into two 16x4 regions) compress almost as well as 16x8 regions but clearly help to reduce the memory bandwidth needed for forming predictions and display.
Therefore, 16x8 or preferably 16x4 unencoded region sizes can be used and will be considered in all the following sub-sections.
When decoding a progressive sequence (in which there are no interlaced frames and no field pictures at all) the scheme should preferably be modified so that each 16x8 region or 16x4 region comprises adjacent lines in the frame, rather than the alternate
lines of the frame (which constitute adjacent lines in the field) that are normally used. This situation is analogous to the case of decoding a field picture in an interlaced sequence since each decoded macroblock gives two 16x8 regions or four 16x4 regions to be locally encoded, one above the other.
In both cases the 16x8 or 16x4 regions are locally compressed independently of each other and therefore are the smallest units to be accessed for predictions. The lossy/lossless decision and the lossy rate control are also based on those 16x8 or 16x4 regions.
4. Coded Region Size The coded region size (the size of the compressed representation of a 16x8 or 16x4 region) needs to be controlled.
Clearly the total of the coded region sizes for a frame needs to be less than the amount of memory allocated for that framestore. In fact, it is actually a benefit to limit the allocation such that a smaller area than the entire framestore has a constant size. Typically, it is desirable to constrain the amount of memory allocated for one half of a field store to a fixed amount. This allows a very simple re-use scheme whereby the two halves of a field store may be reallocated (once the field has been displayed) for decoding either the upper and lower halves of another field (in the case of a field picture) or the upper half of both the top and bottom fields of another frame (in the case of a frame picture).
However, it is also desirable to constrain the amount of data which may be allocated to a given 16x8 or 16x4 region in order to limit the memory bandwidth which will be required to form predictions. If a very wide range is allowed for the size of a coded region then it may happen that the majority of macroblocks of a subsequent picture call for predictions to be made from those 16x8 or 16x4 regions which happen to have been allocated a large number of bits. This would lead to an unsustainably large memory bandwidth requirement.
4.1 Constant Coded Region Size One option is to simply insist that all of the coded regions have the same size as one another. This actually works very well in the case that the compression ratio is quite modest (for example the scenario described in Section 2.2 - 64 Mbit ATSC Solution).
The advantages of this approach are:
1. There is no need for an index table to locate the coded representation of each coded region (see Section 5 - Indexing Coded Regions). The memory that would otherwise have been used as an index table can instead be used to increase the amount of memory available to each region.
2. The lossy/lossless coding decision becomes extremely simple because the data produced by the lossless compression algorithm will either fit in the memory available for that region, or it will not.
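The second advantage can be captured in a trivial sketch. This is an illustrative model (the function name and sizes are not part of the invention): with a constant coded region size, the coder simply checks whether the lossless result fits the fixed per-region budget and otherwise falls back to the lossy result, which the rate control has already constrained to fit.

```python
def choose_coding(lossless_bits, lossy_bits, region_budget_bits):
    """Constant coded region size makes the decision trivial:
    use the lossless result iff it fits the fixed budget."""
    if lossless_bits <= region_budget_bits:
        return "lossless"
    # otherwise fall back to the lossy result, which the rate
    # control has already constrained to fit the budget
    assert lossy_bits <= region_budget_bits
    return "lossy"

print(choose_coding(900, 1000, 1024))   # lossless
print(choose_coding(1400, 1000, 1024))  # lossy
```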
4.2 Variable Coded Region Size Where a greater compression ratio is required (for example the scenario described in Section 2.1 - 32 Mbit ATSC Solution) the constant coded region size does not lead to an optimum solution.
In this case, it is necessary to be able to allow those areas of the image which code more efficiently to use a smaller than average coded region size in order to be able to allocate more than the average number of bits to those regions which do not compress well.
The principal advantage of this scheme is that it leads to better memory utilisation since no space need be wasted. (Conversely, in the constant coded region size case the tendency is for easier to code regions to use only some of the memory allocated to them, the remainder remaining empty.) It is, however, usually convenient to align the start of a coded region to a byte or word boundary (as opposed to starting on any bit) in order to reduce the amount of information stored in the index table.
Furthermore, this scheme distributes the potential loss of picture quality due to lossy compression more evenly within a picture and thus helps to avoid visible artefacts in those parts of strongly compressed pictures that are less efficient to compress.
However, there are several practical problems to be overcome:
1. The memory bandwidth required to make predictions needs to be bounded, requiring a rate control algorithm which will limit the maximum number of bits it will allocate to a region.
2. It must be possible to locate the start of each coded region in memory in order that they can be decoded independently of one another.
5. Indexing Coded Regions Whenever a variable coded region size is used some form of index table needs to be maintained in order to locate the start of each coded region. (Only in the case of a fixed coded region size can the start of each region be located using arithmetic alone.)
One important parameter of any indexing scheme is its granularity. Even though bit-alignment would lead to optimum memory utilisation it would result in prohibitively large indexing tables and is not always feasible for a given memory architecture. Therefore, alignment to byte boundaries, for example, can be used as a reasonable compromise between memory utilisation and index table size. It leads to an average waste of 3.5 bits ((1+2+3+4+5+6+7)/8) per 16x8 or 16x4 region. For an ATSC maximum size picture (consisting of 16320 16x8 regions or 32640 16x4 regions) this results in an average of 57120 (16x8) or 114240 (16x4) wasted bits and is therefore negligible even for only 6 Mbit per framestore (as in Section 2.1 - 32 Mbit ATSC Solution).
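The waste figures above follow directly from the region counts and the 3.5 bit average; this illustrative check assumes the ATSC maximum picture size of 1920x1088.

```python
def average_waste_bits(alignment_bits=8):
    # each region wastes 0..alignment-1 padding bits, all equally likely
    return sum(range(alignment_bits)) / alignment_bits

regions_16x8 = (1920 // 16) * (1088 // 8)  # 16320 regions per picture
regions_16x4 = (1920 // 16) * (1088 // 4)  # 32640 regions per picture

print(average_waste_bits())                      # 3.5 bits per region
print(int(regions_16x8 * average_waste_bits()))  # 57120 wasted bits
print(int(regions_16x4 * average_waste_bits()))  # 114240 wasted bits
```

Both totals are well under 0.02 Mbit and hence negligible even against a 6 Mbit framestore.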
Even in the case of byte-oriented addressing within a framestore the index tables reach considerable sizes: e.g. for framestores of 16 Mbit (see Section 2.2 - 64 Mbit ATSC Solution) 21 bits (log2(16x1024x1024/8)) are needed to index one single 16x8 region within a framestore. This results in an index table size of 342 720 bits per frame in the case of 16x8 regions.
To further reduce the size of the index tables, combinations of arithmetic and modified, smaller tables can be used. One potential approach, for example, is to store only the difference between the position in the case of a fixed coded region size and the actual position for the variable coded region size.
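That difference-based approach can be sketched as follows. This is an illustrative model under assumed names and sizes, not the claimed implementation: each index entry stores only the signed offset between a region's actual start and the start it would have had with constant-size regions, and small offsets need far fewer bits than absolute addresses.

```python
def build_diff_index(region_sizes_bytes, average_size_bytes):
    """Index entries hold only the offset (actual start minus the start
    implied by a constant region size), not absolute addresses."""
    diffs, pos = [], 0
    for i, size in enumerate(region_sizes_bytes):
        diffs.append(pos - i * average_size_bytes)
        pos += size
    return diffs

sizes = [100, 140, 60, 120]          # bytes per coded region
print(build_diff_index(sizes, 105))  # [0, -5, 30, -15]
```

The decoder recovers each absolute start as `i * average_size + diff[i]`, combining arithmetic with a much smaller table.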
6. The Lossy Compression Scheme As pointed out in Section 1 - Compression Scheme, a lossy compression scheme always has to be used if lossless compression cannot achieve the target compression ratio (either fixed for any single region in the case of constant coded region size, or only on average for variable coded region size).
In general, any lossy compression scheme can be used as long as it offers an opportunity to control the amount of data per unencoded region (and thereby the resulting error due to lossy compression) as pointed out in Section 1 - Compression Scheme. In addition, the selected compression scheme has to perform well for the chosen unencoded region size (e.g. 16x8 luminance pixels) and must not refer to information outside a single compression region.
All schemes tried in our experiments were transform based, as such schemes have proven to perform particularly well for intra coding small image regions and are easy to implement. In such schemes the amount of data can always be controlled by appropriate, frequency selective quantisation of the resulting transform coefficients.
To improve coding efficiency (at the cost of higher implementation complexity) the more advanced schemes for still image compression under discussion in JPEG-2000 (such as wavelets or TCQ) could be used instead.
6.1 Partitioning of Unencoded Regions Each region to be locally compressed contains (non-predictive) luminance and chrominance pixels of MPEG decoded pictures. For example, the unencoded regions used throughout our tests consist of either 16x8 or 16x4 luminance pixels (Y) and the spatially corresponding 8x4 or 8x2 pixels of the two subsampled chrominance difference signals (Cr, Cb). Those three components are always compressed separately, leading to at least three partitions (immediately following each other in compressed form).
In principle, each of those three partitions can be further split up. For the chrominance components that option never led to improvements, and 8x4 or 8x2 transforms seem to be optimal. For the luminance component a further partitioning into two horizontally adjacent 8x8 or 8x4 regions is an interesting alternative to using one single 16x8 or 16x4 region. The use of 8 pixel wide regions not only reduces the implementation complexity of the transform but also takes into account that a similar horizontal partitioning has been used during the original MPEG encoding (blocks of the MPEG syntax). In fact it turns out that 8x8 or 8x4 transforms (especially DCT) even lead to slightly better results than 16x8 or 16x4 transforms (using optimised other parameters).
Therefore, it is advisable to use 8x8 transforms for the luminance component and 8x4 transforms for the chrominance components in the case of 16x8 unencoded regions, and 8x4 transforms for luminance and 8x2 transforms for chrominance in the case of 16x4 unencoded regions.
6.2 Transform The most obvious idea in an MPEG related context is to also use discrete cosine transforms in the local compression (in the case of 16x8 regions, 8x8 DCT for luminance and 8x4 DCT for chrominance; in the case of 16x4 regions, 8x4 DCT for luminance and 8x2 DCT for chrominance). Then settings for other parameters can easily be adopted from MPEG.
Nevertheless, any other transform is possible if the other parameters are optimised and not simply taken over from MPEG. We tried Haar transforms (HT; as a simple wavelet) and Walsh Hadamard transforms (WHT).
Even for optimised parameters it turns out that DCT performs better than HT and much better than WHT. For scenarios needing considerable local compression (e.g. Section 2.1 - 32 Mbit ATSC Solution) only DCT leads to visually pleasing results (HT being about 1 dB worse), even for complicated examples (with error propagation over many generations due to MPEG predictions). Only for smaller compression ratios (e.g. Section 2.2 - 64 Mbit ATSC Solution) or less complicated examples do simpler to implement transforms seem possible.
6.3 Quantisation As mentioned above, quantisation of transform coefficients is used to control the amount of data for each region. Similarly to MPEG, the overall quantisation is split up into two parts. The first part is frequency selective but fixed and corresponds to the quantisation matrix of MPEG. The second part is actually used to adapt the quantisation of a region, thus enabling the resulting amount of data to be adjusted. It is similar to the quantiser scale of MPEG.
In general, the quantisation matrix to be used depends on the transform and the size of the region to be transformed. In the case of 8x8 DCT, it is reasonable to directly take over the intra quantisation matrix of the MPEG standard. For all other DCT sizes the simplest approach is to use the MPEG intra quantisation matrix as a starting point. For example, for the 8x4 DCT a suitable 8x4 matrix can be derived by simply averaging every two lines of the original MPEG matrix and scaling the resulting 8x4 factors to preserve the relative quality weighting between luminance and chrominance as for MPEG.

As an alternative to using matrices directly derived from MPEG, a scheme to compute a matrix by minimising the mean square error between quantised and original coefficients for a target amount of data on a set of training sequences was tried. It results in different quantisation matrices but only leads to negligible quality improvements when used in a complete DCT scheme. In general it seems there is no sharp optimum for the quantisation matrices.
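The line-averaging derivation of an 8x4 matrix can be sketched as follows. A synthetic matrix is used for demonstration (in practice the MPEG default intra quantisation matrix would be the starting point), and the additional scaling step mentioned above is not shown.

```python
def derive_8x4_matrix(m8x8):
    """Average each pair of vertically adjacent lines of an 8x8 intra
    quantisation matrix to obtain an 8x4 matrix for the 8x4 DCT."""
    return [[(m8x8[2 * r][c] + m8x8[2 * r + 1][c]) / 2 for c in range(8)]
            for r in range(4)]

# synthetic demonstration matrix: entry = 10*row + column
demo = [[10 * r + c for c in range(8)] for r in range(8)]
m8x4 = derive_8x4_matrix(demo)
print(len(m8x4), len(m8x4[0]))  # 4 8
print(m8x4[0])                  # [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
```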
For all other transforms this approach of minimising the mean square error between quantised and original coefficients was used to derive suitable quantisation matrices.
To be able to control the resulting amount of data for a region of the local compression scheme, the corresponding quantisation matrix is scaled by a quantiser scale. The same scale is used for all partitions of a locally compressed region. Its fixed length code is stored together with the entropy coded data of the quantised coefficients, making independent decoding of any region possible. If the value range covered by the quantiser scale is not sufficient to achieve a given target data amount, high frequency coefficients can be removed in addition to quantisation.
The way the quantiser scale is determined for every locally compressed region depends on whether constant or variable coded region sizes are aimed at. For constant region sizes the basic idea is simply to increase the quantiser scale until the resulting amount of data after entropy coding is below the target amount per region. This could be realised as a bank of parallel quantisers and entropy coders (together with a lossless encoder as pointed out in Section 4.1 - Constant Coded Region Size).
For variable coded sizes each quantiser scale is determined by a rate control based on a local buffer fullness, i.e. an accumulated deviation from the average region size available. As a simple approach for the buffer control a modified version of the algorithm described in the MPEG test model 5 (TM5) has been used. For improved results the complexity of each 16x8 or 16x4 region can also be taken into account.
All coefficients are quantised using linear quantisers. Decision levels have been slightly shifted towards zero (17/32 instead of 1/2). This helps to reduce drift due to multiple local compression and decompression during decoding of predictive MPEG pictures. Also a check has been added to avoid increasing the value range of a coefficient during quantisation with a small overall factor (i.e. multiplication instead of division).
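The shifted decision levels can be illustrated as follows. This sketch assumes one plausible reading of the text: the threshold between reconstruction levels n and n+1 sits at n + 17/32 rather than n + 1/2, so borderline coefficients quantise towards zero.

```python
def quantise(coeff, step):
    """Linear quantiser with decision levels at n + 17/32 (instead of
    n + 1/2), biasing borderline values slightly towards zero."""
    x = abs(coeff) / step
    level = int(x + 15 / 32)  # crosses to level n+1 only at x >= n + 17/32
    return level if coeff >= 0 else -level

print(quantise(16, 32))   # x = 16/32 = 0.50 -> 0 (round-half-up would give 1)
print(quantise(17, 32))   # x = 17/32       -> 1
print(quantise(-17, 32))  # -1
```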
6.4 Scan The quantised coefficients have to be re-ordered from the two-dimensional transform arrangement into a one-dimensional arrangement suitable for efficient entropy coding. Entropy coding itself is realised similarly to MPEG or JPEG, i.e. by coding non-zero coefficients together with the number of leading zero coefficients (run level coding) and terminating each coefficient sequence with an end of block symbol. Therefore, a suitable scan of coefficients has to be such that frequently non-zero coefficients are scanned first and the point from which only zero coefficients occur is reached early in the scan.
Apart from the special case of a progressive sequence (see Section 3 - Unencoded Region Size) all lines of a region always stem from a single field and thus show stronger correlation in the horizontal than in the vertical direction. Then, the alternate scan of MPEG-2 (which was specifically designed for such situations) performs better than the zig-zag scan and is used. Only for progressive sequences is the zig-zag scan superior. Based on the optimum scan for 8x8 DCT, corresponding scans for 16x8, 16x4, 8x4 and 8x2 DCT can be derived. For example, the scan for the 8x4 DCT can be obtained by simply omitting the scan positions from the odd lines of the 8x8 arrangement.
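The derivation of a smaller scan from a larger one can be illustrated with the zig-zag scan (the progressive-sequence case; the MPEG-2 alternate scan used for interlaced material has a different order but the same derivation applies). The following is an illustrative sketch, not the claimed hardware scan tables.

```python
def zigzag_positions(rows, cols):
    """Zig-zag scan order: positions sorted along anti-diagonals,
    alternating the traversal direction on each diagonal."""
    pos = [(r, c) for r in range(rows) for c in range(cols)]
    return sorted(pos, key=lambda rc: (rc[0] + rc[1],
                                       rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

scan8x8 = zigzag_positions(8, 8)
# 8x4 scan: keep only positions on even lines of the 8x8 arrangement
scan8x4 = [(r // 2, c) for (r, c) in scan8x8 if r % 2 == 0]

print(scan8x8[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(len(scan8x4))  # 32 -> a complete scan of the 8x4 block
```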
For WHT exactly the same scans as for DCT have been used, as the coefficients have a similar physical meaning. As the two-dimensional HT corresponds to a successive sub-band decomposition it is best to use a scan order that contains the low-frequency sub-bands first. Within a single sub-band the order of the coefficients is of no significance.
6.5 Entropy Coding As pointed out in Section 6.4 - Scan, entropy coding is based on common run level encoding and terminating each compressed block with an end of block variable length code (VLC).
Exactly the same code tables are used for the local compression as for the MPEG decoding itself. For 8x8 DCT, it is known from MPEG-2 intra encoding that VLC table 1 generally achieves higher compression than VLC table 0. Therefore, table 1 has been taken over for the local compression scheme in a first approach. For simplification, and as no fundamentally different probabilities are to be expected, the same VLC table is also used for 8x4 chrominance regions.
To be able to independently access single 16x8 regions the differential DC encoding of MPEG has to be given up. Instead, in our approach an absolute, 9 bit fixed length code is used for each DC coefficient. Differential encoding of DC coefficients could only be used for the DC of the second 8x8 or 8x4 luminance block of a 16x8 or 16x4 region (using the first absolute DC as predictor). Then, as in the case of the statistically unevenly distributed chrominance DCs, variable length coding would slightly reduce the amount of data for the DCs.
To further improve entropy coding, optimised tables were tried. It turns out that for scenarios with substantial compression (e.g. Section 2.1 - 32 Mbit ATSC Solution) the MPEG table is almost optimal and only minor improvements (below 3%) are possible. Only for very modest compression (e.g. Section 2.2 - 64 Mbit ATSC Solution) do optimised tables lead to clear improvements of up to 20% (as MPEG itself has not been optimised for such low compression ratios). Because it seems impossible in terms of hardware complexity to automatically adapt the VLC table used (e.g. by collecting statistics from a previous picture), a reasonable compromise could be to implement two fixed tables, one aiming at high compression ratios (e.g. MPEG VLC table 1) and another for low compression ratios. Even the use of only one table would not have severe consequences as, for low compression ratios, frequent use of lossless compression is to be expected.
7. The Lossless Compression Scheme Lossless compression should be used whenever it allows the target data amount set for a coded region to be achieved. Then picture data is not changed by local compression and decompression, i.e. no drift occurs at all. Especially in scenarios with modest overall compression (e.g. Section 2.2 - 64 Mbit ATSC Solution) a frequent use of lossless compression is likely and leads to optimum picture quality.
Basically, any scheme used for lossless still image compression can be used here as well. However, the scheme for local compression faces the constraints that it has to be applied to relatively small image regions (16x8 or 16x4 luminance and 8x4 or 8x2 chrominance pixels) and may not refer to any information outside of that area. The three colour components (Y, Cr, Cb) are compressed separately and not split any further into smaller partitions.
The scheme used in our experiments is based on a combination of intra field (or in case of progressive sequences intra frame) prediction and subsequent entropy coding of the prediction error. Because of the small image regions used (leading to many exceptions in prediction at the borders) and to simplify implementation no context modelling is used.
7.1 Prediction To enable efficient entropy coding every pixel of a losslessly compressed region is predicted by pixels (of the same component) preceding it in scan order. For that prediction a predictor similar to JPEG-LS is used whenever the pixels needed are available from within the present region.
The general pixel arrangement for this predictor is shown in Figure 4. To predict the current pixel (x) up to three direct neighbours (a, b, c) can be used. If all three neighbours are available the predictor also takes edge information into account. At the upper and left border of a region not all those three neighbours are available. The first pixel in the first row is coded as an absolute 8 bit value. The other pixels in the first row use only a as a predictor. The pixels in the first column are predicted by the corresponding value b.
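The edge-aware prediction from neighbours a (left), b (above) and c (above-left) can be sketched with the JPEG-LS median edge detector, which the predictor above is said to resemble. This is an illustrative sketch of that known predictor, not the exact claimed circuit.

```python
def predict(a, b, c):
    """JPEG-LS-style median edge detector: a = left, b = above,
    c = above-left neighbour of the current pixel x."""
    if c >= max(a, b):
        return min(a, b)   # edge detected: pick the lower neighbour
    if c <= min(a, b):
        return max(a, b)   # edge detected: pick the higher neighbour
    return a + b - c       # smooth region: planar prediction

print(predict(100, 100, 100))  # 100 (flat area)
print(predict(50, 200, 210))   # 50  (edge above the current pixel)
print(predict(10, 20, 12))     # 18  (planar: 10 + 20 - 12)
```

At the region borders the fallbacks described above apply: an absolute 8 bit value for the first pixel, predictor a alone for the rest of the first row, and predictor b for the first column.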
The scheme implemented leads to slightly smaller code sizes than always using only pixel a as a simple predictor (apart from the first column).
7.2 Entropy Coding Whenever it is possible to form a prediction (i.e. apart from the first pixel of any component) only the prediction error between the current pixel and its prediction is stored. Because the statistical distribution of the prediction error is centred around zero and large errors are very unlikely, entropy coding leads to an overall data reduction.
In our experiments a Huffman coding scheme as in the old JPEG lossless approach was used. It uses a combination of a variable length code to indicate the magnitude range of the prediction error and a fixed length code to indicate its exact value within the range. In general, it was found that it does not matter much on what material the VLC table was trained.
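The magnitude-range-plus-value construction can be sketched as follows, using the conventions of the original lossless JPEG Huffman mode (the size category counts the bits of the magnitude; negative values are offset before being sent as a fixed-length code). The function names are illustrative.

```python
def magnitude_category(err):
    """JPEG-style size category: number of bits needed for |err|.
    The category itself is sent with a variable length (Huffman) code."""
    return abs(err).bit_length()

def value_bits(err, category):
    """Exact value within the category, sent as a fixed-length code of
    `category` bits; JPEG sends negative values as err + 2^category - 1."""
    return err if err >= 0 else err + (1 << category) - 1

e = -5
cat = magnitude_category(e)
print(cat)                 # 3 (|-5| needs 3 bits)
print(value_bits(e, cat))  # 2 (-5 + 7)
```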
8. Data Organisation in Memory After describing all the coding principles used in local compression, this section summarises the structure of a coded region in memory.
Every coded region starts with a flag signalling whether the following data resulted from lossy or lossless compression. In our initial experiments this decision was always based on a comparison of the amount of data of lossy and lossless compression. Better results can be obtained if the decision is an integral part of the buffer control and also takes the average compressed region size into account.
In the case of lossy compression the quantiser scale needed for inverse quantisation during decompression follows next. Then the coefficient data of the two partitions of the luminance component and the two chrominance components of a region are put into memory. For each component at least the first DC coefficient is coded as an absolute value while the other coefficients are stored as run level events terminated by an end of block symbol.
For lossless coding the flag signalling the compression type is immediately followed by data for the three colour components. Each component starts with an absolute value followed by differential variable length codes.
If variable coded block sizes are used, an index table as described in Section 5 - Indexing Coded Regions is needed in addition to the actual data.
9. Advantages of the Invention There are several advantages in the entire scheme:
1. Switching between lossy and lossless compression on a region by region basis. The advantage of doing so is that the algorithm automatically adapts to the amount of data available for a framestore and the complexity of the picture material. In the case of modest local compression and/or easy to compress picture data, lossy compression can be seen as a fallback mode of the lossless scheme used most of the time. For more substantial local compression the lossy mode automatically becomes the standard mode. Nevertheless, lossless compression is used whenever possible to achieve maximum picture quality.
2. Use of regions made up of either 16x8 luminance and 8x4 chrominance pixels or 16x4 luminance and 8x2 chrominance pixels (by further vertical sub-division) from a single field in the local compression scheme whenever a sequence does not contain only progressive pictures (a progressive sequence). The advantage of such an arrangement is that locally compressed regions of any anchor frame or field can be accessed similarly easily for predictions in frame and field pictures. That is a clear advantage over using regions corresponding to complete MPEG macroblocks (made up of 16x16 luminance and 8x8 chrominance pixels) of either a field or a frame, making
predictions for the opposite picture structure more complicated, as up to 100% of redundant data has to be fetched. The 16x8 luminance and 8x4 chrominance arrangement chosen is the maximum common size of a pure field or frame arrangement.
3. If compressed regions are of variable size the size allocated is constrained to multiples of a certain number of bits. An indexing scheme of appropriate granularity (e.g. byte alignment) has the advantage of still being efficient enough while at the same time reducing the size of the indexing tables needed. It can also contribute to making memory accesses more efficient and thus reduce memory bandwidth.
4. Partitioning of the 16x8 or 16x4 luminance portion of the regions for lossy local compression into two 8 pixel wide sub-partitions. That partitioning has the advantage that it preserves the horizontal block arrangement of MPEG. Therefore, lossy local compression is not further compromised by blocking artefacts resulting from the original MPEG encoding. This advantage outweighs the loss in coding efficiency due to the smaller region size.
5. It is possible to build hardware that handles both constant and variable coded region sizes. This allows a single design to be used in different applications.
The entire scheme has the advantage that it enables predictively coded video to be decoded with less memory than would be needed for the classical approach of MPEG decoding.

Claims (20)

Claims
1. A video decoder for decoding encoded video pictures, the decoder including storage means for storing decoded video pictures, and compression means for selectively locally compressing the decoded video pictures prior to storage, and decompression means for decompressing the stored compressed pictures prior to display or use in decoding, wherein the local video compression means includes means for selecting either a first intra compression algorithm, or a second intra compression algorithm for compressing the video pictures prior to storage.
2. A video decoder according to claim 1, wherein the first algorithm is a lossless compression algorithm, and the second algorithm is a lossy compression algorithm.
3. A video decoder according to claim 2, wherein video picture information is coupled in parallel to a lossy algorithm means and a lossless algorithm means, the output of the lossy algorithm means being applied to a first buffer means and the output of the lossless algorithm means being applied to a second buffer means, the outputs of the first and second buffer means being applied to a multiplexer means, and a lossy/lossless decision switch means for selecting the output of either the first or second buffer means.
4. A video decoder according to claim 3, wherein the switch means is responsive to the amount of data within the first and second buffer means and makes a decision to choose one or other of the buffer means in dependence on the amount of data therein and/or the average region size.
5. A video decoder according to claim 3 or 4, wherein the switch means is presettable so as to select either a lossy or lossless algorithm.
26
6. A video decoder according to claim 3, 4 or 5, including a lossy buffer fullness control algorithm means for adjusting the compression rate of the lossy algorithm means, preferably dependent upon the amount of data in said first local buffer means.
7. A decoder according to any preceding claim, wherein the compression means is arranged to operate on regions of luminance data of 16x8 pixels, or 16x4 pixels.
8. A video decoder according to claim 7, wherein the decoder is arranged to partition each region of 16x8 luminance pixels into two blocks of 8x8 luminance pixels, or each region of 16x4 luminance pixels into two blocks of 8x4 luminance pixels, for lossy compression.
9. A video decoder according to any of claims 1 to 5, wherein each coded region is compressed to the same size.
10. A video decoder according to any of claims 1 to 8, wherein the size of the compression regions is variable and the storage means includes index table means for locating the start of each compressed region.
11. A video decoder according to claim 10, wherein the size of each compressed region is a whole number of bytes, or other fixed multiple of bits.
12. A video decoder according to any preceding claim, wherein the lossy algorithm means comprises a transform scheme, for example discrete cosine transform.
13. A video decoder according to claim 12, wherein the transform coefficients are quantised using either a quantisation matrix directly derived from the MPEG intra quantisation matrix, or a quantisation matrix derived by off-line optimisation on a set of training sequences.
14. A video decoder according to claim 13, wherein the quantisation matrix is scaled by a quantiser scale.
15. A video decoder according to claim 14, wherein the quantiser scale is determined by a buffer fullness control based on the amount of data in the first local buffer.
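Claims 14 and 15 tie the quantiser scale to the fullness of the first local buffer: the fuller the buffer, the coarser the quantisation, which increases the lossy compression rate. The exact mapping is not specified in these claims; the sketch below assumes a simple linear mapping between a minimum and maximum scale, purely for illustration.

```python
def quantiser_scale(buffer_bytes, buffer_capacity,
                    min_scale=1, max_scale=31):
    """Choose the quantiser scale of claims 14-15 from buffer
    fullness.  The linear mapping here is an illustrative
    assumption; the claims only require that the scale be
    determined by a buffer fullness control."""
    fullness = buffer_bytes / buffer_capacity
    return round(min_scale + fullness * (max_scale - min_scale))
```

The 1 to 31 range mirrors the conventional MPEG quantiser_scale range, but any monotone control law would satisfy the claim as worded.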
16. A video decoder according to any preceding claim, including means for encoding and decoding entropy encoded coefficients.
17. A video decoder according to any preceding claim, wherein the lossless algorithm means employs intra field or intra frame prediction and subsequent entropy coding of the prediction error.
18. A method of decoding encoded video pictures, comprising storing decoded video pictures for decoding other video pictures, compressing the decoded video pictures prior to storage, and decompressing the stored video pictures for display or for use in decoding, wherein the decoded video pictures are selectively compressed according to a first intra compression algorithm or a second intra compression algorithm.
19. A method according to claim 18, wherein said first compression algorithm is a lossless algorithm and said second compression algorithm is a lossy algorithm.
20. A method according to claim 18 or 19, wherein the picture data is compressed in blocks of 16x8 pixels or 16x4 pixels for luminance data, together with the spatially corresponding pixels of the chrominance data.
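The method of claims 18 to 20 amounts to a frame store that compresses decoded reference pictures before writing them to memory and decompresses them again when they are needed for display or for predicting other pictures, which is what reduces the decoder's RAM requirement. A minimal Python sketch follows; the class name is illustrative, and zlib stands in for the patent's intra compression algorithms purely to make the example self-contained.

```python
import zlib

class ReducedMemoryFrameStore:
    """Sketch of the method of claims 18-20: decoded pictures are
    compressed (here with zlib, as a stand-in for the claimed
    intra algorithms) before storage, and decompressed again when
    fetched for display or for decoding other pictures."""

    def __init__(self):
        self._frames = {}

    def store(self, frame_id, pixels):
        # Compress prior to storage (claim 18).
        self._frames[frame_id] = zlib.compress(bytes(pixels))

    def fetch(self, frame_id):
        # Decompress for display or for use in decoding.
        return list(zlib.decompress(self._frames[frame_id]))
```

For typical decoded pictures the compressed footprint is substantially smaller than the raw pixel buffer, which is the memory saving the patent targets.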
GB9810769A 1998-05-19 1998-05-19 Method and apparatus for decoding video data Expired - Fee Related GB2339989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB9810769A GB2339989B (en) 1998-05-19 1998-05-19 Method and apparatus for decoding video data


Publications (3)

Publication Number Publication Date
GB9810769D0 GB9810769D0 (en) 1998-07-15
GB2339989A true GB2339989A (en) 2000-02-09
GB2339989B GB2339989B (en) 2002-11-27

Family ID=10832340

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9810769A Expired - Fee Related GB2339989B (en) 1998-05-19 1998-05-19 Method and apparatus for decoding video data

Country Status (1)

Country Link
GB (1) GB2339989B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0613102A1 (en) * 1993-02-23 1994-08-31 Adobe Systems Inc. Method and apparatus for saving printer memory
GB2285720A (en) * 1994-01-07 1995-07-19 Quantel Ltd A video processing apparatus and method
EP0691784A2 (en) * 1994-07-06 1996-01-10 Agfa-Gevaert N.V. Lossy and lossless compression in raster image processor
US5553160A (en) * 1994-09-01 1996-09-03 Intel Corporation Method and apparatus for dynamically selecting an image compression process based on image size and color resolution
EP0778709A1 (en) * 1995-12-04 1997-06-11 STMicroelectronics S.r.l. MPEG-2 decoding with a reduced RAM requisite by ADPCM recompression before storing MPEG decompressed data
GB2310101A (en) * 1996-02-09 1997-08-13 Ibm Decoding a digital video signal

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US5091782A (en) * 1990-04-09 1992-02-25 General Instrument Corporation Apparatus and method for adaptively compressing successive blocks of digital video
WO1995033336A1 (en) * 1994-05-26 1995-12-07 Hughes Aircraft Company High resolution digital screen recorder and method


Cited By (26)

Publication number Priority date Publication date Assignee Title
WO2002011454A1 (en) * 2000-07-28 2002-02-07 Microsoft Corporation System and method for compressing video data
US7219364B2 (en) * 2000-11-22 2007-05-15 International Business Machines Corporation System and method for selectable semantic codec pairs for very low data-rate video transmission
US7551787B2 (en) 2003-01-28 2009-06-23 International Business Machines Corporation Adaptive compression quality
US8090207B2 (en) 2003-01-28 2012-01-03 International Business Machines Corporation Adaptive compression quality
US7738714B2 (en) 2005-09-16 2010-06-15 Industry-Academia Coorperation Group Of Sejong University Method of and apparatus for lossless video encoding and decoding
WO2008084297A1 (en) * 2006-12-29 2008-07-17 Nokia Corporation Method and system for image pre-processing
WO2008122536A3 (en) * 2007-04-04 2009-01-08 Ericsson Telefon Ab L M Frame buffer compression and decompression method for graphics rendering
US8031937B2 (en) 2007-04-04 2011-10-04 Telefonaktiebolaget Lm Ericsson (Publ) Frame buffer compression and decompression method for graphics rendering
WO2008143732A1 (en) 2007-05-17 2008-11-27 Lsi Corporation Video motion menu generation in a low memory environment
EP2171560A1 (en) * 2007-05-17 2010-04-07 LSI Corporation Video motion menu generation in a low memory environment
EP2171560A4 (en) * 2007-05-17 2011-06-15 Lsi Corp Video motion menu generation in a low memory environment
US8340196B2 (en) 2007-05-17 2012-12-25 Lsi Corporation Video motion menu generation in a low memory environment
US8401071B2 (en) 2007-12-19 2013-03-19 Sony Corporation Virtually lossless video data compression
RU2504107C2 (en) * 2007-12-19 2014-01-10 Сони Корпорейшн Compressing video data without visible losses
EP3273690A4 (en) * 2015-06-01 2018-08-15 Samsung Electronics Co., Ltd. Method, apparatus, and computer-readable recording medium for efficiently performing embedded compression of data
US10567762B2 (en) 2015-06-01 2020-02-18 Samsung Electronics Co., Ltd. Method, apparatus, and computer-readable recording medium for efficiently performing embedded compression of data
GB2550965B (en) * 2016-06-03 2019-11-20 Advanced Risc Mach Ltd Encoding and decoding arrays of data elements
GB2550965A (en) * 2016-06-03 2017-12-06 Advanced Risc Mach Ltd Encoding and decoding arrays of data elements
US10395394B2 (en) 2016-06-03 2019-08-27 Arm Limited Encoding and decoding arrays of data elements
WO2018194815A3 (en) * 2017-04-18 2019-02-14 Qualcomm Incorporated System and method for intelligent data/frame compression in a system on a chip
US10484685B2 (en) 2017-04-18 2019-11-19 Qualcomm Incorporated System and method for intelligent data/frame compression in a system on a chip
US10609418B2 (en) 2017-04-18 2020-03-31 Qualcomm Incorporated System and method for intelligent data/frame compression in a system on a chip
US20200007156A1 (en) * 2018-06-29 2020-01-02 Imagination Technologies Limited Guaranteed Data Compression
US10868565B2 (en) * 2018-06-29 2020-12-15 Imagination Technologies Limited Guaranteed data compression
US11509330B2 (en) 2018-06-29 2022-11-22 Imagination Technologies Limited Guaranteed data compression
US11855662B2 (en) 2018-06-29 2023-12-26 Imagination Technologies Limited Guaranteed data compression using alternative lossless and lossy compression techniques


Similar Documents

Publication Publication Date Title
US10666948B2 (en) Method, apparatus and system for encoding and decoding video data
US10009616B2 (en) Image encoding device, image decoding device, image encoding method, and image decoding method
KR100484333B1 (en) Memory Management for Image Signal Processors
US8923406B2 (en) Video encoding and decoding using transforms
US10812829B2 (en) 2D block image encoding
KR101835316B1 (en) Method and apparatus for processing video
US5969768A (en) Methods and apparatus for re-using decoder circuitry
US5635985A (en) Low cost joint HD/SD television decoder methods and apparatus
EP2940998B1 (en) Bounded rate near-lossless and lossless image compression
US20040136457A1 (en) Method and system for supercompression of compressed digital video
EP2806639A1 (en) Video image decoding device, video image and coding device, video image decoding method and video image coding method
EP3661213A1 (en) Moving image encoding device, moving image decoding device, moving image coding method, and moving image decoding method
EP2595379A1 (en) Video encoder, video decoder, video encoding method, video decoding method, and program
EP2904806B1 (en) 2d block image encoding
US6229852B1 (en) Reduced-memory video decoder for compressed high-definition video data
US5920359A (en) Video encoding method, system and computer program product for optimizing center of picture quality
US20150256827A1 (en) Video encoding device, video decoding device, video encoding method, and video decoding method
GB2339989A (en) Reduced memory video decoder stores locally compressed decoded pictures
US6999511B1 (en) Dynamically switching quant matrix tables within an MPEG-2 encoder
EP2196031B1 (en) Method for alternating entropy coding
EP1768416A1 (en) Frequency selective video compression and quantization

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20150519