WO2005022917A1 - Apparatus and method for coding a group of successive pictures, and apparatus and method for decoding a coded picture signal - Google Patents

Apparatus and method for coding a group of successive pictures, and apparatus and method for decoding a coded picture signal

Info

Publication number
WO2005022917A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
pass
level
pictures
low
Prior art date
Application number
PCT/EP2004/009053
Other languages
French (fr)
Inventor
Detlev Marpe
Heiko Schwarz
Thomas Wiegand
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date
Filing date
Publication date
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Publication of WO2005022917A1 publication Critical patent/WO2005022917A1/en


Classifications

    • H ELECTRICITY · H04 ELECTRIC COMMUNICATION TECHNIQUE · H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals, in particular:
    • H04N19/61 using transform coding in combination with predictive coding
    • H04N19/615 using transform coding in combination with predictive coding using motion compensated temporal filtering [MCTF]
    • H04N19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/139 Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
    • H04N19/176 characterised by the coding unit being an image region, the region being a block, e.g. a macroblock
    • H04N19/19 characterised by the adaptation method, using optimisation based on Lagrange multipliers
    • H04N19/34 Scalability techniques involving progressive bit-plane based encoding of the enhancement layer, e.g. fine granular scalability [FGS]
    • H04N19/36 Scalability techniques involving formatting the layers as a function of picture distortion after decoding, e.g. signal-to-noise [SNR] scalability
    • H04N19/587 using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H04N19/63 using transform coding using sub-band based transform, e.g. wavelets
    • H04N19/70 characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N19/13 Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]

Definitions

  • The present invention relates to video coding/decoding algorithms and, in particular, to video coding/decoding algorithms in line with the ISO/IEC 14496-10 international standard, this standard also being referred to as H.264/AVC.
  • The H.264/AVC standard is the result of a video standardization project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG).
  • The main goals of this standardization project are to provide a clear video coding concept exhibiting very good compression performance and, at the same time, to create a network-friendly video representation covering both applications having a "conversation character", such as video telephony, and applications having no conversation character (storing, broadcasting, stream transmission).
  • Fig. 9 shows a complete architecture of a video encoder generally consisting of two different stages.
  • The first stage, which principally operates in relation to video, generates output data which will eventually be subjected to entropy coding by a second stage, designated by 80 in Fig. 9.
  • The data comprises data 81a, quantized transform coefficients 81b as well as motion data 81c, these data 81a, 81b, 81c being supplied to the entropy encoder 80 so as to generate a coded video signal at the output of the entropy encoder 80.
  • the input video signal is partitioned, e.g. split up, into macroblocks, each macroblock comprising 16 x 16 pixels.
  • Then the association of the macroblocks with slice groups and slices is selected, whereupon each macroblock of each slice is processed by the network of operation blocks shown in Fig. 9. It shall be pointed out that macroblocks may be processed efficiently in parallel if there are different slices in a video picture.
  • The association of the macroblocks with slice groups and slices is effected by means of an encoder control block 82 in Fig. 9. The various slices are defined as follows:
  • I slice: a slice wherein all macroblocks of the slice are coded using intra prediction.
  • P slice: in addition to the coding types of the I slice, certain macroblocks of the P slice may also be coded using inter prediction with at least one motion-compensated prediction signal per prediction block.
  • B slice: in addition to the coding types available in the P slice, certain macroblocks of the B slice may also be coded using inter prediction with two motion-compensated prediction signals per prediction block.
  • SP slice: also referred to as switch-P slice; it is coded such that efficient switching between different precoded pictures becomes possible.
  • SI slice: also referred to as switch-I slice; it enables an exact match with the macroblocks of an SP slice for direct random access and for error-recovery purposes.
  • Slices are a sequence of macroblocks which are processed in the order of a raster scan, unless the property of flexible macroblock ordering (FMO), which is also defined in the standard, is used.
  • a picture may be partitioned into one or several slices, as is shown in Fig. 11.
  • Slices are independent of each other in the sense that their syntax elements may be parsed from the bit stream, and the sample values in that area of the picture which is represented by the slice may be decoded correctly without any data from other slices, provided that the reference pictures used are identical both in the encoder and in the decoder.
  • However, certain information from other slices may be required for applying the deblocking filter across slice boundaries.
  • the FMO property modifies the manner of partitioning pictures into slices and macroblocks by applying the concept of slice groups.
  • Each slice group is a set of macroblocks defined by a macroblock-to-slice-group map specified by the content of a picture parameter set and by specific information from slice headers.
  • This macroblock-to-slice-group map consists of a slice-group identification number for each macroblock in the picture, it being specified to which slice group the associated macroblock belongs.
  • Each slice group may be partitioned into one or several slices, so that a slice comprises a sequence of macroblocks within the same slice group that is processed in the order of a raster scan within the set of macroblocks of a special slice group.
  • Each macroblock may be transferred in one of several coding types, depending on the slice coding type.
  • The following types of intra coding, referred to as intra_4x4 and intra_16x16, are supported, a chroma prediction mode and an I_PCM prediction mode being supported as well.
  • The intra_4x4 mode is based on the separate prediction of each 4x4 luma block and is suitable for coding parts of a picture having outstanding detail.
  • The intra_16x16 mode, by contrast, performs a prediction of the entire 16x16 luma block and is thus more suitable for coding "soft" areas of a picture.
  • The I_PCM coding type allows the encoder to simply skip the prediction as well as the transform coding and, instead, to transfer the values of the coded samples directly.
  • The I_PCM mode serves the following purposes: it enables the encoder to depict the values of the samples precisely; it provides a manner of representing the values of very anomalous picture content without data magnification; and it makes it possible to specify a hard limit on the number of bits an encoder must handle for a macroblock without negatively impacting coding efficiency.
  • Intra prediction is always performed in the spatial domain, to be precise by referring to adjacent samples of previously coded blocks located to the left of, or above, the block to be predicted (Fig. 10).
  • In error-prone environments this may lead to error propagation, this error propagation occurring in inter-coded macroblocks due to the motion compensation. Therefore, a limited intra-coding mode may be signaled which enables prediction only from intra-coded adjacent macroblocks.
  • any 4x4 block is predicted from spatially adjacent samples.
  • the 16 samples of the 4x4 block are predicted using the previously decoded samples in adjacent blocks.
  • One of 9 prediction modes may be used for each 4x4 block.
  • Besides DC prediction, 8 directional prediction modes are specified. These modes are suitable for predicting directional structures in a picture, such as edges at various angles.
  • P macroblock types: in addition to the intra macroblock coding types, various predictive or motion-compensated coding types are specified as P macroblock types. Each P macroblock type corresponds to a specific partitioning of the macroblock into the block shapes used for motion-compensated prediction. Partitions with luma block sizes of 16x16, 16x8, 8x16 and 8x8 samples are supported by the syntax. In the event of partitions of 8x8 samples, an additional syntax element is transmitted for each 8x8 partition. This syntax element specifies whether the corresponding 8x8 partition is further partitioned into partitions of 8x4, 4x8 or 4x4 luma samples and corresponding chroma samples.
  • The prediction signal for each predictively coded MxM luma block is obtained by displacing an area of the corresponding reference picture, the displacement being specified by a translational motion vector and a picture reference index.
  • The quantization parameter SliceQP is used for specifying the quantization of the transform coefficients in H.264/AVC.
  • The parameter may take on 52 values. These values are arranged such that an increase of 1 in the quantization parameter signifies an increase of the quantization step size by about 12 %. This means that an increase in the quantization parameter by 6 entails an increase in the quantizer step size by a factor of exactly 2. It shall be pointed out that a change in the step size by about 12 % also signifies a reduction of the bit rate by about 12 %. This arithmetic is illustrated in the sketch below.
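The step-size progression just described is easy to verify numerically. The following sketch is illustrative and not part of the patent; the base step size of 0.625 at QP = 0 is an assumed approximation, not a normative table value:

```python
BASE_STEP = 0.625  # assumed quantizer step size at QP = 0 (illustrative only)

def quantizer_step(qp: int) -> float:
    """Approximate quantizer step size for one of the 52 allowed QP values."""
    assert 0 <= qp <= 51, "the parameter may take on 52 values"
    # each QP increment scales the step by 2**(1/6), i.e. about 12 %
    return BASE_STEP * 2.0 ** (qp / 6.0)

if __name__ == "__main__":
    print(quantizer_step(7) / quantizer_step(6))   # ~1.122: one step, about +12 %
    print(quantizer_step(12) / quantizer_step(6))  # 2.0: six steps double the step size
```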
  • The quantized transform coefficients of a block are generally scanned along a zigzag path and processed further using entropy coding methods.
  • The 2x2 DC coefficients of the chroma component are scanned in raster-scan order, and all inverse transform operations within H.264/AVC may be implemented using only additions and shifts of 16-bit integer values. Similarly, only 16-bit memory accesses are required for a good implementation of the forward transform and of the quantization method in the encoder.
  • For each picture of the video sequence considered, the input picture is initially partitioned into the macroblocks of 16x16 pixels.
  • Each picture is fed to a subtracter 84 which subtracts, from the original picture, a prediction picture provided by a decoder 85 contained within the encoder.
  • The subtraction result, i.e. the residual signal in the spatial domain, is then transformed, scaled and quantized (block 86) to obtain the quantized transform coefficients on line 81b.
  • The quantized transform coefficients are initially rescaled and inversely transformed (block 87) so as to be fed to an adder 88, the output of which feeds the deblocking filter 89; at the output of the deblocking filter, the output video signal as it will be decoded, e.g., by a decoder (output 90) may be monitored for control purposes.
  • A motion estimation is then performed in a block 91, to which a picture of the original input video signal is supplied, as may be seen from Fig. 9.
  • The standard allows two different motion estimations, i.e. a forward-motion estimation and a backward-motion estimation.
  • Forward-motion estimation involves estimating the motion of the current picture with regard to the previous pictures.
  • Backward-motion estimation involves estimating the motion of the previous picture using the current picture.
  • the results of the motion estimation (block 91) are fed to a motion-compensation block 92 which performs a motion-compensated inter prediction particularly when a switch 93 is switched to the inter prediction mode, as is the case in Fig. 9.
  • If the switch 93 is set to intra-frame prediction, an intra-frame prediction is performed using a block 490.
  • The motion data are not needed for this purpose, since no motion compensation is performed for an intra-frame prediction.
  • The motion estimation block 91 generates motion data and/or motion fields, the motion data and/or motion fields, which consist of motion vectors, being transmitted to the decoder so that a corresponding inverse prediction, i.e. reconstruction, may be performed using the transform coefficients and the motion data.
  • In the case of a forward prediction, the motion vector may be calculated from the immediately preceding picture and/or also from several preceding pictures.
  • In the case of a backward prediction, a current picture may be calculated using the immediately adjoining future picture and, of course, also using further future pictures.
  • the video coding concept represented in Fig. 9 has the disadvantage that it offers no simple possibility of scalability.
  • the expression "scalability" is used to refer to an encoder/decoder concept wherein the encoder provides a scaled data stream.
  • the scaled data stream includes a base scaling layer, as well as one or several extension scaling layers.
  • the base scaling layer includes a representation of the signal to be coded, which is, generally speaking, of lower quality, but also has a lower data rate.
  • the extension scaling layer contains a further representation of the video signal, which representation typically provides, along with the representation of the video signal in the base scaling layer, a representation of improved quality with regard to the base scaling layer.
  • the extension scaling layer has a bit demand of its own, of course, so that the number of bits required for representing the signal to be coded increases with each extension layer.
  • A decoder will either decode only the base scaling layer to provide a comparatively poor-quality representation of the picture signal represented by the coded signal, or, with each "addition" of a further scaling layer, it may improve the quality of the signal step by step (at the expense of the bit rate and the delay).
  • The base scaling layer is always transmitted, since the bit rate of the base scaling layer is typically so low that even a transmission channel of limited bandwidth will be sufficient. If the transmission channel does not provide a larger bandwidth for the application, only the base scaling layer will be transmitted, but no extension scaling layer.
  • As a consequence, the decoder is only able to produce a low-quality representation of the picture signal. Compared to the unscaled case, wherein the data rate would have been so high that a transmission by the transmission system would not have been possible at all, the low-quality representation is advantageous. If the transmission channel allows the transmission of one or several extension layers, the encoder will transmit one or several extension layers to the decoder as well, so that the latter may increase the quality of the video signal output step by step, depending on the request.
  • One scaling is temporal in the sense that, e.g., not all individual video pictures of a video sequence are transmitted, but that - in order to reduce the data rate - e.g., only one in two, one in three, one in four etc. pictures is transmitted.
  • Another scaling is SNR (signal-to-noise ratio) scalability, wherein each scaling layer, i.e. both the base scaling layer and the first, second, third, ... extension scaling layers, includes all temporal information, but with differing qualities.
  • This object is achieved by an apparatus for coding a group of successive pictures as claimed in claim 1, by an apparatus for decoding a coded signal as claimed in claim 17, by a method for coding a group of successive pictures as claimed in claim 20, by a method for decoding a picture signal as claimed in claim 21, by a filter device as claimed in claim 22, by an inverse filter device as claimed in claim 23, and by a method for filtering as claimed in claim
  • The present invention is based on the finding that the closed loop in the video encoder, which is problematic with regard to scalability considerations, may be broken up by moving away from processing pictures on a picture-by-picture basis and performing group-wise processing of pictures instead.
  • The group of pictures is decomposed into high-pass pictures and low-pass pictures, respectively, using a lifting scheme having several filter levels, so that it is no longer residual signals of individual pictures, as in the prior art, that are transformed, scaled and quantized, but high-pass pictures and/or low-pass pictures, which are then subjected to entropy coding.
  • A subband filter, preferably implemented as a wavelet filter, is thus provided upstream, as it were, in the video encoder, the filter decomposition breaking up the closed loop of the standardized decoder, so that temporal or SNR scalability may readily be implemented.
  • Further processing, i.e. transformation, scaling, quantization and entropy coding, is no longer performed on residual signals in a spatial respect, but on residual signals, i.e. high-pass signals, in a temporal respect, since the high-pass/low-pass filtering is performed across the various filter levels in a temporal respect, i.e. seen across the group of pictures.
  • the present invention preferably provides an SNR scalable extension of the H.264/AVC video standard.
  • The temporal interdependence between pictures is coded using a subband approach operating with an open loop, i.e. without the problematic closed loop.
  • Most components of H.264/AVC are used as specified in the standard, only few changes being required with regard to the subband-coder structure.
  • each filter level includes a backward predictor, on the one hand, and a forward predictor, on the other hand.
  • The backward predictor performs a backward-motion compensation.
  • the forward predictor performs a forward-motion compensation.
  • the output signal of the backward predictor is subtracted from a picture of the original group of pictures so as to obtain a high-pass picture.
  • The high-pass picture is fed to the forward predictor as an input signal so as to obtain, on the output side, a forward-predicted signal which is added to the picture signal representing the other picture so as to obtain a low-pass picture; this low-pass picture will then be decomposed again into a high-pass signal and a low-pass signal by means of a lower filter level, the low-pass signal including the similarities of the two pictures considered, whereas the high-pass signal includes the differences of the two pictures considered.
  • A filter level requires at least the processing of a group of two pictures. If several filter levels exist, a group of, for example, four pictures is required for two filter levels. For three filter levels, a group of eight pictures is required. For four filter levels, 16 pictures should be grouped into a group of pictures and be processed together. It shall be pointed out that the group of pictures may be selected to have any size desired, but that a group for at least two filter levels should include at least four pictures. Depending on the application, large group sizes are preferred, it being necessary in this case, however, to have correspondingly larger subband filterbanks for the decomposition on the encoder side and for the composition on the decoder side; the sketch below illustrates this group-size arithmetic.
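As a small illustration (not from the patent), the relation between the number of filter levels and the required group size can be written down directly:

```python
def required_group_size(filter_levels: int) -> int:
    """k filter levels require a group of 2**k pictures (2, 4, 8, 16, ...)."""
    return 2 ** filter_levels

def pictures_per_level(filter_levels: int):
    """Yield (level, count) pairs: each level halves its input pictures into
    `count` high-pass and `count` low-pass pictures."""
    n = required_group_size(filter_levels)
    for level in range(filter_levels, 0, -1):
        n //= 2
        yield level, n

if __name__ == "__main__":
    for level, n in pictures_per_level(4):  # a 16-picture group, as in Fig. 3
        print(f"level {level}: {n} high-pass + {n} low-pass pictures")
```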
  • Thus, a temporal subband coding of the video sequences is performed, in accordance with the invention, before the actual video coding in accordance with the H.264/AVC standard, so that the quantization using a quantization step size is taken out of the closed loop shown in Fig. 9, to the effect that simple SNR scalability may now be achieved.
  • Individual scaling layers may now be readily produced using various individual quantizer step sizes.
  • The inventive decomposition is based on the lifting representation of a filterbank. This lifting representation of the temporal subband decomposition allows the use of known methods for motion-compensated prediction.
  • Thus, most other components of a hybrid video encoder such as H.264/AVC may be utilized without modification, while only a few parts need to be changed.
  • Fig. 1 shows a block diagram of an inventive encoder for coding a group of successive pictures
  • Fig. 2 depicts a block diagram of an inventive decoder for decoding a picture signal
  • Fig. 3 shows a block diagram of an inventive encoder/decoder structure in accordance with a preferred embodiment of the present invention, having four levels;
  • Fig. 4 shows a block diagram for illustrating the lifting decomposition of a temporal subband filterbank
  • Fig. 5 shows a block diagram for illustrating time scaling
  • Fig. 6 shows a block diagram for illustrating SNR scaling
  • Fig. 7 depicts an overview diagram for illustrating the temporal decomposition of a group of, e.g., eight pictures
  • Fig. 8 shows a preferred temporal placement of low-pass pictures for a group of 16 pictures
  • Fig. 9 shows an overview block diagram for illustrating the fundamental coding structure for an encoder in accordance with the H.264/AVC standard for a macroblock
  • Fig. 10 shows a context arrangement consisting of two adjacent pixel elements A and B to the left-hand side and/or above a current syntax element C;
  • Fig. 11 depicts a representation of the partition of a picture into slices.
  • Fig. 1 shows an apparatus for coding a group of successive pictures, the group comprising at least first, second, third and fourth pictures.
  • the group of pictures is fed into an input 10 of a second filter level 12.
  • the second filter level is configured to produce, from the group of pictures, high-pass pictures at a first output 14, and low- pass pictures at a second output 16, of the filter level 12.
  • the second filter level 12 is configured to produce, from the first and second pictures of the group of pictures, a first high-pass picture of the second level and a first low-pass picture of the second level, and to produce, from the third and fourth pictures of the group of pictures, a second high-pass picture of the second level and a second low-pass picture of the second level.
  • the second filter level 12 thus produces, on the output side of the high-pass output 14, two high-pass pictures of the second level, and produces, on the output side of the low-pass output 16, two low-pass pictures of the second level.
  • The high-pass pictures of the second level are fed to second further-processing means 18.
  • The second further-processing means are implemented to further process the first high-pass picture of the second level as well as the second high-pass picture of the second level, the second further-processing means 18 comprising a quantizer which has a quantizer step size.
  • The second further-processing means 18 are configured to perform the functionalities of blocks 86, 80 of Fig. 9.
  • The output signal of the second further-processing means, i.e. a quantized and, as the case may be, entropy-coded representation of the high-pass pictures of the second level, is written into the output-side bit stream in the event of SNR scalability.
  • this signal already represents the first extension scaling layer. If an encoder produces only a base scaling layer, no further processing of the high-pass pictures will be necessary, which is why the connection between block 12 and block 18 is depicted as a dashed line.
  • Both low-pass pictures of the second level which are output at the low-pass output 16 of the second filter level are fed into an input of a first filter level 20.
  • the first filter level 20 is configured to produce, from the first and second low-pass pictures of the second level, a first high-pass picture of the first level and a first low-pass picture of the first level.
  • the first high-pass picture of the first level produced by filter level 20 is output at a high-pass output 22 of the first filter level.
  • the first low-pass picture of the first level is output at a low-pass output 24 of the first filter level.
  • Both signals are fed to first further-processing means 26 for further processing of the first high-pass picture of the first level and of the first low-pass picture of the first level, so as to obtain a coded picture signal at an output 28, the first further-processing means 26 comprising, as has already been depicted by means of the second further-processing means 18, a quantizer which has a quantizer step size.
  • the quantizer step sizes of the second further-processing means 18 and of the first further- processing means 26 are identical.
  • the output signal of the first further-processing means 26 thus includes the high-pass picture of the first level and the low-pass picture of the first level and thus represents the base scaling layer for the purposes of temporal scalability.
  • the coded picture signal at the output 28, along with the coded high-pass pictures of the second level at the output of the second further-processing means 18, represents the output signal, which is a base scaling layer.
  • Fig. 2 shows an apparatus for decoding a coded signal, to be precise for decoding that signal which is output as a coded picture signal at the output 28 of Fig. 1.
  • the coded picture signal at the output 28 includes a quantized and entropy-coded representation of the first high-pass picture and of the first low-pass picture of the first level.
  • This information, i.e. the signal at the output 28, is fed into inverse further-processing means 30 performing inverse further processing using, and with knowledge of, the quantization step size used by the first further-processing means 26 of Fig. 1, this inverse further processing including, for example, entropy decoding, inverse quantization as well as an inverse transform etc., as is known in the art and as is preset by an encoder and/or by the further-processing means used in the encoder.
  • a reconstructed version of the first high-pass picture as well as a reconstructed version of the first low-pass picture will then be present at an output of the inverse further-processing means 30.
  • These two reconstructed versions of the first high-pass picture and of the first low-pass picture are fed into an inverse filtering level 32, the inverse filtering level 32 being configured to filter the reconstructed version of the first high-pass picture and the reconstructed version of the first low-pass picture in an inverse manner so as to obtain a reconstructed version of a first low-pass picture and of a second low-pass picture of a level which is one up, i.e. of the second level.
  • These reconstructed versions of the first and second low-pass pictures of the second level represent the base layer.
  • the output signal of the second further-processing means 18 of Fig. 1 is, in addition, also subjected to inverse further processing using the corresponding quantization step size, so as then to be fed, along with the reconstructed versions of the first and second low-pass pictures of the second level, into an inverse filtering level of a next order, as will be explained below in more detail, for example with reference to Fig. 3.
  • Wavelet-based video coding algorithms wherein lifting implementations are employed for wavelet analysis and for wavelet synthesis have been described in J.-R. Ohm, "Complexity and delay analysis of MCTF interframe wavelet structures", ISO/IEC JTC1/WG11 Doc. M8520, July 2002. Annotations on scalability may also be found in D. Taubman, "Successive refinement of video: fundamental issues, past efforts and new directions", Proc. of SPIE (VCIP'03), vol. 5150, pp. 649-663, 2003, the latter, however, requiring considerable changes to encoder structures.
  • an encoder/decoder concept is achieved which exhibits the possibility of scalability, on the one hand, and which may be built, on the other hand, on elements conforming to the standard, in particular, e.g., for motion compensation.
  • Generally, the lifting scheme consists of three steps: the polyphase decomposition step, the prediction step and the update step, as is represented by means of the encoder on the left-hand side of Fig. 4.
  • The polyphase decomposition stage is represented by a first area I, the prediction step by a second area II, and the update step by a third area III.
  • The decomposition stage includes partitioning the input-side data stream into an identical first copy for a lower branch 40a as well as an identical copy for an upper branch 40b.
  • The identical copy of the upper branch 40b is delayed by one time stage (z⁻¹), so that a sample s[2k+1] having an odd index runs through a respective decimator, or downsampler, 42a and/or 42b at the same point in time as a sample s[2k] having an even index.
  • The decimator 42a and/or 42b reduces the number of samples in the upper and lower branches 40b, 40a, respectively, by eliminating every other sample.
  • The second area II, which relates to the prediction step, includes a prediction operator 43 as well as a subtracter 44.
  • The third area III, i.e. the update step, includes an update operator 45 as well as an adder 46.
  • The polyphase decomposition results in the even and the odd samples of a given signal s[k] being separated. Since the correlation structure typically shows a local characteristic, the even and odd polyphase components are highly correlated. Therefore, in a subsequent step, a prediction (P) of the odd samples is performed using the even samples. The corresponding prediction operator (P) for each odd sample is a linear combination of the adjacent even samples, the prediction residual forming the high-pass signal h[k] = s[2k+1] - P(s_even)[k].
  • The prediction step is equivalent to performing a high-pass filter of a two-channel filterbank, as is set forth in I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps", J. Fourier Anal. Appl., vol. 4 (no. 3), pp. 247-269, 1998.
  • As a second step, a low-pass filtering is performed by replacing the even samples with a linear combination of the previously computed prediction residuals h[k]; the corresponding update operator (U) yields the low-pass signal l[k] = s[2k] + U(h)[k].
  • The given signal s[k] may eventually be represented by l[k] and h[k], each signal, however, having half the number of samples. Since both the update step and the prediction step are fully invertible, the corresponding transform may be interpreted as a critically sampled perfect-reconstruction filterbank. It may indeed be demonstrated that any biorthogonal family of FIR filters may be realized by a sequence of one or several prediction steps and one or several update steps; a minimal sketch of this invertibility is given below.
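The invertibility holds for any choice of the operators P and U: applying them in reverse order and with inverse signs recovers the input exactly. A minimal Python/NumPy sketch, using the unnormalized Haar operators as an assumed example:

```python
import numpy as np

def lifting_analysis(s, predict, update):
    even, odd = s[0::2], s[1::2]   # polyphase decomposition (area I)
    h = odd - predict(even)        # prediction step -> high-pass signal h[k]
    l = even + update(h)           # update step -> low-pass signal l[k]
    return l, h

def lifting_synthesis(l, h, predict, update):
    even = l - update(h)           # reverse update (inverse sign)
    odd = h + predict(even)        # reverse prediction (inverse sign)
    s = np.empty(l.size + h.size)
    s[0::2], s[1::2] = even, odd   # polyphase recomposition
    return s

if __name__ == "__main__":
    s = np.random.randn(16)
    haar_p = lambda even: even     # Haar prediction: copy the even neighbour
    haar_u = lambda h: 0.5 * h     # Haar update
    l, h = lifting_analysis(s, haar_p, haar_u)
    s_rec = lifting_synthesis(l, h, haar_p, haar_u)
    assert np.allclose(s_rec, s)   # perfect reconstruction, as claimed above
```

Any biorthogonal FIR filter pair, including the 5/3 operators given further below, may be substituted for P and U without losing perfect reconstruction.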
  • The normalizers 47 and 48 are supplied with suitably selected scaling factors F_l and F_h, as has been set forth.
  • The inverse lifting scheme corresponding to the synthesis filterbank is shown on the right-hand side in Fig. 4. It simply consists of applying the prediction and update operators in reverse order and with inverse signs, followed by the reconstruction using the even and odd polyphase components.
  • The decoder shown on the right-hand side in Fig. 4 thus, again, includes a first decoder area I, a second decoder area II as well as a third decoder area III.
  • the first decoder area reverses the action of the update operator 45. This is effected by feeding the high-pass signal, which has been normalized back by a further normalizer 50, to the update operator 45.
  • the output signal of the decoder-side update operator 45 is then fed to a subtracter 52 rather than to the adder 46 in Fig. 4.
  • The same procedure is followed for the output signal of the predictor 43, whose output signal is no longer fed to a subtracter, as on the encoder side, but is now fed to an adder 53.
  • the signal is upsampled by a factor of 2 in each branch (blocks 54a, 54b) .
  • The upper branch is shifted towards the future by one sample, which is equivalent to delaying the lower branch; the data streams on the upper branch and the lower branch are then added in an adder 55 so as to obtain the reconstructed signal s[k] at the output of the synthesis filterbank.
  • the low-pass and the high-pass analysis filters of this wavelet have 5 and 3 filter taps, respectively, the corresponding scaling function being a B spline of the order 2.
  • this wavelet is used for a temporal subband coding scheme.
  • The corresponding prediction and update operators of the 5/3 transform are given as P(s_even)[k] = (s[2k] + s[2k+2]) / 2 and U(h)[k] = (h[k] + h[k-1]) / 4, as illustrated in the sketch below.
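These 5/3 operators plug directly into the generic lifting sketch given earlier. The following self-contained illustration is an assumption-laden sketch, not the patent's implementation; in particular, the boundary handling by edge repetition is chosen here only so the example runs on a finite signal:

```python
import numpy as np

def predict_53(even: np.ndarray) -> np.ndarray:
    """P(s_even)[k] = (s[2k] + s[2k+2]) / 2; edges extended by repetition."""
    right = np.append(even[1:], even[-1])
    return 0.5 * (even + right)

def update_53(h: np.ndarray) -> np.ndarray:
    """U(h)[k] = (h[k] + h[k-1]) / 4; edges extended by repetition."""
    left = np.concatenate(([h[0]], h[:-1]))
    return 0.25 * (h + left)

if __name__ == "__main__":
    s = np.random.randn(16)
    even, odd = s[0::2], s[1::2]
    h = odd - predict_53(even)   # high-pass (prediction step)
    l = even + update_53(h)      # low-pass (update step)
    even_r = l - update_53(h)    # inverse: reverse order, inverse signs
    odd_r = h + predict_53(even_r)
    assert np.allclose(even_r, even) and np.allclose(odd_r, odd)
```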
  • Fig. 3 shows a block diagram of an inventive encoder/decoder structure with four exemplary filter levels both on the side of the encoder and on the side of the decoder. It may be seen from Fig. 3 that the first, second, third and fourth filter levels are identical in relation to the encoder. In relation to the decoder, the filter levels are also identical. On the encoder side, each filter level includes, as central elements, a backward predictor M_i0 60 as well as a forward predictor M_i1 61. In principle, the backward predictor 60 corresponds to the predictor 43 of Fig. 4, whereas the forward predictor 61 corresponds to the update operator 45 of Fig. 4.
  • Fig. 4 relates to a stream of samples, wherein a sample has an odd index 2k+l, whereas another sample has an even index 2k.
  • the notation given in Fig. 3 relates to a group of pictures rather than a group of samples, as has been set forth with reference to Fig. 1. If a picture has, e.g., a number of samples or pixels, this picture is fed in as a whole. Then the next picture is fed in, etc. Thus, there are no more odd and even samples, but odd and even pictures.
  • the lifting scheme described for odd and even samples is applied to odd and even pictures, respectively, each of which has a plurality of samples.
  • The sample-wise predictor 43 of Fig. 4 thus becomes the backward-motion compensation prediction 60, whereas the sample-wise update operator 45 becomes the picture-wise forward-motion compensation prediction 61.
  • The motion filters, which consist of motion vectors and which represent coefficients for the blocks 60 and 61, are in each case calculated for two interrelated pictures and are transmitted as side information from the encoder to the decoder.
  • The elements 91, 92, as have been described with reference to Fig. 9 and as are standardized in the H.264/AVC standard, may readily be used to calculate both the motion fields M_i0 and the motion fields M_i1. Therefore, no new predictor/update operator must be utilized for the inventive concept; rather, the algorithm which already exists, which has been examined and checked for functionality and efficiency, and which is specified in the video standard, may be used for the motion compensation in the forward or backward direction.
  • the general structure of the filterbank used which structure is represented in Fig. 3, shows a temporal decomposition of the video signal with a group of 16 pictures fed in at an input 64.
  • The number of pictures in a group, i.e. the group size, may be increased accordingly, for example to 32, 64, etc. pictures.
  • Use is made of the Haar-based, motion-compensated lifting scheme, which consists of a backward-motion compensation prediction (M_i0), as in H.264/AVC, and which further includes an update step comprising a forward motion compensation (M_i1).
  • Both the prediction step and the update step utilize the motion-compensation process as it is represented in H.264/AVC.
  • The second filter level includes downsamplers 66a, 66b, a subtracter 69, a backward predictor 67, a forward predictor 68 as well as an adder 70 and, as has already been represented with regard to Fig. 1, the further-processing means 18, so that the first and second high-pass pictures of the second level are output at an output of the further-processing means 18, whereas the first and second low-pass pictures of the second level are output at the output of the adder 70.
  • the inventive encoder in Fig. 3 additionally includes a third level as well as a fourth level, a group of 16 pictures being fed into the input 64 of the fourth level.
  • Eight high-pass pictures which have been quantized with a quantization parameter Q and which have been processed further accordingly are output at a high-pass output 72 of the fourth level, which output is also referred to as HP4.
  • eight low-pass pictures are output at a low-pass output 73 of the fourth filter level, which eight low-pass pictures are fed into an input 74 of the third filter level.
  • The third level is operative to produce four high-pass pictures at a high-pass output 75, also referred to as HP3, and to produce, at a low-pass output 76, four low-pass pictures which are fed into the input 10 of the second filter level and are decomposed there, as has been set forth with reference to Figs. 3 and 1.
  • the group of pictures processed by one filter level need not necessarily be video pictures stemming from an original video sequence, but may also be low-pass pictures having been output at a low-pass output of the filter level from a filter level which is one up.
  • the encoder concept designed for 16 pictures may readily be reduced to eight pictures by simply omitting the fourth filter level and feeding the group of pictures into the input 74.
  • the concept shown in Fig. 3 may also readily be expanded to a group of 32 pictures by adding a fifth filter level and by outputting the high-pass pictures, of which there will then be 16, at a high-pass output of the fifth filter level, and by feeding the 16 low-pass pictures into the input 64 of the fourth filter level at the output of the fifth filter level.
  • the tree-like concept of the encoder side is also applied on the decoder side, but now it no longer goes from the higher level to the lower level, as on the encoder side, but, on the decoder side, from the lower level to the higher level.
  • the data stream is received from a transmission medium schematically referred to as network abstraction layer 100, and the bit stream received is initially subjected to inverse further processing using the inverse further-processing means 30a, 30b, so as to obtain a reconstructed version of the first high-pass picture of the first level at the output of means 30a, and a reconstructed version of the low-pass picture of the first level at the output of block 30b of Fig. 3.
  • The forward-motion compensation prediction will then initially be reversed by means of predictor 61 so as to subtract the output signal of predictor 61 from the reconstructed version of the low-pass signal (subtracter 101).
  • the output signal of subtracter 101 is fed into a backward compensation predictor 60 so as to produce a prediction result which will be added, in an adder 102, to the reconstructed version of the high-pass picture.
  • Both signals, i.e. the signals in the upper and lower branches 103a, 103b, are brought to double the sampling rate, to be precise using the upsamplers 104a, 104b, the signal on the upper branch then being delayed and/or "accelerated", depending on the implementation.
  • The upsampling is performed by the blocks 104a, 104b simply by inserting a number of zeros corresponding to the number of samples of one picture.
  • The shift by the delay of one picture, effected by the element shown as z⁻¹ in the upper branch 103b in relation to the lower branch 103a, has the effect that the addition by an adder 106 results in both low-pass pictures of the second level being present in succession on the output side of the adder 106.
  • the reconstructed versions of the first and second low-pass pictures of the second level are then fed into the decoder- side inverse filter of the second level, where they are combined, along with the transmitted high-pass pictures of the second level, again by the identical implementation of the inverse filterbank so as to have a sequence of four low-pass pictures of the third level at an output 108 of the second level.
  • the four low-pass pictures of the third level are combined, in an inverse filter level of the third level, with the transmitted high-pass pictures of the third level so as to have eight low-pass pictures of the fourth level in a successive format at an output 110 of the inverse filter of the third level.
  • Finally, these eight low-pass pictures of the fourth level are combined, in an inverse filter of the fourth level, with the eight high-pass pictures of the fourth level received from the transmission medium 100 via the input HP4, again as was discussed with reference to the first level, so as to obtain a reconstructed group of 16 pictures at an output 112 of the inverse filter of the fourth level.
  • Thus, in each step of the analysis filterbank, two pictures, i.e. either original pictures or pictures representing low-pass signals produced in a level which is one up, are decomposed into a low-pass signal and a high-pass signal.
  • The low-pass signal may be considered as a representation of the similarities of the input pictures, whereas the high-pass signal may be considered as a representation of the differences between the input pictures.
  • In the synthesis step, both input pictures are reconstructed using the low-pass and the high-pass signals. Since the inverse operations of the analysis step are performed in the synthesis step, the analysis/synthesis filterbank guarantees perfect reconstruction (in the absence of quantization, of course).
  • The only losses occurring are due to the quantization in the further-processing means, e.g. 26a, 26b, 18. If the quantization effected is very fine, a good signal/noise ratio is achieved. If, on the other hand, the quantization effected is very coarse, a relatively poor signal/noise ratio is achieved, however at a low bit rate, i.e. at a low bit demand.
  • For temporal scalability, a time scaling control 120 is utilized, as shown in Fig. 5, the control 120 being configured to obtain, on the input side, the high-pass and low-pass outputs, respectively, and/or the outputs of the further-processing means (26a, 26b, 18, ...), so as to produce, from these partial data streams TP1, HP1, HP2, HP3, HP4, a scaled data stream exhibiting, in a base scaling layer, the further-processed version of the first low-pass picture and of the first high-pass picture.
  • A first extension scaling layer could then accommodate the further-processed version of the second high-pass pictures.
  • A second extension scaling layer could then accommodate the further-processed versions of the high-pass pictures of the third level, whereas a third extension scaling layer includes the further-processed versions of the high-pass pictures of the fourth level.
  • a decoder could produce, merely on the basis of the base scaling layer, a sequence of low-pass pictures of a low level, the sequence being of low quality in terms of time, i.e. two low-pass pictures of the first level per group of pictures.
  • With each extension scaling layer added, the number of pictures reconstructed per group may always be doubled.
  • the functionality of the decoder is typically controlled by a scaling control configured to recognize how many scaling layers are contained in the data stream, and/or how many scaling layers are to be taken into account by the decoder in decoding.
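As a small illustration of this doubling (the source frame rate of 30 Hz is assumed for concreteness and is not from the patent):

```python
GROUP_SIZE = 16
FULL_RATE_HZ = 30.0  # assumed source frame rate, for illustration only

# base layer (TP1 + HP1) yields 2 pictures per group;
# each extension layer (HP2, HP3, HP4) doubles that count
layers = ["base (TP1+HP1)", "ext. 1 (+HP2)", "ext. 2 (+HP3)", "ext. 3 (+HP4)"]
for k, name in enumerate(layers, start=1):
    pictures = 2 ** k
    print(f"{name}: {pictures}/{GROUP_SIZE} pictures "
          f"-> {FULL_RATE_HZ * pictures / GROUP_SIZE:.2f} Hz")
```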
  • An SNR scaling functionality as is preferred for the present invention will be explained in more detail below with reference to Fig. 6.
  • The SNR scalability in accordance with Fig. 6 will be explained using the signals HP4, HP3, HP2, HP1, TP1 output in Fig. 3.
  • An encoder provided with SNR scalability additionally includes, apart from the elements depicted in Fig. 3, inverse further-processing means 140 which contain, as a matter of principle, the elements designated by Q⁻¹ on the decoder side of Fig. 3. However, it shall be pointed out that the inverse further-processing means 140 are provided on the encoder side in order to provide the SNR scalability.
  • The inverse further-processing means 140 produce an inversely quantized and inversely further-processed representation of the low-pass pictures of the first level TP1, of the high-pass pictures of the first level HP1, of the high-pass pictures of the second level, of the high-pass pictures of the third level and of the high-pass pictures of the fourth level.
  • These inversely quantized video signals are fed to a component-wise subtracter 142, which at the same time receives the first low-pass picture TP1, the first high-pass picture HP1, the high-pass pictures of the second level, the high-pass pictures of the third level and the high-pass pictures of the fourth level prior to quantization, i.e. as present at the output of the subtracter 69 and of the adder 70, respectively, of Fig. 3.
  • A subtraction is performed component by component, i.e., for example, the first inversely quantized low-pass picture of the first level is subtracted from the first low-pass picture of the first level prior to quantization, etc., so as to obtain respective quantization errors, F_TP1 for the first low-pass picture of the first level, F_HP1 for the first high-pass picture of the first level, etc.
  • These quantization errors, which are due to the fact that originally a quantization with the first quantization step size was performed, are fed into further-processing means 144 performing further processing of, preferably, all quantization error signals F using a second quantizer step size smaller than the first quantizer step size.
  • The input signal into the inverse further-processing means 140 represents the base scaling layer which may be transmitted, just like the first extension scaling layer, into the medium 100, or, in logical terms, into the network abstraction layer, where additional signal-format manipulations etc. may be performed, depending on the form of implementation.
  • the first extension scaling layer is inversely further processed again using the second quantizer step size on the basis of block 144.
  • the result will then be added to the data which has been inversely quantized using the first quantizer step size, so as to produce a decoded representation of the picture signal on the basis of the base scaling layer and the first extension scaling layer.
  • This representation is then fed again into a component subtracter so as to calculate the remaining error on the basis of the quantization using the first quantizer step size and the second quantizer step size.
  • the error signal obtained therefrom is to be quantized again using a third quantizer step size smaller than the first and second quantizer step sizes, so as to produce a second extension scaling layer.
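The layering principle just described, quantize, subtract the inversely quantized result, and requantize the error with a finer step size, can be sketched in a few lines. This is an illustrative model (a uniform mid-tread quantizer is assumed), not the patent's further-processing means:

```python
import numpy as np

def quantize(x: np.ndarray, step: float) -> np.ndarray:
    return np.round(x / step) * step  # assumed uniform mid-tread quantizer

def snr_layers(signal: np.ndarray, steps: list[float]) -> list[np.ndarray]:
    """Each layer quantizes the error left by the coarser layers before it;
    `steps` must decrease, mirroring the ever finer quantizer step sizes."""
    layers, residual = [], signal.astype(float)
    for step in steps:
        q = quantize(residual, step)
        layers.append(q)
        residual = residual - q  # component-wise quantization error
    return layers

if __name__ == "__main__":
    x = np.random.randn(8) * 10
    layers = snr_layers(x, steps=[4.0, 1.0, 0.25])  # base + two extension layers
    recon = np.zeros_like(x)
    for i, layer in enumerate(layers, start=1):
        recon = recon + layer  # the decoder adds each inversely quantized layer
        print(f"after layer {i}: max error = {np.max(np.abs(x - recon)):.3f}")
```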
  • It shall be pointed out that every scaling layer may include not only the further-processed low-pass/high-pass pictures of the various levels, but, in addition, the motion field M_i0 used for the prediction steps as well as the motion field M_i1 used for the update steps as side information. It shall also be pointed out that a forward motion field as well as a backward motion field are produced for two successive pictures, respectively, these motion fields, or, generally speaking, this prediction information, being transferred, in addition to the further-processed high-pass/low-pass pictures, in the form of side information from the encoder to the decoder.
  • The motion fields M_i0 and M_i1 generally specify the motion between two pictures using a subset of the P-slice syntax of H.264/AVC.
  • For the motion fields M_i0 which are used by the prediction steps, it is preferred to include an INTRA-like macroblock-based fallback mode, wherein the (motion-compensated) prediction signal for a macroblock is specified, in a manner similar to the INTRA_16x16 macroblock mode of H.264/AVC, by a 4x4 array of luma transform coefficient levels and two 2x2 arrays of chroma transform coefficient levels, all AC coefficients being set to zero. This mode is not utilized in the motion fields M_i1 used for the update steps.
  • The motion field M specifies a macroblock mode which may be P_16x16, P_16x8, P_8x16, P_8x8 or INTRA.
  • In the INTRA mode, the generation of the prediction signal is specified by a 4x4 array of luminance coefficient levels and by two 2x2 arrays of chrominance coefficient levels.
  • Otherwise, the generation of the prediction signal is specified by a motion vector with quarter-sample accuracy for each macroblock or sub-macroblock partition.
  • The motion-compensated prediction signal P, which will then be subtracted and/or added, is structured in a macroblock-wise manner as will be described below:
  • The luma and chroma samples of the picture P covered by the corresponding macroblock or sub-macroblock partition are obtained by a motion-compensated prediction with quarter-sample accuracy which conforms to the standard ISO/IEC 14496-10 AVC, Doc. JVT-G050r1, May 2003:
  • p[i,j] = M_int(r, i - m_x, j - m_y),
  • [m x , m y ] ⁇ is the motion vector of the macroblock considered and/or of the sub-macroblock considered, given by M.
  • r[] is the array of luma or chroma samples of the reference picture N.
  • M_int() represents the interpolation process specified for the motion-compensated prediction in H.264/AVC, however with the exception that the clipping to the interval [0; 255] is eliminated.
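A greatly simplified sketch of the rule p[i, j] = M_int(r, i - m_x, j - m_y) is given below. It assumes integer-pel motion vectors, so the quarter-sample interpolation performed by M_int() is omitted, and border replication stands in for the standard's edge handling; as stated above, no clipping to [0; 255] is applied. All names are chosen for this example only.

    import numpy as np

    def mc_predict_partition(r, top, left, height, width, mv):
        # r: reference sample array; mv = (m_x, m_y), assumed integer-pel here.
        p = np.empty((height, width))
        for i in range(height):
            for j in range(width):
                y = min(max(top + i - mv[0], 0), r.shape[0] - 1)   # border
                x = min(max(left + j - mv[1], 0), r.shape[1] - 1)  # replication
                p[i, j] = r[y, x]        # deliberately no clipping to [0; 255]
        return p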
  • the given 4x4 array of luminance transform coefficient levels is treated as an array of DC luma coefficient levels for the INTRA_16x16 macroblock mode in H.264/AVC, the inverse scaling/transform process in accordance with H.264/AVC being used, to be precise using the given quantization parameter QP, it being assumed that all AC transform coefficient levels are set to zero.
  • a 16x16 array res[] of residual luma samples is obtained.
  • the luma samples of the prediction picture P which relate to the macroblock considered are calculated from res[]; the prediction signal p[] obtained is constant within each 4x4 block and represents an approximation of the average of the corresponding original 4x4 luma block.
  • the given 2x2 array of chrominance transform coefficient levels is treated as an array of DC chroma coefficient levels, the inverse scaling/transform process for chroma coefficients in accordance with the standard being utilized, to be precise using the given quantization parameter QP, it being assumed that all AC transform coefficient levels are zero.
  • QP is the given quantization parameter.
  • an 8x8 array res[] of residual chroma samples is obtained.
  • the chroma samples of the prediction picture P which relate to the macroblock considered are calculated from res[]; the prediction signal p[] obtained is constant within each 4x4 block and represents an approximation of the average of the corresponding original 4x4 chroma block.
  • the deblocking filter as defined in the standard is preferably applied to this picture, the derivation of the boundary filter strength being based only on the macroblock modes (INTRA information) and, in particular, on the motion vectors specified in the motion description M.
  • the clipping to the interval [0; 255] is eliminated.
  • a simplified INTRA-mode reconstruction is performed without INTRA prediction, all AC transform coefficient levels being additionally set to zero.
  • a simplified reconstruction is performed for the motion- compensated prediction modes without residual information.
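As a rough illustration of this simplified, DC-only INTRA reconstruction, the sketch below spreads each inversely scaled DC luma level as a constant over its 4x4 block. Scaling by a single step size is an assumed simplification of the H.264/AVC inverse scaling/transform process, and the function name is invented for this example.

    import numpy as np

    def intra_fallback_luma(dc_levels, qstep):
        # dc_levels: 4x4 array of quantized DC luma coefficient levels.
        # All AC levels are zero, so every 4x4 block of the 16x16 prediction
        # is constant -- an approximation of the corresponding block average.
        pred = np.empty((16, 16))
        for by in range(4):
            for bx in range(4):
                pred[4 * by:4 * by + 4, 4 * bx:4 * bx + 4] = dc_levels[by][bx] * qstep
        return pred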
  • Two input pictures A and B as well as the motion field M10 shall be given, which motion field represents the block-wise motion of picture B in relation to picture A.
  • a quantization parameter QP shall be given, the following operations being performed to obtain a residual picture H:
  • a picture P which represents a prediction of picture B is obtained by invoking the above-described method, picture A being used as the reference picture, the motion-field description M10 being used, and the quantization parameter QP being used as the input.
  • the residual picture H is generated by:
  • h[i, j] = b[i, j] - p[i, j], wherein
  • h[], b[] and p[] represent the luma or chroma samples of the pictures H, B and P, respectively.
  • the update step on the analysis side will be described below.
  • the input picture A, the residual picture H obtained in the prediction step, as well as the motion field M11 representing the block-wise motion of picture A in relation to picture B shall be given, the following operations being performed to obtain a picture L representing the temporal low-pass signal:
  • a picture P is produced by invoking the above-described process, the picture H, however, representing the reference picture, and the motion-field description M11 being used as the input.
  • the low-pass picture L is produced by the following equation: l[i, j] = a[i, j] + (1/2) * p[i, j].
  • on the synthesis side, the update step is inverted: a picture P is produced by invoking the above-described process, H being used as the reference picture, and the motion-field description M11 being used as the input.
  • the reconstructed picture A is produced by: a[i, j] = l[i, j] - (1/2) * p[i, j], wherein
  • a[], l[] and p[] represent the sample arrays of pictures A, L and P, respectively.
  • the prediction step on the synthesis side (decoder side) will be described below.
  • the residual picture H, the reconstructed picture A obtained in the update step, as well as the motion field M10 shall be given, the following operations being performed to obtain the reconstructed picture B:
  • a picture P representing a prediction of picture B is produced by invoking the above-described method, picture A being used as the reference picture, the motion-field description M10 being used, and the quantization parameter QP being used as the input. The reconstructed picture B is then produced by b[i, j] = h[i, j] + p[i, j].
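Taken together, the prediction and update steps on the analysis side and their inversions on the synthesis side can be sketched as follows. The motion-compensated prediction is abstracted into a placeholder mcp() (here simply zero motion), and the halving in the update step mirrors the unnormalized Haar lifting discussed further below; both are assumptions of this illustration, not of the scheme itself.

    import numpy as np

    def mcp(reference, motion_field=None):
        return reference   # placeholder for the H.264/AVC-style MC prediction

    def analysis(a, b, m10=None, m11=None):
        h = b - mcp(a, m10)           # prediction step: H = B - P
        l = a + 0.5 * mcp(h, m11)     # update step:     L = A + P/2
        return l, h

    def synthesis(l, h, m10=None, m11=None):
        a = l - 0.5 * mcp(h, m11)     # invert the update step
        b = h + mcp(a, m10)           # invert the prediction step
        return a, b

    a, b = np.random.rand(16, 16), np.random.rand(16, 16)
    a2, b2 = synthesis(*analysis(a, b))
    assert np.allclose(a, a2) and np.allclose(b, b2)   # perfect reconstruction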
  • a dyadic tree structure is obtained, which decomposes a group of 2^n pictures into (2^n - 1) residual pictures and a single low-pass (or intra) picture, as is shown for a group of 8 pictures with reference to Fig. 7.
  • Fig. 7 shows the high-pass picture HP1 of the first level at the output 22 of the filter of the first level, as well as the low-pass picture of the first level at the output 24 of the filter of the first level.
  • Both low-pass pictures TP2 of the second level at the output 16 of the filter of the second level, as well as both high-pass pictures processed by the second further-processing means 18 of Fig. 3, are shown as the second level in Fig. 7.
  • the low-pass pictures of the third level are present at the output 76 of the filter of the third level, whereas the high-pass pictures of the third level are present at the output 75 in a further-processed form.
  • the group of 8 pictures could comprise 8 original video pictures - in this case, the encoder of Fig. 3 would be employed without a fourth filter level. If, on the other hand, the group of 8 pictures is a group of 8 low-pass pictures, as are present at the output 73 of the filter of the fourth level, the inventive encoder may likewise be employed.
  • (2^(n+1) - 2) motion-field descriptions, (2^n - 1) residual pictures as well as a single low-pass (or intra) picture are transmitted for a group of 2^n pictures.
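These counts follow from the fact that the dyadic tree performs 2^n - 1 two-channel analysis steps, each producing one residual picture and consuming one forward and one backward motion field. A quick numeric check for the group of 8 pictures of Fig. 7 (n = 3):

    n = 3                                  # group of 2**3 = 8 pictures
    analysis_steps = 2 ** n - 1            # 4 + 2 + 1 = 7
    residual_pictures = analysis_steps     # one per analysis step -> 7
    motion_fields = 2 * analysis_steps     # M10 and M11 each -> 14 = 2**(n+1) - 2
    low_pass_pictures = 1
    print(residual_pictures, motion_fields, low_pass_pictures)   # 7 14 1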
  • the motion-vector predictors are derived as specified in the standard.
  • the residual pictures are coded using a subset of the H.264/AVC slice layer syntax comprising the following syntax elements:
  • the low-pass pictures are coded as P pictures while using reconstructed low-pass pictures of previous picture groups as references.
  • Intra (IDR) pictures are employed at regular intervals to provide random access points.
  • the low-pass pictures are decoded and reconstructed as is specified in the standard, to be precise using the deblocking filter operation.
  • the present invention provides SNR scalability.
  • the structure in the form of an open loop rather than a closed loop of the subband approach provides the possibility of accommodating SNR scalability in a manner which is efficient and conforms to the standard to a great extent.
  • What is achieved is an SNR-scalable extension wherein the base layer is coded as has been described above, and wherein the extension layers consist of improvement pictures for the subband signals which themselves are again coded as is provided for the residual-picture syntax.
  • reconstruction error pictures are produced between the original subband pictures produced by the analysis filterbank, and the reconstructed subband pictures produced after decoding the base layer or a previous extension layer. These reconstruction error pictures are quantized and coded using a quantization parameter which is smaller than in the base scaling layer or in one or several previous extension scaling layers, to be precise using the residual-picture syntax set forth above.
  • the subband representation of the base layer, and the improvement signals of various extension layers, may be decoded independently of one another, the eventual extension-layer subband representation being obtained by adding up the base-layer reconstruction and the reconstructed improvement signals of the extension layers for all temporal subbands.
  • the fundamental two-channel analysis step may be normalized by multiplying the samples of the high-pass picture H by a factor of 1/sqrt(2) and by multiplying the samples of the low-pass picture L by a factor of sqrt(2).
  • this normalization is neglected in the realization of the analysis filterbank and the synthesis filterbank so as to keep the range of the samples nearly constant.
  • this normalization is taken into account during the quantization of the temporal subbands.
  • this may be readily achieved by quantizing the low-pass signal with half of the quantization step size used for quantizing the high-pass signal.
  • Let QP_L(n) be the quantization parameter used for coding the low-pass picture obtained after the nth decomposition stage.
  • the quantization parameters used for coding the high-pass pictures obtained after the ith decomposition stage are calculated as follows:
  • QP_H(i) = QP_L(n) + 3 * (n - i + 2)
  • the quantization parameter QP_INTRA(i) used for quantizing the intra prediction signals of the motion-field description M_(i-1)0 which are used in the ith decomposition stage is derived from the quantization parameter QP_H(i) for the high-pass pictures produced in this decomposition stage, by means of the following equation:
  • QP_INTRA(i) = QP_H(i) - 6
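Both equations can be checked with a few lines of code. The sketch assumes the convention stated above, i.e. that an increase of 6 in QP doubles the step size, so that the +3 per stage corresponds to the factor sqrt(2) of the neglected normalization; the -6 offset for the intra QP is taken from the equation above as reconstructed and should be read with the same caution.

    def qp_high(qp_low_n, n, i):
        # QP for the high-pass pictures of the ith decomposition stage
        return qp_low_n + 3 * (n - i + 2)

    def qp_intra(qp_low_n, n, i):
        # assumed offset of -6, i.e. half the high-pass step size
        return qp_high(qp_low_n, n, i) - 6

    n, qp_l = 4, 24
    for i in range(1, n + 1):
        print(i, qp_high(qp_l, n, i), qp_intra(qp_l, n, i))
    # the high-pass QP decreases by 3 from each stage to the next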
  • the motion-field descriptions M10 and M11 used in the prediction step, on the one hand, and in the update step, on the other hand, are preferably estimated independently of one another.
  • the process of estimating the motion-field description M10 used in the prediction step will be described below.
  • the process of estimating M11 is obtained by exchanging the original pictures A and B and by eliminating the INTRA mode from the set of possible macroblock modes.
  • Pictures A and B shall be given, which are either original pictures or are pictures representing low-pass signals that have been produced in a previous analysis stage.
  • the corresponding arrays of luma samples a[] and b[] are provided.
  • the motion description M10 is estimated as follows in a macroblock-wise manner: for all possible macroblock and sub-macroblock partitions of a macroblock i within picture B, the associated motion vectors m_i are determined by minimizing the Lagrangian cost functional m_i = argmin over m ∈ S of ( Σ over (x, y) ∈ P of |b[x, y] - a[x - m_x, y - m_y]| + λ * R(i, m) ), wherein:
  • S specifies the motion-vector search range within the reference picture A.
  • P is the area covered by the macroblock partition or sub-macroblock partition considered.
  • R(i, m) specifies the number of bits required to transmit all components of the motion vector m, and λ is a fixed Lagrangian multiplier.
  • the motion search initially proceeds across all motion vectors in the given search range S which have an integer-sample accuracy. Using the best integer motion vector, the 8 surrounding half-sample accurate motion vectors are then tested. Eventually, the 8 surrounding quarter-sample accurate motion vectors are tested around the best half-sample accurate motion vector. For the half- and quarter-sample accurate motion-vector refinements, the term a[x - m_x, y - m_y] is to be interpreted as the output of the interpolation operator applied to the reference picture.
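The three-stage search can be sketched as follows. The cost below uses only the SAD term and rounds fractional positions to integer-pel instead of interpolating, and the term λ * R(i, m) is omitted; these are simplifications for illustration only.

    import numpy as np

    def sad(block, ref, pos):
        h, w = block.shape
        y = int(round(pos[0])); y = max(0, min(y, ref.shape[0] - h))
        x = int(round(pos[1])); x = max(0, min(x, ref.shape[1] - w))
        return float(np.abs(block - ref[y:y + h, x:x + w]).sum())

    def refine(block, ref, best, step):
        # test `best` and its 8 neighbours at the given sub-sample step
        cands = [(best[0] + dy * step, best[1] + dx * step)
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        return min(cands, key=lambda m: sad(block, ref, m))

    def motion_search(block, ref, s):
        cands = [(dy, dx) for dy in range(-s, s + 1) for dx in range(-s, s + 1)]
        best = min(cands, key=lambda m: sad(block, ref, m))   # integer-sample
        best = refine(block, ref, best, 0.5)                  # half-sample
        return refine(block, ref, best, 0.25)                 # quarter-sample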
  • the mode decision for the macroblock mode and the sub- macroblock mode is principally made in line with the same approach.
  • the mode p_i which minimizes the following Lagrangian functional is selected from a given set S_mode of possible macroblock or sub-macroblock modes: p_i = argmin over p ∈ S_mode of ( Σ over (x, y) of |b[x, y] - a[x - m_x[p, x, y], y - m_y[p, x, y]]| + λ * R(i, p) ), wherein:
  • m[p, x, y] = [m_x[p, x, y], m_y[p, x, y]] is the motion vector associated with the macroblock or sub-macroblock mode p and with the macroblock partition or sub-macroblock partition which includes the luma position (x, y).
  • the rate term R(i, p) represents the number of bits associated with the selection of the coding mode p.
  • For the motion-compensated coding modes, the rate for coding mode p includes the bits for the macroblock mode (if applicable), the sub-macroblock mode(s) (if applicable) and the motion vector(s).
  • For the intra mode, the rate for coding mode p includes the bits for the macroblock mode and for the arrays of quantized luma and chroma transform coefficient levels.
  • the set of possible sub-macroblock modes is given by {P_8x8, P_8x4, P_4x8, P_4x4}.
  • the set of possible macroblock modes is given by {P_16x16, P_16x8, P_8x16, P_8x8, INTRA}, the INTRA mode being used only if a motion-field description M10 used for the prediction step is estimated.
  • the Lagrangian multiplier λ is set depending on the base-layer quantization parameter QP_H(i) for the high-pass picture(s) of the decomposition stage for which the motion field is estimated.
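A compact sketch of the resulting mode decision: the mode minimizing D(p) + λ * R(p) is selected. The cost tables below are invented toy numbers; in the scheme described above, D would be the motion-compensated SAD and R the bits for the modes and motion vectors or, in the intra case, for the coefficient levels.

    def mode_decision(modes, distortion, rate, lam):
        return min(modes, key=lambda p: distortion[p] + lam * rate[p])

    D = {'P_16x16': 900, 'P_16x8': 850, 'P_8x16': 860, 'P_8x8': 800, 'INTRA': 1200}
    R = {'P_16x16': 20, 'P_16x8': 35, 'P_8x16': 35, 'P_8x8': 70, 'INTRA': 150}
    print(mode_decision(D, D, R, lam=4.0))    # -> 'P_16x16' for these toy numbers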
  • the fundamental two-channel analysis filterbank decomposes two input pictures A, B into a low-pass picture L and a high-pass picture H.
  • the low-pass picture L uses the coordinate system of the original picture A.
  • pictures A and L are identical.
  • the decomposition structure set forth in Fig. 4 is obtained if, in all decomposition stages, the even input pictures are treated as input pictures A at temporal sampling positions 0, 2, 4, ..., and the odd input pictures are treated as input pictures B at sampling positions 1, 3, 5, ...
  • This scheme enables optimum temporal scalability.
  • the temporal distance between the pictures decomposed in each two-channel analysis filterbank is increased by a factor of 2 from one decomposition stage to the next.
  • the efficiency of the motion-compensated prediction decreases as the temporal distance between the reference picture and the picture to be predicted increases.
  • the decomposition scheme depicted in Fig. 8 is used, of which it is assumed that it enables a reasonable compromise between temporal scalability and coding efficiency.
  • the sequence of the original pictures is treated as a sequence of input pictures A, B, A, B, A, B, ... A, B.
  • this scheme provides a first stage with optimum temporal scalability (equal distances between the low-pass pictures), whereas the sequences of low-pass pictures used as input signals into all following decomposition stages are treated as sequences of input pictures B, A, A, B, B, A, ..., A, B, whereby the distances between the low-pass pictures being decomposed are kept small in the following two-channel analysis steps, as is shown in Fig. 8.
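Read literally from the text, the role assignment of Fig. 8 can be written down as a small helper. This is merely a transcription of the A/B patterns stated above, not a normative part of the scheme:

    def roles(num_pictures, first_stage):
        if first_stage:
            # original sequence: A, B, A, B, ...
            return ['A' if i % 2 == 0 else 'B' for i in range(num_pictures)]
        out = []
        for pair in range(num_pictures // 2):
            # low-pass sequences: B, A, A, B, B, A, ..., A, B
            out += ['B', 'A'] if pair % 2 == 0 else ['A', 'B']
        return out

    print(roles(8, first_stage=True))    # ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
    print(roles(8, first_stage=False))   # ['B', 'A', 'A', 'B', 'B', 'A', 'A', 'B']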
  • the motion fields M10 and M11 used in the prediction step and the update step, respectively, are estimated and coded independently of one another. This may increase the bit rate required for transmitting the motion parameters, and it may also adversely affect the connection between these two motion fields, which may have an important impact on the coding efficiency of the subband approach. Therefore it is preferred, for the purpose of improving the coding efficiency, for the motion fields M11 used in the update step not to be estimated and coded independently, but to be derived from the motion fields M10 calculated in the prediction steps, such that they still represent a block-wise motion compatible with the H.264/AVC specification. As an additional side effect, this also reduces the complexity required for the update step.
  • the current analysis/synthesis structure represents a lifting representation of the simple Haar filters in a preferred embodiment of the present invention.
  • this scheme is extended to include a lifting representation with biorthogonal 5/3 filters, which leads to a bidirectional motion-compensated prediction.
  • the most preferred approach is to adaptively switch between the lifting representations of the Haar filters and the 5/3 filters on a block basis, it preferably being possible to also employ the motion-compensated prediction as is specified for B slices in H.264/AVC.
  • a bit-allocation algorithm which reduces the SNR variations that may occur within a group of pictures with certain test sequences is preferred.
  • the inventive method for coding, decoding, filtering and/or inverse filtering may be implemented in hardware or in software.
  • the implementation may be effected on a digital storage medium, in particular a disc or CD with electronically readable control signals, which may cooperate with a programmable computer system such that the method is performed.
  • the invention thus also consists in a computer-program product having a program code, stored on a machine-readable carrier, for performing the inventive method when the computer-program product runs on a computer.
  • the invention may thus be realized as a computer program having a program code for performing the method when the computer program runs on a computer.

Abstract

For coding a group of successive pictures, a filterbank decomposition is used which comprises a second filter level (12) for producing a first high-pass picture of the second level and a first low-pass picture of the second level from the first and second pictures of the group of pictures, and for producing a second high-pass picture of the second level and a second low-pass picture of the second level from the third and fourth pictures of the original group of pictures. The encoder further includes a first filter level (20) for producing a first high-pass picture of the first level and a first low-pass picture of the first level from the first and second low-pass pictures of the second level, as well as further-processing means (26) for further processing the first high-pass picture of the first level and the first low-pass picture of the first level to obtain a coded picture signal, the first further-processing means including a quantizer which has a quantizer step size. The coded picture signal represents a base scaling layer with regard to temporal scaling. The group-wise picture processing and decomposition into high-pass and low-pass signals further enables SNR scalability implementation.

Description

APPARATUS AND METHOD FOR CODING A GROUP OF SUCCESSIVE PICTURES, AND APPARATUS AND METHOD FOR DECODING A CODED PICTURE SIGNAL
Description
The present invention relates to video coding/decoding algorithms and, in particular, to video coding/decoding algorithms in line with the international standard ISO/IEC 14496-10, this standard also being referred to as H.264/AVC.
The H.264/AVC standard is the result of a video standardization project of the ITU-T Video Coding Experts Group (VCEG) and of the ISO/IEC Moving Picture Experts Group (MPEG). The main goals of this standardization project are to provide a clear video coding concept exhibiting very good compression performance and, at the same time, to create a network-friendly video representation covering both applications having a conversational character, such as video telephony, and applications having no conversational character (storage, broadcasting, stream transmission).
In addition to the above-cited ISO/IEC 14496-10 standard, there is also a multiplicity of publications relating to the standard. Reference shall be made, by way of example only, to "The Emerging H.264/AVC Standard", Ralf Schäfer, Thomas Wiegand and Heiko Schwarz, EBU Technical Review, January 2003. In addition, the specialist publication "Overview of the H.264/AVC Video Coding Standard", Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard and Ajay Luthra, IEEE Transactions on Circuits and Systems for Video Technology, July 2003, as well as the specialist publication "Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard", Detlev Marpe, Heiko Schwarz and Thomas Wiegand, IEEE Transactions on Circuits and Systems for Video Technology, September 2003, include a detailed overview of different aspects of the video coding standard.
To facilitate understanding, however, an overview of the video coding/decoding algorithm will be given below with reference to Figs. 9 to 11.
Fig. 9 shows the complete architecture of a video encoder, generally consisting of two different stages. Generally speaking, the first stage, which principally operates in relation to the video content, generates output data which will eventually be subjected to entropy coding by a second stage, designated by 80 in Fig. 9. This data comprises control data 81a, quantized transform coefficients 81b as well as motion data 81c, these data 81a, 81b, 81c being supplied to the entropy encoder so as to generate a coded video signal at the output of the entropy encoder 80.
Specifically, the input video signal is partitioned, i.e. split up, into macroblocks, each macroblock comprising 16 x 16 pixels. Subsequently, the association of the macroblocks with slice groups and slices is selected, whereupon each macroblock of each slice is processed by the network of operation blocks shown in Fig. 9. It shall be pointed out that macroblocks may be processed efficiently in parallel if there are different slices in a video picture. The association of the macroblocks with slice groups and slices is effected by means of an encoder control block 82 in Fig. 9. The various slices are defined as follows:
I slice: the I slice is a slice wherein all macroblocks of the slice are coded using intra prediction.
P slice: in addition to the coding type of the I slice, certain macroblocks of the P slice may also be coded using inter prediction with at least one motion-compensated prediction signal per prediction block.

B slice: in addition to the coding types available in the P slice, certain macroblocks of the B slice may also be coded using inter prediction with two motion-compensated prediction signals per prediction block.
The three above coding types are very similar to those of previous standards, with the exception of the use of reference pictures, as will be described below. The following two coding types for slices are new in the H.264/AVC standard:
SP slice: the SP slice is also referred to as a switching P slice; it is coded such that efficient switching between different precoded pictures becomes possible.
SI slice: the SI slice is also referred to as a switching I slice; it enables an exact match with the macroblocks of an SP slice for direct random access and for the purposes of error recovery.
On the whole, slices are a sequence of macroblocks which are processed in the order of a raster scan, unless the flexible macroblock ordering (FMO) property, also defined in the standard, is used. A picture may be partitioned into one or several slices, as is shown in Fig. 11. Thus, a picture is a collection of one or several slices. In this sense, slices are independent of each other, since their syntax elements may be analyzed (parsed) from the bit stream, and the values of the samples in that area of the picture which is represented by the slice may be decoded correctly without any data from other slices, provided that the reference pictures used are identical both in the encoder and in the decoder. However, certain information from other slices may be required for applying the deblocking filter beyond slice boundaries. The FMO property modifies the manner of partitioning pictures into slices and macroblocks by applying the concept of slice groups. Each slice group is a set of macroblocks defined by a macroblock-to-slice-group map specified by the content of a picture parameter set and by specific information from slice headers. This macroblock-to-slice-group map consists of a slice-group identification number for each macroblock in the picture, specifying to which slice group the associated macroblock belongs. Each slice group may be partitioned into one or several slices, so that a slice comprises a sequence of macroblocks within the same slice group which is processed in the order of a raster scan within the set of macroblocks of a specific slice group.
Each macroblock may be transferred in one of several coding types, depending on the slice coding type. In all slice coding types, the following types of intra coding, referred to as intra_4x4 or intra_16x16, are supported, a chroma prediction mode and an I_PCM prediction mode also being supported.
The intra_4x4 mode is based on the separate prediction of each 4x4 luma block and is suitable for coding parts of a picture having significant detail. The intra_16x16 mode, on the other hand, performs a prediction of the entire 16x16 luma block and is thus more suitable for coding "soft" areas of a picture.
In addition to these two luma prediction types, a separate chroma prediction is performed. As an alternative to intra_4x4 and intra_16x16, the I_PCM coding type allows the encoder to simply skip the prediction as well as the transform coding and, instead, to directly transfer the values of the coded samples. The I_PCM mode serves the following purposes: it enables the encoder to represent the values of the samples precisely; it provides a manner of representing the values of very anomalous picture content without data magnification; and it enables a hard limit to be specified for the number of bits an encoder must handle for a macroblock without negatively impacting coding efficiency.
Unlike previous video coding standards (e.g. H.263+ and MPEG-4 Visual), where intra prediction has been performed in the transform domain, intra prediction in H.264/AVC is always performed in the spatial domain, to be precise by referring to adjacent samples of previously coded blocks located to the left of or above the block to be predicted (Fig. 10). In certain environments where transmission errors occur, this may lead to error propagation, this error propagation occurring in inter-coded macroblocks due to the motion compensation. Therefore, a constrained intra-coding mode may be signaled which enables prediction only from intra-coded adjacent macroblocks.
If the intra_4x4 mode is used, each 4x4 block is predicted from spatially adjacent samples. The 16 samples of the 4x4 block are predicted using the previously decoded samples in adjacent blocks. For each 4x4 block, one of 9 prediction modes may be used. In addition to the DC prediction (wherein one value is used to predict the entire 4x4 block), 8 directional prediction modes are specified. These modes are suitable for predicting directional structures in a picture, such as edges at various angles.
In addition to the intra-macroblock coding types, different predictive or motion-compensated coding types are specified as P macroblock types. Each P macroblock type corresponds to a specific partition of the macroblock into the block forms used for motion-compensated prediction. Partitions with luma block sizes of 16x16, 16x8, 8x16 and 8x8 samples are supported by the syntax. In the event of partitions of 8x8 samples, an additional syntax element is transmitted for each 8x8 partition. This syntax element specifies whether the corresponding 8x8 partition is further partitioned into partitions of 8x4, 4x8 or 4x4 luma samples and corresponding chroma samples.
The prediction signal for each predictively coded MxM luma block is obtained by shifting an area of the corresponding reference picture, which area is specified by a translational motion vector and a picture reference index. If, thus, a macroblock is coded using four 8x8 partitions, and if each 8x8 partition is further partitioned into four 4x4 partitions, a maximum of 16 motion vectors may be transmitted for a single P macroblock within the framework of the so-called motion field.
The slice quantization parameter (QP) is used to specify the quantization of the transform coefficients in H.264/AVC. The parameter may take on 52 values. These values are arranged such that an increase of 1 in the quantization parameter signifies an increase in the quantization step size by about 12 %. This means that an increase in the quantization parameter by 6 entails an increase in the quantizer step size by a factor of precisely 2. It shall be pointed out that a change in the step size by about 12 % also signifies a change in the bit rate of about 12 %.
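Numerically, these statements correspond to the usual exponential model of the step size, Qstep(QP) = Qstep(0) * 2^(QP/6); taking Qstep(0) = 0.625, as in H.264/AVC, a quick check reads:

    def qstep(qp, q0=0.625):
        return q0 * 2 ** (qp / 6)

    print(qstep(1) / qstep(0))     # ~1.122, i.e. about 12 % per QP increment
    print(qstep(30) / qstep(24))   # exactly 2.0 for an increase of 6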
The quantized transform coefficients of a block are generally scanned in a zigzag path and processed further using entropy coding methods. The 2x2 DC coefficients of the chroma components are scanned in raster-scan order, and all inverse transform operations within H.264/AVC may be implemented using only additions and shift operations on 16-bit integer values. Similarly, only 16-bit memory accesses are required for a good implementation of the forward transform and of the quantization method in the encoder.
With reference to Fig. 9, the input pictures of a video sequence are initially partitioned, picture by picture, into macroblocks of 16x16 pixels. Each picture is then fed to a subtracter 84 which subtracts, from the original picture, a picture provided by a decoder 85 contained in the encoder. The subtraction result, i.e. the residual signal in the spatial domain, is then transformed, scaled and quantized (block 86) to obtain the quantized transform coefficients on line 81b. To generate the subtraction signal fed into the subtracter 84, the quantized transform coefficients are initially scaled again and inversely transformed (block 87) so as to be fed to an adder 88, the output of which feeds the deblocking filter 89, it being possible to monitor, for control purposes, the output video signal as it will be decoded, e.g., by a decoder, at the output of the deblocking filter (output 90).
Using the decoded output signal at output 90, a motion estimation is then performed in a block 91. For performing the motion estimation in block 91, a picture of the original input video signal is supplied, as may be seen from Fig. 9. The standard allows two different motion estimations, i.e. a forward motion estimation and a backward motion estimation. Forward motion estimation involves estimating the motion of the current picture with regard to previous pictures, whereas backward motion estimation involves estimating the motion of a previous picture using the current picture. The results of the motion estimation (block 91) are fed to a motion-compensation block 92 which performs a motion-compensated inter prediction, particularly when a switch 93 is set to the inter prediction mode, as is the case in Fig. 9. If, however, the switch 93 is set to intra-frame prediction, an intra-frame prediction is performed using a block 490. The motion data are not needed for this purpose, since no motion compensation is performed for an intra-frame prediction.
The motion estimation block 91 generates motion data and/or motion fields, these motion data and/or motion fields, which consist of motion vectors, being transmitted to the decoder so that a corresponding inverse prediction, i.e. reconstruction, may be performed using the transform coefficients and the motion data. It shall be pointed out that, in the event of a forward prediction, the motion vector may be calculated from the immediately preceding picture and/or also from several preceding pictures. In addition, it shall be pointed out that, in the event of a backward prediction, a current picture may be calculated using the immediately adjoining future picture and, of course, further future pictures.
The video coding concept represented in Fig. 9 has the disadvantage that it offers no simple possibility of scalability. As is known from the art, the expression "scalability" is used to refer to an encoder/decoder concept wherein the encoder provides a scaled data stream. The scaled data stream includes a base scaling layer, as well as one or several extension scaling layers. The base scaling layer includes a representation of the signal to be coded, which is, generally speaking, of lower quality, but also has a lower data rate. The extension scaling layer contains a further representation of the video signal, which representation typically provides, along with the representation of the video signal in the base scaling layer, a representation of improved quality with regard to the base scaling layer. By contrast, the extension scaling layer has a bit demand of its own, of course, so that the number of bits required for representing the signal to be coded increases with each extension layer.
Depending on the implementation and/or on the possibilities, a decoder will decode only the base scaling layer so as to provide a comparatively poor-quality representation of the picture signal represented by the coded signal. With each "addition" of a further scaling layer, however, the decoder may improve the quality of the signal step by step (at the expense of the bit rate and of the delay). Depending on the implementation and the transmission channel from an encoder to a decoder, at least the base scaling layer is always transmitted, since the bit rate of the base scaling layer is typically so low that even a limited transmission channel will be sufficient. If the transmission channel does not provide a larger bandwidth for the application, only the base scaling layer will be transmitted, but no extension scaling layer. Consequently, the decoder is only able to produce a low-quality representation of the picture signal. Compared to the unscaled case, wherein the data rate would have been so high that a transmission by the transmission system would not have been possible at all, the low-quality representation is advantageous. If the transmission channel allows the transmission of one or several extension layers, the encoder will also transmit one or several extension layers to the decoder, so that the latter may increase the quality of the video signal output step by step, depending on the request.
With regard to the coding of video sequences, there are two different scalings. One scaling is temporal in the sense that, e.g., not all individual video pictures of a video sequence are transmitted, but that - in order to reduce the data rate - e.g., only one in two, one in three, one in four etc. pictures is transmitted.
The other scaling is SNR scalability (SNR = signal-to-noise ratio), wherein each scaling layer, i.e. both the base scaling layer and the first, second, third, ... extension scaling layers, includes all temporal information, but with differing quality. Thus, even though the base scaling layer has a lower data rate, it also has a lower signal-to-noise ratio, it being possible to improve this signal-to-noise ratio step by step by adding one extension scaling layer at a time. The encoder concept depicted in Fig. 9 is problematic in that it is based on the fact that only residual values are produced by the subtracter 84 and are then processed further. These residual values are calculated by means of prediction algorithms, to be precise in the arrangement shown in Fig. 9, which forms a closed loop using blocks 86, 87, 88, 89, 92, 93, 94 and 84, a quantization parameter entering the closed loop in blocks 86, 87. If simple SNR scalability were implemented to the effect that, e.g., every predicted residual signal were initially quantized with a coarse quantizer step size and then refined step by step using extension layers with finer quantization step sizes, this would have the following consequence: since the inverse quantization and the prediction, in particular the motion estimation (block 91) and the motion compensation (block 92), operate on the original picture, on the one hand, and on the quantized picture, on the other hand, the quantization step sizes in the encoder and in the decoder "drift apart". This renders the generation of the extension scaling layers on the encoder side very problematic. In addition, processing of the extension scaling layers on the decoder side becomes impossible, at least with regard to the elements defined in the H.264/AVC standard. The reason for this is the closed loop in the video encoder, depicted with reference to Fig. 9, which contains the quantization.
The encoder/decoder concept currently standardized thus is not very flexible with regard to scalability considerations .
It is the object of the present invention to provide a more flexible concept for coding/decoding picture signals.
This object is achieved by an apparatus for coding a group of successive pictures as claimed in claim 1, by an apparatus for decoding a coded signal as claimed in claim 17, by a method for coding a group of successive pictures as claimed in claim 20, by a method for decoding a picture signal as claimed in claim 21, by a filter device as claimed in claim 22, by an inverse filter device as claimed in claim 23, by a method for filtering as claimed in claim 24, by a method for inverse filtering as claimed in claim 25, or by a computer program as claimed in claim 26.
The present invention is based on the finding that the closed loop in the video encoder, which is problematic with regard to scalability considerations, may be broken up by moving away from processing pictures on a picture-by-picture basis and performing group-wise processing of pictures instead. The group of pictures is decomposed into high-pass pictures and low-pass pictures, respectively, using a lifting scheme comprising several filter levels, so that it is no longer residual signals of individual pictures that are transformed, scaled and quantized, as in the prior art, but high-pass pictures and/or low-pass pictures, which are then subjected to entropy coding. Thus, a subband decomposition, preferably implemented as a wavelet filter, is provided upstream, as it were, in the video encoder, the filter decomposition breaking up the closed loop of the standardized coding scheme, so that temporal or SNR scalability may readily be implemented. Further processing, i.e. transformation, scaling, quantization and entropy coding, is no longer performed on residual signals in a spatial respect, but on residual signals, i.e. high-pass signals, in a temporal respect, since the high-pass/low-pass filtering across the various filter levels is performed in a temporal respect, i.e. across the group of pictures.
Thus, the present invention preferably provides an SNR-scalable extension of the H.264/AVC video standard. For this purpose, i.e. to obtain an efficient SNR-scalable bit stream representation of a video sequence, the temporal interdependence between pictures is coded using a subband approach operating with an open loop, i.e. without the problematic closed loop. In this encoder/decoder scenario, most components of H.264/AVC are used as specified in the standard, only few changes being required with regard to the subband-coder structure.
In a preferred embodiment of the present invention, each filter level includes a backward predictor, on the one hand, and a forward predictor, on the other hand. The backward predictor performs a backward motion compensation. The forward predictor performs a forward motion compensation. The output signal of the backward predictor is subtracted from a picture of the original group of pictures so as to obtain a high-pass picture. The high-pass picture is fed to the forward predictor as an input signal so as to obtain, on the output side, a forward-predicted signal which is added to the picture signal representing the other picture, so as to obtain a low-pass picture which will then be decomposed again into a high-pass signal and a low-pass signal by means of a lower filter level, the low-pass signal including the similarities of the two pictures considered, whereas the high-pass signal includes the differences of the two pictures considered.
A filter level requires at least the processing of a group of two pictures. If several filter levels exist, a grouping of, for example, four pictures is required if two filter levels are used. If there are three filter levels, a grouping of eight pictures is required. If four filter levels exist, 16 pictures should be grouped into a group of pictures and processed together. It shall be pointed out that the group of pictures may be selected to have any size desired, but that a grouping for at least two filter levels should include at least four pictures. Depending on the application, large group sizes are preferred, it being necessary, however, in this case, to have correspondingly larger subband filterbanks for decomposition on the encoder side and for composition on the decoder side. Thus, in accordance with the invention, temporal subband coding of the video sequences is performed before the actual video coding in accordance with the H.264/AVC standard, so that quantization using a quantization step size is taken out of the closed loop shown in Fig. 9, to the effect that simple SNR scalability may now be achieved. Individual scaling layers may now readily be produced using various individual quantizer step sizes. The inventive decomposition is based on the lifting representation of a filterbank. This lifting representation of the temporal subband decomposition allows the use of known methods for motion-compensated prediction. In addition, most other components of a hybrid video encoder, such as H.264/AVC, may be utilized without modification, while only few parts need to be changed.
Preferred embodiments of the present invention will be described below in more detail with reference to the accompanying figures, wherein:
Fig. 1 shows a block diagram of an inventive encoder for coding a group of successive pictures;
Fig. 2 depicts a block diagram of an inventive decoder for decoding a picture signal;
Fig. 3 shows a block diagram of an inventive encoder/decoder structure in accordance with a preferred embodiment of the present invention, having four levels;
Fig. 4 shows a block diagram for illustrating the lifting decomposition of a temporal subband filterbank;
Fig. 5 shows a block diagram for illustrating time scaling;
Fig. 6 shows a block diagram for illustrating SNR scaling;
Fig. 7 depicts an overview diagram for illustrating the temporal decomposition of a group of, e.g., eight pictures;
Fig. 8 shows a preferred temporal placement of low-pass pictures for a group of 16 pictures;
Fig. 9 shows an overview block diagram for illustrating the fundamental coding structure for an encoder in accordance with the H.264/AVC standard for a macroblock;
Fig. 10 shows a context arrangement consisting of two adjacent pixel elements A and B to the left-hand side and/or above a current syntax element C; and
Fig. 11 depicts a representation of the partition of a picture into slices.
Fig. 1 shows an apparatus for coding a group of successive pictures, the group comprising at least first, second, third and fourth pictures. The group of pictures is fed into an input 10 of a second filter level 12. The second filter level is configured to produce, from the group of pictures, high-pass pictures at a first output 14 and low-pass pictures at a second output 16 of the filter level 12. Specifically, the second filter level 12 is configured to produce, from the first and second pictures of the group of pictures, a first high-pass picture of the second level and a first low-pass picture of the second level, and to produce, from the third and fourth pictures of the group of pictures, a second high-pass picture of the second level and a second low-pass picture of the second level. In the case described by way of example, wherein the group of pictures comprises four pictures, the second filter level 12 thus produces two high-pass pictures of the second level at the high-pass output 14, and two low-pass pictures of the second level at the low-pass output 16. Optionally, the high-pass pictures of the second level are fed to second further-processing means 18. The second further-processing means are implemented to further process the first high-pass picture of the second level as well as the second high-pass picture of the second level, the second further-processing means 18 comprising a quantizer which has a quantization step size. By way of example, the second further-processing means 18 are configured to perform the functionalities of blocks 86, 80 of Fig. 9. The output signal of the second further-processing means, i.e. a quantized and, as the case may be, entropy-coded representation of the high-pass pictures of the second level, is written into the output-side bit stream in the event of SNR scalability.
In the event of temporal scalability, this signal already represents the first extension scaling layer. If an encoder produces only a base scaling layer, no further processing of the high-pass pictures will be necessary, which is why the connection between block 12 and block 18 is depicted as a dashed line.
Both low-pass pictures of the second level which are output at the low-pass output 16 of the second filter level are fed into an input of a first filter level 20. The first filter level 20 is configured to produce, from the first and second low-pass pictures of the second level, a first high-pass picture of the first level and a first low-pass picture of the first level. The first high-pass picture of the first level produced by filter level 20 is output at a high-pass output 22 of the first filter level. The first low-pass picture of the first level is output at a low-pass output 24 of the first filter level.
Both signals are fed to first further-processing means 26 for further processing the first high-pass picture of the first level and the first low-pass picture of the first level, so as to obtain a coded picture signal at an output 28, the first further-processing means 26 comprising, as has already been depicted by means of the second further-processing means 18, a quantizer which has a quantizer step size. Preferably, the quantizer step sizes of the second further-processing means 18 and of the first further-processing means 26 are identical.
The output signal of the first further-processing means 26 thus includes the high-pass picture of the first level and the low-pass picture of the first level and thus represents the base scaling layer for the purposes of temporal scalability.
For the purposes of SNR scalability, the coded picture signal at the output 28, along with the coded high-pass pictures of the second level at the output of the second further-processing means 18, represents the output signal, which is a base scaling layer.
Fig. 2 shows an apparatus for decoding a coded signal, to be precise for decoding that signal which is output as a coded picture signal at the output 28 of Fig. 1. As has been set forth, the coded picture signal at the output 28 includes a quantized and entropy-coded representation of the first high-pass picture and of the first low-pass picture of the first level. This information, i.e. the signal at the output 28, is fed into inverse further-processing means 30 performing inverse further processing using, and knowing, the quantization step size used by the first further-processing means 26 of Fig. 1, this inverse further processing including, for example, entropy decoding, inverse quantization as well as inverse transformation etc., as is known in the art and as is preset by an encoder and/or by the further-processing means used in the encoder.
A reconstructed version of the first high-pass picture as well as a reconstructed version of the first low-pass picture will then be present at an output of the inverse further-processing means 30. These two reconstructed versions of the first high-pass picture and of the first low-pass picture are fed into an inverse filtering level 32, the inverse filtering level 32 being configured to filter the reconstructed version of the first high-pass picture and the reconstructed version of the first low-pass picture in an inverse manner so as to obtain a reconstructed version of a first low-pass picture and of a second low-pass picture of a level which is one up, i.e. of the second level. In the case of temporal scalability, these reconstructed versions of the first and second low-pass pictures of both levels represent the base layer. In the case of SNR scalability, the output signal of the second further-processing means 18 of Fig. 1 is, in addition, also subjected to inverse further processing using the corresponding quantization step size, so as then to be fed, along with the reconstructed versions of the first and second low-pass pictures of the second level, into an inverse filtering level of a next order, as will be explained below in more detail, for example with reference to Fig. 3.
Wavelet-based video coding algorithms, wherein lifting implementations are employed for wavelet analysis and for wavelet synthesis, have been described in J.-R. Ohm, "Complexity and delay analysis of MCTF interframe wavelet structures", ISO/IEC JTC1/WG11 Doc. M8520, July 2002. Remarks on scalability may also be found in D. Taubman, "Successive refinement of video: fundamental issues, past efforts and new directions", Proc. of SPIE (VCIP'03), vol. 5150, pp. 649-663, 2003, the latter requiring, however, considerable changes to encoder structures. In accordance with the invention, by contrast, an encoder/decoder concept is achieved which provides the possibility of scalability, on the one hand, and which may be built, on the other hand, on elements conforming to the standard, in particular, e.g., for motion compensation.
Before giving a detailed description of an encoder/decoder structure with regard to Fig. 3, the fundamental lifting scheme on the side of the encoder and the corresponding inverse lifting scheme on the side of the decoder shall initially be depicted with regard to Fig. 4. Detailed explanations of the background of the combination of lifting schemes and wavelet transforms may be found in W. Sweldens, "A custom-design construction of biorthogonal wavelets", J. Appl. Comp. Harm. Anal., vol. 3 (no. 2), pp. 186-200, 1996, and I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps", J. Fourier Anal. Appl., vol. 4 (no. 3), pp. 247-269, 1998. Generally speaking, the lifting scheme consists of three steps: the polyphase decomposition step, the prediction step and the update step, as is represented by means of the encoder on the left-hand side of Fig. 4. The polyphase decomposition step is represented by a first area I, the prediction step by a second area II, and the update step by a third area III.
The decomposition stage includes partitioning the input-side data stream into an identical first copy for a lower branch 40a as well as an identical copy for an upper branch 40b. In addition, the identical copy of the upper branch 40b is delayed by one time stage (z^-1), so that a sample s[2k+1] having an odd-numbered index runs through a respective decimator, or downsampler, 42a, 42b at the same point in time as a sample s[2k] having an even-numbered index. The decimators 42a and 42b reduce the number of samples in the upper and lower branches 40b, 40a, respectively, by eliminating every other sample. The second area II, which relates to the prediction step, includes a prediction operator 43 as well as a subtracter 44. The third area, i.e. the update step, includes an update operator 45 as well as an adder 46. On the output side, there are also two normalizers 47, 48 for normalizing the high-pass signal h_k (normalizer 47) and for normalizing the low-pass signal l_k (normalizer 48).
Specifically, the polyphase decomposition results in the even and the odd samples of a given signal s[k] to be separated. Since the correlation structure typically shows a local characteristic, the even and odd polyphase components are highly correlated. Therefore, a prediction (P) of the odd samples is performed in a subsequent step using the even samples. The corresponding prediction operator (P) for each odd sample
s_odd[k] = s[2k + 1]

is a linear combination of the adjacent even samples

s_even[k] = s[2k].

As a result of the prediction step, the odd samples are replaced by their corresponding prediction residual values

h[k] = s_odd[k] - P(s_even)[k].
It shall be pointed out that the prediction step is equivalent to performing a high-pass filter of a two-channel fil- terbank, as is set forth in I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps", J. Fourier Anal. Appl., vol. 4 (no. 3), pp. 247 - 269, 1998. In the third step of the lifting scheme, a low-pass filtering is performed by replacing the even samples
by a linear combination of the prediction residual values h[k]. The corresponding update operator U is given by
U(h)[k] = Σ_l u_l * h[k + l].
By replacing the even samples by
l[k] = s_even[k] + U(h)[k]
the given signal s[k] may eventually be represented by l[k] and h[k], each of these signals, however, having half the number of samples. Since both the update step and the prediction step are fully invertible, the corresponding transform may be interpreted as a critically sampled perfect-reconstruction filterbank. It may indeed be demonstrated that any biorthogonal family of FIR filters may be realized by a sequence of one or several prediction steps and one or several update steps. For normalizing the low-pass and high-pass components, the normalizers 47 and 48 are supplied with suitably selected scaling factors F_l and F_h, as has been set forth.
The inverse lifting scheme corresponding to the synthesis filterbank is shown on the right-hand side of Fig. 4. It simply consists of applying the prediction and update operators in reverse order and with inverse signs, followed by the reconstruction using the even and odd polyphase components. In particular, the decoder shown on the right-hand side of Fig. 4 thus, again, includes a first decoder area I, a second decoder area II as well as a third decoder area III. The first decoder area reverses the action of the update operator 45. This is effected by feeding the high-pass signal, which has been normalized back by a further normalizer 50, to the update operator 45. The output signal of the decoder-side update operator 45 is then fed to a subtracter 52 rather than to the adder 46 of Fig. 4. The same procedure is followed for the output signal of the predictor 43, whose output signal is no longer fed to a subtracter, as on the encoder side, but is now fed to an adder 53. Then the signal is upsampled by a factor of 2 in each branch (blocks 54a, 54b). Thereupon, the upper branch is shifted towards the future by one sample, which is equivalent to delaying the lower branch, so as to then add the data streams of the upper branch and the lower branch in an adder 55 in order to obtain the reconstructed signal s_k at the output of the synthesis filterbank.
Different wavelets may be implemented by means of the predictor 43 and/or the updator 45. If the so-called Haar wavelet is to be implemented, the prediction operator and the update operator are given by the following equations:
P(s_even)[k] = s_even[k] and U(h)[k] = (1/2) * h[k],
such that
h[k] = s[2k + 1] - s[2k] and l[k] = s[2k] + (1/2) * h[k] = (1/2) * (s[2k] + s[2k + 1])
correspond to the unnormalized high-pass and low-pass analysis output signals of the Haar filter.
In the case of the 5/3 biorthogonal spline wavelet, the low-pass and high-pass analysis filters of this wavelet have 5 and 3 filter taps, respectively, the corresponding scaling function being a B-spline of order 2. In coding applications for still pictures (such as JPEG 2000), this wavelet is widely used. In a lifting environment, the corresponding prediction and update operators of the 5/3 transform are given by

P(s_even)[k] = (1/2) * (s_even[k] + s_even[k + 1]) and U(h)[k] = (1/4) * (h[k] + h[k - 1]).
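Both pairs of operators can be exercised with a generic sample-wise lifting routine. The sketch below uses simple border replication at the signal ends (an assumption of this example; the border handling is not specified here) and verifies that the synthesis inverts the analysis exactly:

    def lift_analysis(s, predict, update):
        even, odd = s[0::2], s[1::2]
        h = [o - predict(even, k) for k, o in enumerate(odd)]
        l = [e + update(h, k) for k, e in enumerate(even)]
        return l, h

    def lift_synthesis(l, h, predict, update):
        even = [e - update(h, k) for k, e in enumerate(l)]
        odd = [d + predict(even, k) for k, d in enumerate(h)]
        s = [0.0] * (len(even) + len(odd))
        s[0::2], s[1::2] = even, odd
        return s

    haar_p = lambda e, k: e[k]                                   # P(s_even)[k]
    haar_u = lambda h, k: h[k] / 2                               # U(h)[k]
    spl_p = lambda e, k: (e[k] + e[min(k + 1, len(e) - 1)]) / 2  # 5/3 prediction
    spl_u = lambda h, k: (h[k] + h[max(k - 1, 0)]) / 4           # 5/3 update

    s = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    for p, u in ((haar_p, haar_u), (spl_p, spl_u)):
        l, h = lift_analysis(s, p, u)
        assert lift_synthesis(l, h, p, u) == s   # lifting steps invert exactly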
Fig. 3 shows a block diagram of an inventive encoder/decoder structure with four exemplary filter levels both on the side of the encoder and on the side of the decoder. It may be seen from Fig. 3 that the first, second, third and fourth filter levels are identical with regard to the encoder. With regard to the decoder, the filter levels are also identical. On the encoder side, each filter level includes, as central elements, a backward predictor M10 (60) as well as a forward predictor M11 (61). In principle, the backward predictor 60 corresponds to the predictor 43 of Fig. 4, whereas the forward predictor 61 corresponds to the updator 45 of Fig. 4.
It shall be pointed out that, in contrast with Fig. 3, Fig. 4 relates to a stream of samples, wherein a sample has an odd index 2k+l, whereas another sample has an even index 2k. However, the notation given in Fig. 3 relates to a group of pictures rather than a group of samples, as has been set forth with reference to Fig. 1. If a picture has, e.g., a number of samples or pixels, this picture is fed in as a whole. Then the next picture is fed in, etc. Thus, there are no more odd and even samples, but odd and even pictures. In accordance with the invention, the lifting scheme described for odd and even samples is applied to odd and even pictures, respectively, each of which has a plurality of samples. The sample-wise predictor 43 of Fig. 4 now becomes the backward-motion compensation prediction 60, whereas the sample-by-sample updator 45 becomes the picture-by-picture forward-motion compensation prediction 61.
It shall be pointed out that the motion filters, which consist of motion vectors and which represent coefficients for blocks 60 and 61, are in each case calculated for two interrelated pictures and are transmitted as side information from the encoder to the decoder. However, an essential advantage of the inventive concept is the fact that the elements 91, 92, as have been described with reference to Fig. 9 and are standardized in the H.264/AVC standard, may readily be used to calculate both the motion fields Mi0 and the motion fields Mi1. Therefore, no new predictor/updator needs to be utilized for the inventive concept; rather, the algorithm which already exists, has been examined and checked for functionality and efficiency, and is specified in the video standard may be used for the motion compensation in the forward and backward directions.
In particular, the general structure of the filterbank used, which structure is represented in Fig. 3, shows a temporal decomposition of the video signal with a group of 16 pictures fed in at an input 64. The decomposition is a dyadic temporal decomposition of the video signal, the embodiment having four levels shown in Fig. 3 requiring 2^4 = 16 pictures, i.e. a group size of 16 pictures, to achieve the representation with the smallest temporal resolution, i.e. the signals at the outputs 28a and 28b. If, therefore, 16 pictures are grouped, this leads to a delay of 16 pictures, which renders the concept having four levels, shown in Fig. 3, rather problematic with regard to interactive applications. If, therefore, interactive applications are aimed at, it is preferred to form smaller groups of pictures, such as four or eight pictures. The delay is then reduced accordingly, so that utilization becomes possible for interactive applications as well. In cases where interactivity is not required, e.g. for storage purposes etc., the number of pictures in a group, i.e. the group size, may be increased accordingly, for example to 32, 64 etc. pictures.
In accordance with the invention it is preferred to utilize the inventive application of the Haar-based, motion-compensated lifting scheme which consists of a backward-motion compensation prediction (Mi0), as in H.264/AVC, and which further includes an update step comprising a forward-motion compensation (Mi1). Both the prediction step and the update step utilize the motion-compensation process as is represented in H.264/AVC. In addition, it is preferred to not only use the motion compensation, but to also use the deblocking filter referred to by reference numeral 89 in Fig. 9.
The second filter level, in turn, includes downsamplers 66a, 66b, a subtracter 69, a backward predictor 67, a forward predictor 68 as well as an adder 70 and, as has already been represented with regard to Fig. 1, the further-processing means 18, so that the first and second high-pass pictures of the second level are output at an output of the further-processing means 18, whereas the first and second low-pass pictures of the second level are output at the output of the adder 70.
The inventive encoder in Fig. 3 additionally includes a third level as well as a fourth level, a group of 16 pictures being fed into the input 64 of the fourth level. Eight high-pass pictures which have been quantized with a quantization parameter Q and which have been processed further accordingly are output at a high-pass output 72 of the fourth level, which output is also referred to as HP4. Accordingly, eight low-pass pictures are output at a low-pass output 73 of the fourth filter level, which eight low-pass pictures are fed into an input 74 of the third filter level. Again, this level is operative to produce four high-pass pictures at a high-pass output 75, also referred to as HP3, and to produce, at a low-pass output 76, four low-pass pictures which are fed into the input 10 of the second filter level and are decomposed, as has been set forth with reference to Figs. 3 and 1. It shall be pointed out, specifically, that the group of pictures processed by one filter level need not necessarily be video pictures stemming from an original video sequence, but may also be low-pass pictures having been output at a low-pass output of a filter level which is one up.
In addition, it shall be pointed out that the encoder concept designed for 16 pictures, which has been shown in Fig. 3, may readily be reduced to eight pictures by simply omitting the fourth filter level and feeding the group of pictures into the input 74. Similarly, the concept shown in Fig. 3 may also readily be expanded to a group of 32 pictures by adding a fifth filter level, by outputting the high-pass pictures, of which there will then be 16, at a high-pass output of the fifth filter level, and by feeding the 16 low-pass pictures present at the output of the fifth filter level into the input 64 of the fourth filter level.
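A minimal sketch of this dyadic cascade may clarify how the group size determines the number of filter levels and the number of high-pass pictures per level (hypothetical helper; motion compensation is deliberately omitted, so each level reduces to the picture-wise Haar lifting):

import numpy as np

def analyze_group(pictures):
    # Decomposes a group of 2**n pictures into one low-pass picture and
    # n lists of high-pass pictures (8, 4, 2 and 1 pictures for n = 4).
    highpass_levels = []
    current = [p.astype(np.float64) for p in pictures]
    while len(current) > 1:
        a, b = current[0::2], current[1::2]
        h = [bb - aa for aa, bb in zip(a, b)]        # prediction step
        l = [aa + 0.5 * hh for aa, hh in zip(a, h)]  # update step
        highpass_levels.append(h)
        current = l
    return current[0], highpass_levels

gop = [np.full((2, 2), float(i)) for i in range(16)]
low, hp = analyze_group(gop)
print([len(level) for level in hp])  # [8, 4, 2, 1], i.e. HP4, HP3, HP2, HP1 in Fig. 3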
The tree-like concept of the encoder side is also applied on the decoder side, but now it no longer goes from the higher level to the lower level, as on the encoder side, but, on the decoder side, from the lower level to the higher level. For this purpose, the data stream is received from a transmission medium schematically referred to as network abstraction layer 100, and the bit stream received is initially subjected to inverse further processing using the inverse further-processing means 30a, 30b, so as to obtain a reconstructed version of the first high-pass picture of the first level at the output of means 30a, and a reconstructed version of the low-pass picture of the first level at the output of block 30b of Fig. 3. By analogy with the right-hand half of Fig. 4, the forward-motion compensation prediction is then initially reversed by means of predictor 61 so as to subtract the output signal of predictor 61 from the reconstructed version of the low-pass signal (subtracter 101). The output signal of subtracter 101 is fed into a backward compensation predictor 60 so as to produce a prediction result which is added, in an adder 102, to the reconstructed version of the high-pass picture. Thereupon both signals, i.e. the signals in the upper and lower branches 103a, 103b, are brought to double the sampling rate, to be precise using the upsamplers 104a, 104b, the signal on the upper branch then being delayed and/or "accelerated", depending on the implementation. It shall be pointed out that the upsampling is performed by blocks 104a, 104b simply by inserting a number of zeros which corresponds to the number of samples for one picture. The shift by the delay of a picture, effected by the element shown as z^-1 in the upper branch 103b in relation to the lower branch 103a, has the effect that the addition by an adder 106 results in both low-pass pictures of the second level being present in succession on the output side of the adder 106.
The reconstructed versions of the first and second low-pass pictures of the second level are then fed into the decoder-side inverse filter of the second level, where they are combined, along with the transmitted high-pass pictures of the second level, again by the identical implementation of the inverse filterbank, so as to have a sequence of four low-pass pictures of the third level at an output 108 of the second level. The four low-pass pictures of the third level are combined, in an inverse filter level of the third level, with the transmitted high-pass pictures of the third level so as to have eight low-pass pictures of the fourth level in a successive format at an output 110 of the inverse filter of the third level. These eight low-pass pictures are then combined, in an inverse filter of the fourth level, with the eight high-pass pictures of the fourth level which are received from the transmission medium 100 via the input HP4, again as was discussed with reference to the first level, so as to obtain a reconstructed group of 16 pictures at an output 112 of the inverse filter of the fourth level. Thus, two pictures, i.e. either original pictures or pictures which represent the low-pass signals and have been produced in a level which is one up, are decomposed into a low-pass signal and a high-pass signal in each step of the analysis filterbank. The low-pass signal may be considered as a representation of the similarities of the input pictures, whereas the high-pass signal may be considered as a representation of the differences between the input pictures. In the corresponding stage of the synthesis filterbank, both input pictures are reconstructed using the low-pass and the high-pass signals. Since the inverse operations of the analysis step are performed in the synthesis step, the analysis/synthesis filterbank guarantees perfect reconstruction (in the absence of quantization, of course).
The only occurring losses are due to the quantization in the further-processing means, e.g. 26a, 26b, 18. If the quantization effected is very fine, a good signal/noise ratio is achieved. If, on the other hand, the quantization effected is very coarse, a relatively poor signal/noise ratio is achieved, however at a low bit rate, i.e. at a low bit demand.
Even without SNR scalability, a time scaling control may be implemented with the concept shown in Fig. 3. For this purpose, a time scaling control 120 is utilized, with reference to Fig. 5, the control 120 being configured to obtain, on the input side, the high-pass and low-pass outputs, respectively, and/or the outputs of the further-processing means (26a, 26b, 18, ...), so as to produce, from these partial data streams TP1, HP1, HP2, HP3, HP4, a scaled data stream exhibiting, in a base scaling layer, the further-processed version of the first low-pass picture and of the first high-pass picture. A first extension scaling layer could then accommodate the further-processed version of the second high-pass pictures. A second extension scaling layer could then accommodate the further-processed versions of the high-pass pictures of the third level, whereas a third extension scaling layer includes the further-processed versions of the high-pass pictures of the fourth level. Thus, a decoder could produce, merely on the basis of the base scaling layer, a sequence of low-pass pictures of a low level, the sequence being of low quality in terms of time, i.e. two low-pass pictures of the first level per group of pictures. With the addition of each extension scaling layer, the number of pictures reconstructed per group may always be doubled. The functionality of the decoder is typically controlled by a scaling control configured to recognize how many scaling layers are contained in the data stream, and/or how many scaling layers are to be taken into account by the decoder in decoding.
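The doubling behaviour described above may be restated in a few lines (a hypothetical mapping, not taken from the patent text; the numbers merely repeat what is said in the preceding paragraph):

def pictures_per_group(extension_layers, group_size=16):
    # Base scaling layer (TP1 + HP1) yields two pictures per group;
    # each added extension scaling layer doubles this number.
    return min(2 * 2 ** extension_layers, group_size)

print([pictures_per_group(k) for k in range(4)])  # [2, 4, 8, 16]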
An SNR scaling functionality as is preferred for the present invention will be explained in more detail below with reference to Fig. 6. The SNR scalability in accordance with Fig. 6 will be explained using the signals HP4, HP3, HP2, HP1, TP1 output from Fig. 3. An encoder provided with SNR scalability additionally includes, apart from the elements depicted in Fig. 3, inverse further-processing means 140 which contain, as a matter of principle, the elements designated by Q^-1 on the decoder side of Fig. 3. However, it shall be pointed out that the inverse further-processing means 140 are provided on the encoder side to provide the SNR scalability. The inverse further-processing means 140 produce an inversely quantized and inversely further-processed representation of the low-pass pictures of the first level TP1, of the high-pass pictures of the first level HP1, of the high-pass pictures of the second level, of the high-pass pictures of the third level and of the high-pass pictures of the fourth level.
These inversely quantized video signals are fed to a component subtracter 142, which, at the same time, receives the first low-pass picture TP1, the first high-pass picture HP1, the high-pass pictures of the second level, the high-pass pictures of the third level and the high-pass pictures of the fourth level prior to quantization, i.e. at the outputs of the subtracter 69 and of the adder 70, respectively, of Fig. 3. Subsequently, a subtraction is performed component by component, i.e., for example, the first inversely quantized low-pass picture of the first level is subtracted from the first low-pass picture of the first level prior to quantization, etc., to obtain respective quantization errors for the first low-pass picture of the first level, F_TP1, the first high-pass picture of the first level, F_HP1, etc. These quantization errors, which are due to the fact that a quantization with a first quantizer step size was originally performed, are fed into further-processing means 144 performing further processing of, preferably, all quantization error signals F using a second quantizer step size smaller than the first quantizer step size. While the output signal from means 144 represents a first extension scaling layer, the input signal into the inverse further-processing means 140 represents the base scaling layer which may be transmitted, just like the first extension scaling layer, into the medium 100, or, in logical terms, into the network abstraction layer, where additional signal format manipulations etc. may be performed depending on the form of implementation.
For producing a second extension scaling layer it shall be assumed, starting from Fig. 6, that the first extension scaling layer is inversely further processed again using the second quantizer step size on the basis of block 144. The result will then be added to the data which has been inversely quantized using the first quantizer step size, so as to produce a decoded representation of the picture signal on the basis of the base scaling layer and the first extension scaling layer. This representation is then fed again into a component subtracter so as to calculate the remaining error on the basis of the quantization using the first quantizer step size and the second quantizer step size. Subsequently, the error signal obtained therefrom is to be quantized again using a third quantizer step size smaller than the first and second quantizer step sizes, so as to produce a second extension scaling layer.
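The following sketch condenses this layering into a few lines; uniform scalar quantization stands in for the H.264/AVC further processing, and the step sizes are illustrative assumptions:

import numpy as np

def quantize(x, step):
    return np.round(x / step)

def dequantize(levels, step):
    return levels * step

subband = np.array([7.3, -2.1, 0.4, 5.9])            # samples of a subband picture
base_rec = dequantize(quantize(subband, 4.0), 4.0)   # base layer, first step size
error = subband - base_rec                           # quantization error signal
ext_rec = dequantize(quantize(error, 1.0), 1.0)      # extension layer, smaller step

refined = base_rec + ext_rec  # the decoder adds up base and extension reconstructions
print(np.abs(subband - base_rec).max(), np.abs(subband - refined).max())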
It shall be pointed out that every scaling layer may contain not only the further-processed low-pass/high-pass pictures of the various levels, but, in addition, the backward motion field Mi0 as well as the forward motion field Mi1 as side information. It shall also be pointed out that a motion field Mi0 as well as a motion field Mi1 are produced for each two successive pictures, respectively, these motion fields, or, generally speaking, this prediction information, being transferred, in addition to the further-processed high-pass/low-pass pictures, in the form of side information from the encoder to the decoder.
A more detailed description will be given below of the prediction step and the update step of the analysis and synthesis processes. The motion fields Mi0 and Mi1 generally specify the motion between two pictures using a subset of the P-slice syntax of H.264/AVC. For the motion fields Mi0, which are used by the prediction steps, it is preferred to include an INTRA-like macroblock-based fall-back mode, wherein the (motion-compensated) prediction signal for a macroblock is specified, in a manner similar to the INTRA_16x16 macroblock mode of H.264/AVC, by a 4x4 array of luma transform coefficient levels and two 2x2 arrays of chroma transform coefficient levels, all AC coefficients being set to zero. This mode is not utilized in the motion fields Mi1 used for the update steps.
Below, a description will be given of the general motion- compensation prediction process utilized by the prediction and/or update steps both on the analysis side and on the synthesis side. A reference picture R, a quantization parameter QP (if required) , and a block-wise motion field M are input into this process, the following properties applying:
For each macroblock of the motion-compensated prediction picture P, the motion field M specifies a macroblock mode which may be P_16x16, P_16x8, P_8x16, P_8x8 or INTRA.
If the macroblock mode is P_8x8, a corresponding sub-macroblock mode is specified for each 8x8 sub-macroblock (P_8x8, P_8x4, P_4x8, P_4x4).
If the macroblock mode equals INTRA, the generation of the prediction signal is specified by a 4x4 array of luminance coefficient levels and by two 2x2 arrays of chrominance coefficient levels.
Otherwise, the generation of the prediction signal is specified by a motion vector with a quarter-sample accuracy for each macroblock or sub-macroblock partition.
With the reference picture R and the motion-field description M, the motion-compensated prediction signal P, which will then be subtracted and/or added, is structured in a macroblock manner as will be described below:
If the macroblock mode specified in M is not INTRA, the following will be implemented for each macroblock partition or for each sub-macroblock partition:
The luma and chroma samples of the picture area covered by the corresponding macroblock or sub-macroblock partition are obtained by a motion-compensated prediction with quarter-sample accuracy which conforms to the standard ISO/IEC 14496-10 AVC (Doc. JVT-G050r1, May 2003):

p[i,j] = Mint(r, i - mx, j - my).

In the above equation, [mx, my]^T is the motion vector of the macroblock considered and/or of the sub-macroblock considered, given by M. r[] is the array of luma or chroma samples of the reference picture R. In addition, Mint() represents the interpolation process specified for the motion-compensated prediction in H.264/AVC, however with the exception that the clipping to the interval [0;255] is eliminated.
Otherwise (if the macroblock mode is INTRA), the following applies :
The given 4x4 array of luminance transform coefficient levels is treated as an array of DC luma coefficient levels for the INTRA_16x16 macroblock mode in H.264/AVC, the inverse scaling/transformation process in accordance with H.264/AVC being used, to be precise using the given quantization parameter QP, it being assumed that all AC transform coefficient levels are set to zero. As a result, a 16x16 array res[] of residual luma samples is obtained. The luma samples of the prediction picture P, which relate to the macroblock considered, are calculated as follows:
p[i, j] = 128 + res[i, j] .
It shall be pointed out here that for each 4x4 luma block, the prediction signal p[] obtained is constant and represents an approximation of the average of the original 4x4 luma block.
For each chrominance component, the given 2x2 array of chrominance transform coefficient levels is treated as an array of DC chroma coefficient levels, the inverse scaling/transformation process for chroma coefficients being utilized in accordance with the standard, to be precise using the given quantization parameter QP, it being assumed that all AC transform coefficient levels are zero. As a result, an 8x8 array res[] of residual chroma samples is obtained. The chroma samples of the prediction picture P, which relate to the macroblock considered, are calculated as follows:
p[i, j] = 128 + res [i, j] .
It shall be pointed out that for each 4x4 chroma block, the prediction signal p[] obtained is constant and represents an approximation of the average of the original 4x4 chroma block.
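A strongly simplified sketch of this fall-back mode is given below; a flat inverse quantization stands in for the complete H.264/AVC inverse scaling/transformation (which additionally involves an inverse Hadamard transform of the DC array), and the helper itself is hypothetical:

import numpy as np

def intra_fallback_luma(dc_levels, step):
    # dc_levels: 4x4 array of DC luma coefficient levels; AC levels assumed zero.
    # Each inverse-scaled DC yields one constant 4x4 residual block, and the
    # prediction is p[i, j] = 128 + res[i, j] over the 16x16 macroblock.
    res = np.kron(dc_levels * step, np.ones((4, 4)))
    return 128 + res

p = intra_fallback_luma(np.zeros((4, 4)), step=8)  # all-zero DCs give a flat block
print(p.shape, p[0, 0])                            # (16, 16) 128.0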
Once the entire prediction signal picture P has been produced, the deblocking filter as defined in the standard is preferably applied to this picture, the derivation of the boundary filter strength being based only on the macroblock modes (information about INTRA) and, in particular, on the motion vectors specified in the motion description M. In addition, the clipping to the interval [0;255] is eliminated.
As may be seen from the above description, the general process of producing motion-compensated prediction pictures is almost identical with the reconstruction process of P slices in accordance with H.264/AVC. However, the following differences may be seen:
The clipping to the interval [0;255] performed in the processes of the motion-compensated prediction and of the deblocking is eliminated.
In addition, a simplified INTRA-mode reconstruction is performed without INTRA prediction, all AC transform coefficient levels additionally being set to zero. Moreover, a simplified reconstruction is performed for the motion-compensated prediction modes without residual information.

Reference shall be made below to the prediction step on the analysis side (coder side). Two input pictures A and B as well as the motion field Mi0 shall be given, which motion field represents the block-wise motion of picture B in relation to picture A. In addition, a quantization parameter QP shall be given, the following operations being performed to obtain a residual picture H:
A picture P, which represents a prediction of picture B, is obtained by invoking the above-described process, the reference picture A, the motion-field description Mi0 and the quantization parameter QP being used as the input.
The residual picture H is generated by:
h[i, j] = b[i, j] - p[i, j] ,
wherein h[], b[] and p[] represent the luma or chroma samples of the pictures H, B and P, respectively.
The update step on the analysis side (coder side) will be described below. For this purpose, the input picture A, the residual picture H obtained in the prediction step, as well as the motion field Mi1 representing the block-wise motion of picture A in relation to picture B, shall be given, the following operations being performed to obtain a picture L representing the temporal low-pass signal.
A picture P is produced by invoking the above-described process, the picture H, however, being used as the reference picture, and the motion-field description Mi1 being used as the input.
The low-pass picture L is produced by the following equation:
l[i,j] = a[i,j] + (p[i,j] >> 1), wherein l[], a[] and p[] represent the luma or chroma sample arrays of pictures L, A and P, respectively. In addition, it shall be pointed out that the operator >> represents a shift to the right by one bit, i.e. a halving of the value.
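In compact form, and with the motion-compensated prediction replaced by an identity predictor purely for illustration (a real implementation would derive the prediction picture P from the motion fields Mi0 and Mi1 as described above), the two analysis steps read:

import numpy as np

def analysis_step(a, b):
    # Prediction step: h = b - P(a); update step: l = a + (P(h) >> 1).
    # Identity 'motion' is assumed here, so P(x) = x; >> 1 is an arithmetic
    # right shift, i.e. a halving with rounding towards minus infinity.
    h = b.astype(np.int32) - a.astype(np.int32)
    l = a.astype(np.int32) + (h >> 1)
    return l, h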
The update step on the synthesis side (decoder side) will be described below. A low-pass picture L, the residual picture H as well as the motion field Mi1 shall be given, the following operations being performed to obtain the reconstructed picture A:

A picture P is produced by invoking the above-described process, H being used as the reference picture, and the motion-field description Mi1 being used as the input. The reconstructed picture A is produced by:

a[i,j] = l[i,j] - (p[i,j] >> 1),

wherein a[], l[] and p[] represent the sample arrays of pictures A, L and P, respectively.
The prediction step on the synthesis side (decoder side) will be described below. The residual picture H, the reconstructed picture A obtained in the update step, as well as the motion field Mi0 shall be given, the following operations being performed to obtain the reconstructed picture B:
A picture P representing a prediction of picture B is produced by invoking the above-described method, picture A being used as the reference picture, and the motion-field description Mi0 and the quantization parameter QP being used as the input.
The reconstructed picture B is produced by b[i,j] = h[i,j] + p[i,j],
wherein b[], h[] and p[] represent the sample arrays of the pictures B, H, P, respectively.
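Continuing the identity-predictor sketch given after the analysis steps above, the synthesis steps invert the analysis exactly, even in integer arithmetic, since the same shifted value is first subtracted and the prediction is then added back:

import numpy as np

def synthesis_step(l, h):
    # Reversed update step: a = l - (P(h) >> 1); reversed prediction step: b = h + P(a).
    a = l - (h >> 1)
    b = h + a
    return a, b

a = np.array([[100, 102], [98, 97]], dtype=np.int32)
b = np.array([[101, 104], [97, 95]], dtype=np.int32)
l, h = analysis_step(a, b)                  # analysis_step from the sketch above
a2, b2 = synthesis_step(l, h)
assert (a2 == a).all() and (b2 == b).all()  # perfect reconstruction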
By cascading the principally pair-wise picture-decomposition stages, a dyadic tree structure is obtained which decomposes a group of 2^n pictures into (2^n - 1) residual pictures and a single low-pass (or intra) picture, as is shown for a group of 8 pictures with reference to Fig. 7. In particular, Fig. 7 shows the high-pass picture HP1 of the first level at the output 22 of the filter of the first level, as well as the low-pass picture of the first level at the output 24 of the filter of the first level. Both low-pass pictures TP2 at the output 16 of the filter of the second level, as well as both high-pass pictures processed by the second further-processing means 18 of Fig. 3, are shown as the second level in Fig. 7.
The low-pass pictures of the third level are present at the output 76 of the filter of the third level, whereas the high-pass pictures of the third level are present at the output 75 in a further-processed form. The group of 8 pictures could include 8 original video pictures, in which case the encoder of Fig. 3 would be employed without a fourth filter level. If, on the other hand, the group of 8 pictures is a group of 8 low-pass pictures, as are present at the output 73 of the filter of the fourth level, the inventive encoder may likewise be employed.
Generally speaking, (2^(n+1) - 2) motion-field descriptions, (2^n - 1) residual pictures as well as a single low-pass (or intra) picture are transmitted for a group of 2^n pictures.
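These counts follow directly from the 2^n - 1 two-channel analysis steps of the cascade, each of which contributes one residual picture and two motion fields (one Mi0 and one Mi1); for a group of 16 pictures, for instance:

n = 4                               # group of 2**n = 16 pictures
analysis_steps = 2 ** n - 1         # 8 + 4 + 2 + 1 = 15 two-channel steps
motion_fields = 2 * analysis_steps  # 2**(n+1) - 2 = 30 motion-field descriptions
residual_pictures = analysis_steps  # 2**n - 1 = 15, plus one low-pass picture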
The motion-field descriptions are coded using a subset of the H.264/AVC slice layer syntax comprising the following syntax elements:
- slice header (the meaning of certain elements having been changed)
- slice data (subset)
  o macroblock layer (subset)
    - mb_type (P_16x16, P_16x8, P_8x16, P_8x8, INTRA)
    - if ( mb_type == P_8x8 )
      • sub_mb_type (P_8x8, P_8x4, P_4x8, P_4x4)
    - if ( mb_type == INTRA )
      • mb_qp_delta
      • residual blocks (only LUMA_DC and CHROMA_DC)
    - else
      • motion vector differences
  o end_of_slice_flag
- rbsp_slice_trailing_bits
The motion-vector predictors are derived as specified in the standard.
In relation to the residual pictures (high-pass signals) it shall be pointed out that the residual pictures are coded using a subset of the H.264/AVC slice layer syntax comprising the following syntax elements:

- slice header (the meaning of certain elements having been changed)
- slice data (subset)
  o macroblock layer (subset)
    - coded_block_pattern
    - mb_qp_delta
    - residual blocks
  o end_of_slice_flag
- rbsp_slice_trailing_bits

With regard to the intra pictures (low-pass pictures) it shall be pointed out that they are generally coded using the syntax of the standard. In the simplest form, the low-pass pictures of each group of pictures are independently coded as intra pictures. The coding efficiency may be improved if the correlations between the low-pass pictures of successive picture groups are exploited. Thus, in a more general form, the low-pass pictures are coded as P pictures while using reconstructed low-pass pictures of previous picture groups as references. Intra (IDR) pictures are employed at regular intervals to provide random access points. The low-pass pictures are decoded and reconstructed as specified in the standard, to be precise using the deblocking filter operation.
As has already been set forth, the present invention provides SNR scalability. The open-loop structure of the subband approach, as opposed to a closed loop, provides the possibility of accommodating SNR scalability in a manner which is efficient and conforms to the standard to a great extent. What is achieved is an SNR-scalable extension wherein the base layer is coded as has been described above, and wherein the extension layers consist of improvement pictures for the subband signals which themselves are again coded as provided for by the residual-picture syntax.
On the encoder side, reconstruction error pictures are produced between the original subband pictures produced by the analysis filterbank, and the reconstructed subband pictures produced after decoding the base layer or a previous extension layer. These reconstruction error pictures are quantized and coded using a quantization parameter which is smaller than in the base scaling layer or in one or several previous extension scaling layers, to be precise using the residual-picture syntax set forth above. On the decoder side, the subband representation of the base layer and the improvement signals of various extension layers may be decoded independently of one another, the eventual extension-layer subband representation being obtained by adding up the base-layer reconstruction and the reconstructed improvement signals of the extension layers for all temporal subbands.
It shall be pointed out that performance losses are relatively small compared with non-scalable coding if the quantization parameters are reduced by a value of 6 from one layer to the next. This halving of the quantization step size results roughly in a doubling of the bit rate from one extension scaling layer to the next extension scaling layer, and from the base scaling layer to the first extension scaling layer, respectively.
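This follows from the H.264/AVC quantizer design, in which the step size approximately doubles for every increase of the quantization parameter by 6, i.e. step size ~ 2^(QP/6); a quick numerical check (illustrative values):

for qp in (34, 28, 22):
    # Each decrease of QP by 6 halves the step size and thus roughly doubles the rate.
    print(qp, 2 ** (qp / 6.0))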
It shall be pointed out here that the various scalings represented with regard to Figs. 5 and 6, i.e. the time scaling and the SNR scaling, may be combined to the effect that a data stream scaled in terms of time may, of course, still be SNR-scaled.
Certain possibilities of coder control shall be described below. Initially, preferred quantization parameters will be set forth. If the motion is neglected and if the bit shift to the right in the update step, as has been set forth above, is replaced by a real-valued multiplication by a factor of 1/2, the fundamental two-channel analysis step may be normalized by multiplying the high-pass samples of the picture H by a factor of 1/sqrt(2) and the low-pass samples of the picture L by a factor of sqrt(2).
Preferably, this normalization is neglected in the realization of the analysis filterbank and the synthesis filterbank so as to keep the range of the samples nearly constant. However, this normalization is taken into account during the quantization of the temporal subbands. For the fundamental two-channel analysis/synthesis filterbank, this may be readily achieved by quantizing the low-pass signal with half of the quantization step size used for quantizing the high-pass signal. This leads to the following quantizer selection for the specified dyadic decomposition structure of a group of 2^n pictures: let QP_L(n) be the quantization parameter used for coding the low-pass picture obtained after the n-th decomposition stage.
The quantization parameters used for coding the high-pass pictures obtained after the i-th decomposition stage are calculated as follows:

QP_H(i) = QP_L(n) + 3 * (n - i + 2).
The quantization parameter QP_INTRA(i) used for quantizing the intra prediction signals of the motion-field description M(i-1)0 which are used in the i-th decomposition stage is derived from the quantization parameter QP_H(i) for the high-pass pictures produced in this decomposition stage, by means of the following equation:

QP_INTRA(i) = QP_H(i) - 6.
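A small sketch of this quantizer selection, using the formulas as reconstructed above (the numerical values are illustrative):

def qp_high(i, qp_l, n):
    # QP for the high-pass pictures of the i-th decomposition stage.
    return qp_l + 3 * (n - i + 2)

def qp_intra(i, qp_l, n):
    # QP for the intra prediction signals of the motion field M(i-1)0.
    return qp_high(i, qp_l, n) - 6

n, qp_l = 4, 22
print([qp_high(i, qp_l, n) for i in range(1, n + 1)])  # [37, 34, 31, 28]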
The motion-field descriptions Mi0 and Mi1 used in the prediction step, on the one hand, and in the update step, on the other hand, are preferably estimated independently of one another. The process of estimating the motion-field description Mi0 used in the prediction step will be described below. The process of estimating Mi1 is obtained by exchanging the original pictures A and B and by eliminating the INTRA mode from the set of possible macroblock modes.
Pictures A and B shall be given, which are either original pictures or pictures representing low-pass signals that have been produced in a previous analysis stage. In addition, the corresponding arrays of luma samples a[] and b[] are provided. The motion description Mi0 is estimated in a macroblock-wise manner as follows: for all possible macroblock and sub-macroblock partitions of a macroblock i within picture B, the associated motion vectors
m_i = [mx, my]^T
are determined by minimizing the Lagrangian functional
m_i = argmin_{m ∈ S} { D_SAD(i, m) + λ * R(i, m) },
wherein the distortion term is given as follows:
D_SAD(i, m) = Σ_{(x,y) ∈ P} | b[x, y] - a[x - mx, y - my] |.
Here, S specifies the motion-vector search range within the reference picture A. P is the area covered by the macroblock partition or sub-macroblock partition considered. R(i, m) specifies the number of bits required to transmit all components of the motion vector m, and λ is a fixed Lagrangian multiplier.
The motion search initially proceeds across all motion vectors in the given search range S which have an integer-sample accuracy. Using the best integer motion vector, the 8 surrounding half-sample accurate motion vectors are then tested. Finally, the 8 surrounding quarter-sample accurate motion vectors are tested around the best half-sample accurate motion vector. For the half- and quarter-sample accurate motion-vector refinements, the term
a[x - mx, y - my]
is interpreted as the interpolation operator.
The mode decision for the macroblock mode and the sub-macroblock mode is principally made in line with the same approach. The mode p_i which minimizes the following Lagrangian functional is selected from a given set of possible macroblock or sub-macroblock modes S_mode:

p_i = argmin_{p ∈ S_mode} { D_SAD(i, p) + λ * R(i, p) }.

The distortion term is given as follows:

D_SAD(i, p) = Σ_{(x,y) ∈ P} | b[x, y] - a[x - mx[p, x, y], y - my[p, x, y]] |,
wherein P specifies the macroblock or sub-macroblock area, and wherein m[p, x, y] = [mx[p, x, y], my[p, x, y]]^T is the motion vector associated with the macroblock or sub-macroblock mode p and with the macroblock partition or sub-macroblock partition which includes the luma position (x, y).
The rate term R(i, p) represents the number of bits associated with the selection of the coding mode p. For the motion-compensated coding modes, it includes the bits for the macroblock mode (if applicable), the sub-macroblock mode(s) (if applicable) and the motion vector(s). For the intra mode, it includes the bits for the macroblock mode and the arrays of quantized luma and chroma transform coefficient levels.
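A compact sketch of such a Lagrangian motion search is given below; it performs an integer-sample full search only (the half- and quarter-sample refinements are omitted), and the rate term is modeled by a crude bit count, both being simplifying assumptions:

import numpy as np

def sad(b, a, x0, y0, mx, my, size):
    # Sum of absolute differences between the block of B at (x0, y0) and the
    # block of A displaced by (mx, my); samples outside A are clamped to the border.
    hgt, wdt = a.shape
    total = 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            ry = min(max(y - my, 0), hgt - 1)
            rx = min(max(x - mx, 0), wdt - 1)
            total += abs(int(b[y, x]) - int(a[ry, rx]))
    return total

def mv_bits(mx, my):
    # Crude bit count for the motion-vector components (rate term R).
    return sum(2 * abs(v).bit_length() + 1 for v in (mx, my))

def motion_search(a, b, x0, y0, size=16, search=8, lam=4.0):
    best_cost, best_mv = float("inf"), (0, 0)
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            cost = sad(b, a, x0, y0, mx, my, size) + lam * mv_bits(mx, my)
            if cost < best_cost:
                best_cost, best_mv = cost, (mx, my)
    return best_mv

a = np.random.randint(0, 255, (32, 32))
b = np.roll(a, shift=(0, 2), axis=(0, 1))  # B equals A shifted right by 2 samples
print(motion_search(a, b, 8, 8))           # expected: a vector close to (2, 0)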
The set of possible sub-macroblock modes is given by:
{P_8x8, P_8x4, P_4x8, P_4x4}.
The set of possible macroblock modes is given by:
{P_16xl6, P_16x8, P_8xl6, P_8x8, INTRA},
wherein the intra mode is used only if a motion-field description Mi0 used for the prediction step is estimated.
The Lagrangian multiplier λ is set in accordance with the following equation, depending on the base-layer quantization parameter QP_H(i) of the high-pass picture(s) of the decomposition stage for which the motion field is estimated:

λ = 0.33 * 2^(QP_H(i)/3 - 4).
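For instance (hypothetical helper and values):

def lagrange_multiplier(qp_h):
    # lambda = 0.33 * 2^(QP_H(i)/3 - 4), as given above.
    return 0.33 * 2 ** (qp_h / 3.0 - 4)

print(round(lagrange_multiplier(28), 2))  # approximately 13.3 for QP_H(i) = 28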
The fundamental two-channel analysis filterbank decomposes two input pictures A, B into a low-pass picture L and a high-pass picture H. In accordance with the notation used in this document, the low-pass picture L uses the coordinate system of the original picture A. Thus, if a perfect (error-free) motion compensation is assumed, pictures A and L are identical.
The decomposition structure set forth in Fig. 4 is obtained if in all decomposition stages the even input pictures are treated as input pictures A at temporal sampling positions 0, 2, 4, ..., and if the odd input pictures are treated as input pictures B at sampling positions 1, 3, 5, .... This scheme enables optimum temporal scalability. However, the temporal distance between the pictures decomposed in each two-channel analysis filterbank is increased by a factor of 2 from one decomposition stage to the next. In addition, it is known that the efficiency of the motion-compensated prediction decreases as the temporal distance between the reference picture and the picture to be predicted increases.
It is preferred, in accordance with the invention, to realize decomposition schemes wherein the temporal distance between the pictures decomposed by the two-channel filterbank is increased by a factor smaller than 2 from one decomposition stage to the next. However, these schemes do not provide the feature of optimum temporal scalability, since the distances between adjacent low-pass pictures vary in most decomposition stages.
In accordance with the invention, the decomposition scheme depicted in Fig. 8 is used, of which it is assumed that it enables a reasonable compromise between temporal scalability and coding efficiency. The sequence of the original pictures is treated as a sequence of input pictures A, B, A, B, A, B, ..., A, B. Thus, this scheme provides a stage with optimum temporal scalability (equal distances between the low-pass pictures). The sequence of low-pass pictures used as an input signal into all following decomposition stages is treated as a sequence of input pictures B, A, A, B, B, A, ..., A, B, whereby the distances between the low-pass pictures decomposed are kept small in the following two-channel analysis scheme, as is shown in Fig. 8.
In a preferred embodiment of the present invention, the motion fields Mi0 and Mi1 used in the prediction step and the update step, respectively, are estimated and coded independently of one another. This may result in an increase in the bit rate required for transmitting the motion parameters. However, there may also be a negative influence on the connectivity of these two motion fields, which may have an important impact on the coding efficiency of the subband approach. Therefore it is preferred, for the purpose of improving the coding efficiency, for the motion fields Mi1 used in the update step not to be estimated and coded independently, but to be derived from the motion fields Mi0 calculated in the prediction steps, to the effect that they still represent a block-wise motion compatible with the H.264/AVC specification. As an additional side effect, this will also reduce the complexity required for the update step.
In a preferred embodiment of the present invention, the current analysis/synthesis structure represents a lifting representation of the simple Haar filters. In a different embodiment, this scheme is extended to include a lifting representation with biorthogonal 5/3 filters, which leads to a bidirectional motion-compensated prediction. The most preferred approach is to adaptively switch between the lifting representations of the Haar filters and the 5/3 filters on a block basis, it preferably being possible to also employ the motion-compensated prediction as is specified for B slices in H.264/AVC.
In addition, it is preferred to adaptively select the size of the groups of pictures for the temporal subband decomposition.
Since the use of several reference pictures has significantly improved the performance of prediction-based video coding schemes, an inclusion of this approach into a subband scheme is also preferred.
In addition, the use of a suitable bit-allocation algorithm which reduces the SNR variations within a group of pictures that may occur with certain test sequences is preferred.
In addition, it is preferred to use new techniques for transform coefficient coding which improve SNR scalability and provide an additional degree of spatial scalability.
Depending on the circumstances, the inventive method for coding, decoding, filtering and/or inverse filtering may be implemented in hardware or in software. The implementation may be effected on a digital storage medium, in particular a disc or CD with electronically readable control signals, which may cooperate with a programmable computer system such that the method is performed. Generally, the invention thus also consists in a computer-program product having a program code, stored on a machine-readable carrier, for performing the inventive method, if the computer-program product runs on a computer. In other words, the invention may thus be realized as a computer program having a program code for performing the method, if the computer program runs on a computer.

Claims
1. Apparatus for coding a group of successive pictures, the group comprising at least a first, a second, a third and a fourth picture, the apparatus comprising: a second filter level (12) for producing a first high-pass picture of the second level and a first low-pass picture of the second level from the first and second pictures, and for producing a second high-pass picture of the second level and a second low-pass picture of the second level from the third and the fourth pictures; a first filter level (20) for producing a first high-pass picture of the first level and a first low-pass picture of the first level from the first and second low-pass pictures of the second level; and first further-processing means (26) for further processing the first high-pass picture of the first level and the first low-pass picture of the first level so as to obtain a coded picture signal (28), the first further-processing means (26) including a quantizer having a quantizer step size.
2. Apparatus as claimed in claim 1, wherein the second filter level (12) or the first filter level (20) is configured in accordance with a lifting scheme.
3. Apparatus as claimed in claim 1 or 2, wherein the second filter level (12) or the first filter level (20) is configured to produce a low-pass picture such that it includes similarities from two filtered pictures, and to produce a high-pass picture such that it includes differences of two filtered pictures.
4. Apparatus as claimed in any of the previous claims, wherein the second filter level comprises: a first predictor (67) for producing a first prediction picture for the first picture on the basis of the second picture; a second predictor (68) for producing a second prediction picture for the second picture on the basis of the first picture; a first combiner (69) for combining the first prediction picture with the first picture; and a second combiner (70) for combining the second prediction picture with the second picture.
5. Apparatus as claimed in any of the previous claims, wherein the first filter level comprises: a first predictor (60) for producing a first prediction picture for the first low-pass picture of the second level on the basis of the second low-pass picture of the second level; a second predictor (61) for producing a second prediction picture for the second low-pass picture of the second level on the basis of the first low-pass picture of the second level; a first combiner (44) for combining the first prediction picture with the first picture; and a second combiner (46) for combining the second prediction picture with the second picture.
6. Apparatus as claimed in claim 4 or 5, wherein the first combiner (44, 69) is a subtracter, and the second combiner (70, 46) is an adder, and wherein an output signal of the first combiner (44, 69) is a high-pass picture, whereas an output signal of the second combiner is a low-pass picture.
7. Apparatus as claimed in any of the previous claims, wherein the first filter level or the second filter level is configured to perform a motion compensation.
8. Apparatus as claimed in claim 4 or 5, wherein the first predictor is configured to perform a backward-motion compensation, and wherein the second predictor is configured to perform a forward-motion compensation.
9. Apparatus as claimed in claim 4 or 5, wherein the first or second predictors are configured to produce a prediction picture using prediction information, the prediction information being transferable in a side channel.
10. Apparatus as claimed in any of the previous claims, further comprising: a time scalability control (120) for entering a further-processed version of the first high-pass picture of the first level and of the first low-pass picture of the first level in a base scaling layer, and for entering a further-processed version of the first high-pass picture of the second level and of the second high-pass picture of the second level in an extension scaling layer.
11. Apparatus as claimed in any of the previous claims, further comprising: signal/noise distance scaling processing means comprising: means for entering a further-processed version of the first high-pass picture of the second level and of the second high-pass picture of the second level as well as a further-processed version of the first high-pass picture of the first level, and a further-processed version of the first low-pass picture of the first level into a base scaling layer, the base scaling layer having at least one quantizer step size associated with it; means (140) for inversely further processing the further-processed versions contained in the base scaling layer, the means for inversely further processing including a requantizer configured to operate on the basis of the quantizer step size associated with the base scaling layer; means (142) for subtracting the inversely further-processed versions from corresponding, originally produced high-pass and/or low-pass pictures so as to obtain component-wise quantization error signals; and means (144) for further processing the quantization error signals using a further quantizer step size smaller than a quantizer step size associated with the base scaling layer, and for entering the further-processed versions of the quantization error signals into an extension layer having the further quantizer step size associated with it.
12. Apparatus as claimed in any of the previous claims, which further comprises a second further-processing unit (18) for further processing the first high-pass picture of the second level and the second high-pass picture of the second level, and wherein the first further-processing means (26a, 26b) and the second further-processing means (18) are configured to perform further processing operations in accordance with the H.264/AVC standard.
13. Apparatus as claimed in any of the previous claims, wherein the first, second, third and fourth pictures are first, second, third, and fourth low-pass pictures, respectively, of a third level which have been produced from a group of eight pictures using third filter means.
14. Apparatus as claimed in claim 13, wherein the eight pictures in turn are eight low-pass pictures that have been produced from a group of sixteen pictures using a fourth filter level.
15. Apparatus as claimed in any of the previous claims, wherein the second filter level or the first filter level include means for delaying a picture by the duration of a picture, and means (42a, 42b) for eliminating a picture from a corresponding signal.
16. Apparatus as claimed in any of the previous claims, wherein the group of pictures comprising the first, second, third and fourth pictures is derived from an original group of pictures, wherein the second picture appears, in terms of sequence, before the first picture, the group being producible from the original group by means of switching means.
17. Apparatus for decoding a coded signal which is derived by filtering a group of successive pictures including at least one first, one second, one third or one fourth picture, by filtering for producing a first high-pass picture of a second level and a first low-pass picture of the second level from the first and second pictures and for producing a second high-pass picture of the second level and a second low-pass picture of the second level from the third and fourth pictures, by filtering for producing a first high-pass picture of the first level and a first low-pass picture of the first level from the first and second low-pass pictures of the second level, and by further processing the first high-pass picture of the first level and the first low-pass picture of the first level so as to obtain a coded picture signal, the apparatus comprising: means (30) for inversely further processing a further-processed version of the first high-pass picture of the first level and a further-processed version of the first low-pass picture of the first level using a quantizer step size so as to obtain a reconstructed version of the first high-pass picture and a reconstructed version of the first low-pass picture; and an inverse filtering level (32) for inversely filtering the reconstructed version of the first high-pass picture and the reconstructed version of the first low-pass picture to obtain a reconstructed version of a first low-pass picture and of a second low-pass picture of a level which is one up.
18. Apparatus as claimed in claim 17, further comprising: further inverse further-processing means for inversely further-processing a first high-pass picture of the second level and a second high-pass picture of the second level to obtain a reconstructed version of the first high-pass picture of the second level and of the second high-pass picture of the second level; and a further inverse filtering level for inversely filtering the reconstructed version of the first low-pass picture of the next level up and of the second low-pass picture of the next level up with the reconstructed version of the first high-pass picture of the second level and the reconstructed version of the second high-pass picture of the second level to obtain reconstructed versions of four pictures of a level which is one up, which represent the original pictures or low-pass pictures.
19. Apparatus as claimed in claim 17 or 18, wherein the coded signal is a scaled signal having a base layer and an extension layer, the base layer comprising pictures that have been further processed using a quantizer step size, and the extension layer comprising quantization error signals, for the pictures, that have been further processed using a further quantizer step size, the apparatus further comprising: means for inversely further processing the quantization error signals to obtain a reconstructed version of the quantization error signals; and means for adding the reconstructed version of the quantization error signals to a reconstructed version of the signals of the base layer to obtain a decoded output signal which has a higher quality than a decoded output signal obtained solely from the base layer.
20. Method of coding a group of successive pictures, the group comprising at least one first, one second, one third and one fourth picture, the method comprising: producing (12) a first high-pass picture of the second level and a first low-pass picture of the second level from the first and second pictures, and producing a second high-pass picture of the second level and a second low-pass picture of the second level from the third and the fourth pictures; producing (20) a first high-pass picture of the first level and a first low-pass picture of the first level from the first and second low-pass pictures of the second level; and further processing (26) the first high-pass picture of the first level and the first low-pass picture of the first level so as to obtain a coded picture signal (28), the further processing (26) including quantizing with a quantizer having a quantizer step size.
21. Method of decoding a coded signal which is derived by filtering a group of successive pictures including at least one first, one second, one third or one fourth picture, by filtering for producing a first high-pass picture of a second level and a first low-pass picture of the second level from the first and second pictures and for producing a second high-pass picture of the second level and a second low-pass picture of the second level from the third and fourth pictures, by filtering for producing a first high-pass picture of the first level and a first low-pass picture of the first level from the first and second low-pass pictures of the second level, and by further processing the first high-pass picture of the first level and the first low-pass picture of the first level so as to obtain a coded picture signal, the method comprising: inversely further processing (30) a further-processed version of the first high-pass picture of the first level and a further-processed version of the first low-pass picture of the first level using a quantizer step size so as to obtain a reconstructed version of the first high-pass picture and a reconstructed version of the first low-pass picture; and inversely filtering (32) the reconstructed version of the first high-pass picture and the reconstructed version of the first low-pass picture to obtain a reconstructed version of a first low-pass picture and of a second low-pass picture of a level which is one up.
22. Filtering device for filtering first and second pictures, comprising: a backward predictor (60) for producing a first prediction picture for the first picture on the basis of the second picture; a forward predictor (61) for producing a second prediction picture for the second picture on the basis of the first picture; a subtracter (44) for subtracting the first prediction picture from the first picture to obtain a high-pass picture which includes differences between the first and second pictures; and an adder (46) for adding the second prediction picture to the second picture to obtain a low-pass picture which represents similarities between the first and second pictures.
23. Inverse filtering device for inversely filtering a high-pass picture and a low-pass picture to obtain first and second pictures, the inverse filtering device comprising: a forward predictor for producing a first prediction picture for the low-pass picture from the high-pass picture; a subtracter (101) for subtracting the first prediction picture from the low-pass picture; a backward predictor (60) for producing a second prediction picture for the high-pass picture using an output signal from the subtracter; and an adder (102) for adding the high-pass picture and the second prediction picture obtained from the backward predictor (60), wherein an output signal from the adder (102) represents the first picture, whereas an output signal from the subtracter (101) represents the second picture.
24. Method of filtering first and second pictures, comprising: producing (60) a first prediction picture for the first picture on the basis of the second picture; producing (61) a second prediction picture for the second picture on the basis of the first picture; subtracting (44) the first prediction picture from the first picture to obtain a high-pass picture which includes differences between the first and second pictures; and adding (46) the second prediction picture to the second picture to obtain a low-pass picture which represents similarities between the first and second pictures.
25. Method of inversely filtering a high-pass picture and a low-pass picture to obtain first and second pictures, the method comprising: producing a first prediction picture for the low-pass picture from the high-pass picture; subtracting (101) the first prediction picture from the low-pass picture; producing (60) a second prediction picture for the high-pass picture using an output signal from the subtracter; and adding (102) the high-pass picture and the second prediction picture, wherein an output signal from the adder (102) represents the first picture, whereas an output signal from the subtracter (101) represents the second picture.
26. Computer program having a program code for performing the method as claimed in any of claims 20, 21, 24 or 25, if the computer program runs on a computer.
PCT/EP2004/009053 2003-09-02 2004-08-12 Apparatus and method for coding a group of successive pictures, and apparatus and method for decoding a coded picture signal WO2005022917A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE10340407.4 2003-09-02
DE2003140407 DE10340407A1 (en) 2003-09-02 2003-09-02 Apparatus and method for encoding a group of successive images and apparatus and method for decoding a coded image signal

Publications (1)

Publication Number Publication Date
WO2005022917A1 true WO2005022917A1 (en) 2005-03-10

Family

ID=34258340

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2004/009053 WO2005022917A1 (en) 2003-09-02 2004-08-12 Apparatus and method for coding a group of successive pictures, and apparatus and method for decoding a coded picture signal

Country Status (2)

Country Link
DE (1) DE10340407A1 (en)
WO (1) WO2005022917A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2306733A4 (en) * 2008-07-25 2017-03-15 Sony Corporation Image processing device and method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003061294A2 (en) * 2001-12-28 2003-07-24 Koninklijke Philips Electronics N.V. Video encoding method

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
FLIERL M, GIROD B: "Video Coding with Motion-Compensated Lifted Wavelet Transforms", PREPRINT SUBMITTED TO ELSEVIER SCIENCE, 3 June 2003 (2003-06-03), pages 1 - 33, XP002306317, Retrieved from the Internet <URL:www.stanford.edu/~bgirod/publications.html> [retrieved on 20041117] *
OHM J-R: "THREE-DIMENSIONAL SUBBAND CODING WITH MOTION COMPENSATION", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE INC. NEW YORK, US, vol. 3, no. 5, 1 September 1994 (1994-09-01), pages 559 - 571, XP000476832, ISSN: 1057-7149 *
PESQUET-POPESCU B ET AL: "THREE-DIMENSIONAL LIFTING SCHEMES FOR MOTION COMPENSATED VIDEO COMPRESSION", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, XX, XX, vol. CONF. 3, 2001, pages 1793 - 1796, XP002172582 *
SCHWARZ H, MARPE D, WIEGAND T: "SNR-scalable Extension of H.264/AVC", JOINT VIDEO TEAM OF ISO-IEC MPEG AND ITU-T VCEG, 9TH MEETING, 2 September 2003 (2003-09-02), SAN DIEGO, pages 1 - 17, XP002306318, Retrieved from the Internet <URL:ftp://ftp.imtc-files.org/jvt-experts> [retrieved on 20040715] *
SECKER A ET AL: "Motion-compensated highly scalable video compression using an adaptive 3D wavelet transform based on lifting", PROCEEDINGS 2001 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING. ICIP 2001. THESSALONIKI, GREECE, OCT. 7 - 10, 2001, INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, NEW YORK, NY : IEEE, US, vol. VOL. 1 OF 3. CONF. 8, 7 October 2001 (2001-10-07), pages 1029 - 1032, XP010563942, ISBN: 0-7803-6725-1 *
TAUBMAN D ET AL: "A COMMON FRAMEWORK FOR RATE AND DISTORTION BASED SCALING OF HIGHLY SCALABLE COMPRESSED VIDEO", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE INC. NEW YORK, US, vol. 6, no. 4, 1 August 1996 (1996-08-01), pages 329 - 354, XP000599291, ISSN: 1051-8215 *
TAUBMAN D: "Successive refinement of video: fundamental issues, past efforts, and new directions", PROCEEDINGS OF THE SPIE - THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING SPIE-INT. SOC. OPT. ENG USA, vol. 5150, no. 1, June 2003 (2003-06-01), pages 649 - 663, XP002306316, ISSN: 0277-786X *
TURAGA D-S ET AL: "UNCONSTRAINED MOTION COMPENSATED TEMPORAL FILTERING", ISO/IEC JTC1/SC29/WG11 M8388, XX, XX, May 2002 (2002-05-01), pages 1 - 15, XP002260700 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2306733A4 (en) * 2008-07-25 2017-03-15 Sony Corporation Image processing device and method

Also Published As

Publication number Publication date
DE10340407A1 (en) 2005-04-07

Similar Documents

Publication Publication Date Title
US8867619B2 (en) Apparatus and method for generating a coded video sequence by using an intermediate layer motion data prediction
JP5378791B2 (en) Method and apparatus for video encoding and decoding using adaptive interpolation
US7961790B2 (en) Method for encoding/decoding signals with multiple descriptions vector and matrix
CN109644270B (en) Video encoding method and encoder, video decoding method and decoder, and storage medium
US20050169379A1 (en) Apparatus and method for scalable video coding providing scalability in encoder part
US20060120448A1 (en) Method and apparatus for encoding/decoding multi-layer video using DCT upsampling
US20060088096A1 (en) Video coding method and apparatus
EP1877959A2 (en) System and method for scalable encoding and decoding of multimedia data using multiple layers
GB2434050A (en) Encoding at a higher quality level based on mixed image prediction factors for different quality levels
WO2011071514A2 (en) Methods and apparatus for adaptive residual updating of template matching prediction for video encoding and decoding
WO2010063881A1 (en) Flexible interpolation filter structures for video coding
JP5122288B2 (en) Apparatus and method for generating an encoded video sequence using intermediate layer residual value prediction and decoding the encoded video sequence
JP2008517498A (en) Apparatus and method for generating an encoded video sequence using intermediate layer motion data prediction
US20060088100A1 (en) Video coding method and apparatus supporting temporal scalability
WO2006059847A1 (en) Method and apparatus for encoding/decoding multi-layer video using dct upsampling
WO2005022917A1 (en) Apparatus and method for coding a group of successive pictures, and apparatus and method for decoding a coded picture signal
Atta et al. Motion-compensated DCT temporal filters for efficient spatio-temporal scalable video coding
JP4153774B2 (en) Video encoding method, decoding method thereof, and apparatus thereof
Nakachi et al. A study on multiresolution lossless video coding using inter/intra frame adaptive prediction
Mehrseresht Adaptive techniques for scalable video compression
Schwarz et al. INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO
WO2006080665A1 (en) Video coding method and apparatus
Grecos et al. Audiovisual Compression for Multimedia Services in Intelligent Environments
JP2006503475A (en) Drift-free video encoding and decoding method and corresponding apparatus
WO2006098586A1 (en) Video encoding/decoding method and apparatus using motion prediction between temporal levels

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
122 Ep: pct application non-entry in european phase