GB2501518A - Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units - Google Patents

Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units

Info

Publication number
GB2501518A
GB2501518A (Application GB1207316.9A)
Authority
GB
United Kingdom
Prior art keywords
prediction
image
picture
mode
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1207316.9A
Other versions
GB201207316D0 (en)
Inventor
Fabrice Le Leannec
Sebastien Lasserre
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1207316.9A priority Critical patent/GB2501518A/en
Publication of GB201207316D0 publication Critical patent/GB201207316D0/en
Priority to PCT/EP2013/058350 priority patent/WO2013160277A1/en
Publication of GB2501518A publication Critical patent/GB2501518A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY > H04 ELECTRIC COMMUNICATION TECHNIQUE > H04N PICTORIAL COMMUNICATION, e.g. TELEVISION > H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 using hierarchical techniques, e.g. scalability > H04N19/33 in the spatial domain
    • H04N19/10 using adaptive coding > H04N19/102 characterised by the element, parameter or selection affected or controlled by the adaptive coding > H04N19/103 Selection of coding mode or of prediction mode > H04N19/107 between spatial and temporal predictive coding, e.g. picture refresh
    • H04N19/124 Quantisation > H04N19/126 Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • H04N19/134 characterised by the element, parameter or criterion affecting or controlling the adaptive coding > H04N19/146 Data rate or code amount at the encoder output > H04N19/149 by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H04N19/169 characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding > H04N19/18 the unit being a set of transform coefficients
    • H04N19/169 characterised by the coding unit > H04N19/187 the unit being a scalable video layer
    • H04N19/90 using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals > H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An image in a digital video sequence comprising blocks of pixels is encoded and decoded. The encoding comprises a step of subtracting (1393) a prediction image (1391) from an original image (1310) to obtain a residual image (1394). The prediction image (1391) comprises a set of prediction units (1410 in fig. 14), each prediction unit being a best predictor according to a given criterion (13C) between predictors obtained by motion estimation (INTER) and predictors obtained using information from a corresponding base layer (1320) image (Intra BL, Base Mode). Corresponding decoding is also disclosed.

Description

A METHOD, DEVICE, COMPUTER PROGRAM, AND INFORMATION
STORAGE MEANS FOR ENCODING AND DECODING AN IMAGE
COMPRISING BLOCKS OF PIXELS
FIELD OF THE INVENTION
The invention relates to the field of scalable video coding, in particular to scalable video coding that would extend the High Efficiency Video Coding (HEVC) standard. The invention concerns a method, device, computer program, and information storage means for encoding and decoding an image comprising blocks of pixels, said image being comprised e.g. in a digital video sequence.
BACKGROUND OF THE INVENTION
Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, with an associated decoding device being available to reconstruct the bit-stream for display and viewing. A general aim is to form the bit-stream so as to be of smaller size than the original video information.
This advantageously reduces the capacity required of a transfer network, or storage device, to transmit or store the bit-stream code.
Common standardized approaches have been adopted for the format and method of the coding process, especially with respect to the decoding part. One of the more recent agreements is Scalable Video Coding (SVC) wherein the video image is split into smaller sections (called macroblocks or blocks) and treated as being comprised of hierarchical layers. The hierarchical layers include a base layer, equivalent to a collection of images (or frames) of the original video image sequence, and one or more enhancement layers (also known as refinement layers). SVC is the scalable extension of the H.264/AVC video compression standard. A further video standard being standardized is HEVC, wherein the macroblocks are replaced by so-called Coding Units and are partitioned and adjusted according to the characteristics of the original image segment under consideration. This allows more detailed coding of areas of the video image which contain relatively more information and less coding effort for those areas with fewer features.
The video images were originally processed by coding each macroblock individually, in a manner resembling the digital coding of still images or pictures. Later coding models allow for prediction of the features in one frame, either from neighbouring macroblocks, or by association with a similar macroblock in a neighbouring frame. This allows use of already available coded information, thereby shortening the amount of coding bit-rate needed overall. Differences between the source area and the area used for prediction are captured in a residual set of values which themselves are encoded in association with the code for the source area. Many different types of predictions are possible. Effective coding chooses the best model to provide image quality upon decoding, while taking account of the bit-stream size each model requires to represent an image in the bit-stream. A trade-off between the decoded picture quality and reduction in required code, also known as compression of the data, is the overall goal.
The context of the invention is the design of the scalable extension of HEVC. The HEVC scalable extension will allow coding/decoding a video made of multiple scalability layers.
These layers comprise a base layer that is compliant with standards such as HEVC, H.264/AVC or MPEG2, and one or more enhancement layers, coded according to the future scalable extension of HEVC.
It is known that to ensure good scalable compression efficiency, one has to exploit redundancy that lies between the base layer and the enhancement layer, through so-called inter-layer prediction techniques. In case of INTER pictures, one has to selectively predict successive picture blocks through intra-layer INTER prediction, intra-layer spatial INTRA prediction, inter-layer INTER prediction and inter-layer INTRA prediction. In classical scalable video codecs (encoder-decoder pairs), this takes the form of block prediction choice, one block after another, among the above mentioned available prediction modes, according to a rate distortion criterion. Each reconstructed block serves as a reference to predict subsequent blocks. Differences are noted and encoded as residuals. Competition between the various possible encoding mechanisms takes account of both the type of encoding used and the size of the bit-stream resulting from each type. A balance is achieved between the two considerations.
In general, the more information can be coded, and subsequently extracted, the better the result achieved on playback (decoding) of the video stream.
By coding sympathetically to the characteristics of the different blocks, quality can be maintained while bit-stream size is managed.
A problem of the coding process is to manage the bit-stream data produced. The more concise the coding, the more efficient the process is deemed to be.
SUMMARY OF THE INVENTION
This problem is addressed by an embodiment of the invention by provision of a method for encoding at least one image of scalable video data comprising pixels, the method comprising the steps of: -subtracting a prediction image from an original image to obtain a residual image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image; -transforming pixel values of the residual image to obtain transformed coefficients; -quantizing at least one of the transformed coefficients to obtain quantized symbols; -encoding the quantized symbols into encoded data.
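As an aid to reading the claimed steps, here is a minimal sketch of the encoding pipeline in Python, assuming 8x8 blocks, a plain uniform quantizer, and image dimensions that are multiples of 8; entropy coding is omitted, and every name is illustrative rather than taken from the patent.

```python
import numpy as np
from scipy.fft import dctn

def encode_image(original: np.ndarray, prediction: np.ndarray, step: float = 8.0):
    """Subtract the prediction image, transform the residual block by block,
    and quantize the coefficients (entropy coding of the symbols is omitted)."""
    residual = original.astype(np.float64) - prediction   # subtraction step
    h, w = residual.shape                                 # assumed multiples of 8
    symbols = []
    for y in range(0, h, 8):
        for x in range(0, w, 8):
            coeffs = dctn(residual[y:y+8, x:x+8], norm="ortho")  # transform step
            symbols.append(np.round(coeffs / step))              # quantizing step
    return symbols
```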
The method according to a first embodiment of the invention allows for the prediction of units of a given image to be assessed with respect to the best coding method for the associated particular data set, including the possibility of using temporal prediction according to the HEVC standard, without immediately continuing the coding process.
The end coding into bit-stream data is done on a global level for an entire image but by using the most relevant coding method for each locally defined unit. Such an approach helps to reduce the complexity of the coding process, speeds it up, and thereby manages the bit-stream data produced. An additional advantage of such an approach is that it provides a codec architecture where all the picture prediction process can be done before the commencement of texture coding. The global image is processed through texture coding to obtain an up-sampled version of a base layer. The residual associated with the global image is altered into a prediction map of the original image, comprising prediction units and coding units adjusted to be most suitable for the data content of each unit.
The advantage of such an approach is that the whole residual data picture is known by the texture coding step, before starting to encode it. This knowledge is helpful to obtain good coding.
In a further embodiment of the invention, the transformed coefficients each have a coefficient type.
The transformation process of coding is dependent on the partitioning.
The advantage of this characteristic is that it helps reduce the bitrate used to signal the size of transform units.
In a further embodiment of the invention, the step of transforming further comprises: -determining transform units corresponding to transforms to be applied to pixel values of the residual image to obtain the transformed coefficients. In a further embodiment of the invention, the determination of transform units takes into account the pixel values of the residual image.
The advantage is to adapt the transform size to the content, so as to compact the energy contained in the residual data as much as possible.
In a further embodiment of the invention, the transform units are embedded in the pixel area corresponding to the prediction units.
The main advantage of this characteristic is that it helps reduce the bitrate used to signal the size of transform units.
In a further embodiment of the invention, the prediction mode using information from a corresponding base layer image comprises intra base layer prediction and/or base mode prediction.
The advantage of this characteristic is that it exploits the redundancy that exists between the base and the enhancement layer, thus improving the compression efficiency in the coding of the enhancement layer.
In a further embodiment of the invention, the method further comprises the step of: -signaling the prediction mode used.
The advantage of this characteristic is that it makes the bit-stream decodable.
In a further embodiment of the invention, the intra base layer prediction mode comprises predicting a prediction unit of an enhancement image with a spatially corresponding unit in an up-sampled corresponding base layer image.
In a further embodiment of the invention, the base mode prediction comprises predicting an enhancement prediction unit from a spatially corresponding unit of a base mode prediction image.
In a further embodiment of the invention, the intra base layer prediction and/or base mode prediction are substituted for special prediction modes comprised in the HEVC standard.
The advantage of this characteristic is that the prediction step, as applied on a given block of the picture to predict, does not depend at all on data already processed in the considered picture. Therefore, the entire picture can be predicted, without the need to perform texture coding and decoding of any previous block in the current picture.
In a further embodiment of the invention, the method comprises the further steps of: -inserting the prediction image resulting from intra layer prediction and/or base mode prediction into a list of reference pictures; -using the prediction image resulting from intra layer prediction and/or base mode prediction in the temporal prediction of an enhancement picture.
The advantage of this characteristic is to benefit from the INTER coding tools contained in the HEVC video compression standard, to further improve the compression efficiency in the coding of the enhancement layer.
In a further embodiment of the invention, the method comprises the further step of: -signaling the use of intra layer prediction and/or base mode by means of a dedicated reference picture index.
The advantage of this characteristic is that it helps signaling the inter-layer prediction mode used, in a way that is compliant with the use of HEVC temporal prediction tools.
In a further embodiment of the invention, the step of quantizing further comprises the step of: -quantizing at least one of the transformed coefficients based on a quantizer that depends on a given probabilistic model.
It should be noted that it is also possible for the quantizer to depend on a given statistical distribution.
It should also be considered that the quantizing may be expressed in terms of quantizing at least one of the transformed coefficients based on an estimated value representative of a ratio between a distortion variation by encoding a coefficient having the given type and a rate increase resulting from encoding said coefficient.
The quantizing of at least one of the transformed coefficients is based on a quantizer or quantifier that depends on a given statistical distribution (or probabilistic model). One example of such a statistical distribution is a Generalised Gaussian Distribution. The quantifier used may be, e.g., rate distortion optimal with respect to the chosen statistical distribution.
In a further embodiment of the invention: -the quantizing step is performed by a quantization method comprising determining, for at least one coefficient type, a probabilistic model based on statistical information computed on the residual image for that coefficient type; determining, for the at least one coefficient type, an average distortion target for the encoding of coefficients with the at least one coefficient type; selecting, for the at least one coefficient type, a quantizer from a pre-defined set of quantizers according to the determined probabilistic model and the determined distortion target.
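The selection step can be pictured with a small sketch: fit a per-channel statistic (a crude stand-in for the probabilistic model), then take the cheapest pre-computed quantizer whose distortion meets the target. The pool entries below (distortion per unit of variance, rate in bits per sample) are invented placeholders, not values from the patent.

```python
import numpy as np

# (name, distortion per unit variance, rate in bits/sample) - illustrative only
QUANTIZER_POOL = [
    ("q_coarse", 0.50, 0.5),
    ("q_mid",    0.10, 1.5),
    ("q_fine",   0.02, 3.0),
]

def select_quantizer(channel_samples: np.ndarray, distortion_target: float):
    variance = channel_samples.var()   # stand-in for the fitted probabilistic model
    feasible = [q for q in QUANTIZER_POOL if q[1] * variance <= distortion_target]
    # lowest-rate quantizer meeting the target; fall back to the finest one
    return min(feasible, key=lambda q: q[2]) if feasible else QUANTIZER_POOL[-1]
```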
In a further embodiment of the invention, symbols issued from the quantizing step undergo an entropy coding step, which takes into account pre-computed probability values associated to each interval of a selected rate distortion optimal quantizer.
The advantage of these characteristics is that they facilitate performing rate distortion optimal encoding of the transform coefficients issued from the DCT transform step.
In another aspect of the invention there is provided a method comprising: -adding a prediction image to a decoded enhancement layer residual image to obtain an image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image; -said enhancement layer residual image comprising quantized symbols, the decoding of the image comprising inverse quantizing these quantized symbols to obtain transformed coefficients. An inverse transformation step follows the inverse quantization step.
The advantage of this characteristic is that it allows predicting a whole enhancement picture to decode, before starting the texture decoding process for that enhancement picture.
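A matching decoding-side sketch, under the same assumptions as the encoding sketch given earlier (8x8 blocks, uniform quantization, illustrative names): inverse quantize the symbols, inverse transform each block, and add the prediction image.

```python
import numpy as np
from scipy.fft import idctn

def decode_image(symbols, prediction: np.ndarray, step: float = 8.0):
    h, w = prediction.shape
    residual = np.zeros((h, w))
    blocks = iter(symbols)
    for y in range(0, h, 8):
        for x in range(0, w, 8):
            coeffs = next(blocks) * step                          # inverse quantizing
            residual[y:y+8, x:x+8] = idctn(coeffs, norm="ortho")  # inverse transform
    return prediction + residual                                  # adding step
```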
In a further embodiment of the invention, the prediction modes using information from a corresponding base layer encoded picture comprise intra base layer prediction and/or base mode prediction.
The advantage of this characteristic is that it exploits the redundancy that exists between the base and the enhancement layer, thus improving the compression efficiency in the coding of the enhancement layer. In a further embodiment of the invention, the transformed coefficients are associated to transform units corresponding to transforms with different sizes, applied to pixel values of a residual image.
The advantage is to adapt the transform size to the content, so as to compact the energy contained in the residual data as much as possible.
In a further embodiment of the invention, the determination of transform units takes into account the pixel values of the residual image.
The advantage is to adapt the transform size to the content, in a way that is optimal from the rate distortion viewpoint.
In a further embodiment of the invention, the transform units are embedded in the pixel area corresponding to the prediction units.
The main advantage of this characteristic is that it helps reduce the bitrate used to signal the size of transform units.
In a further embodiment of the invention: -each transformed coefficient comprises a coefficient type; and -the inverse quantization step comprises determining a set of quantizers, comprising selection of an optimal quantizer for at least one coefficient type by use of a probabilistic model, parameters representative of said probabilistic model for at least one coefficient type being obtained before applying the decoding process.
The advantage of this characteristic is that it allows performing rate distortion optimal decoding of the transform coefficients issued from the entropy decoding step.
In another aspect of the invention, the quantised symbols have undergone an entropy decoding step, which takes into account pre-computed probability values associated to each interval of a selected rate distortion optimal quantizer.
In another aspect of the invention there is provided a computer program comprising instructions for carrying out each step of the method according to any one of the preceding claims when the program is loaded and executed by a programmable means.
In another aspect of the invention there is provided information storage means readable by a computer or a microprocessor storing instructions of a computer program, wherein it makes it possible to implement a method for encoding or decoding an image as described above.
In another aspect of the invention there is provided a device for encoding at least one image comprising pixels, the device comprising: -means for subtracting a prediction image from an original image to obtain a residual image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image; -means for transforming pixel values of the residual image to obtain transformed coefficients; -means for quantizing at least one of the transformed coefficients to obtain quantized symbols; -means for encoding the quantized symbols into encoded data.
In another aspect of the invention there is provided a device for decoding at least one image comprising pixels, the device comprising: -means for adding a prediction image to a decoded enhancement layer residual image to obtain an image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image; -said enhancement layer residual image comprising quantized symbols, the decoding of this image comprising inverse quantizing these quantized symbols.
In another aspect of the invention there is provided a method for encoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 3, 5, 9, 13, 15, 18 or 19 of the accompanying drawings. In another aspect of the invention there is provided a method for decoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 4, 6, 16 or 20 of the accompanying drawings. In another aspect of the invention there is provided a device for encoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 3, 5, 9, 13, 15, 18 or 19 of the accompanying drawings. In another aspect of the invention there is provided a device for decoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 4, 6, 16 or 20 of the accompanying drawings.
BRIEF DESCRIPTION OF THE FIGURES
The invention will now be further elucidated by reference to the figures.
Reference numerals presented in the figures are maintained across all figures when referring to the same object or concept.
Fig. 1 illustrates an example of a device for encoding or decoding images, capable of implementing one or more embodiments of the present invention.
Fig. 2 illustrates an all-INTRA configuration for scalable video coding (SVC).
Fig. 3 illustrates a scalable video encoder architecture in all-INTRA mode according to at least one embodiment.
Fig. 4 illustrates a scalable video decoder architecture, associated with the scalable video encoder architecture for all-INTRA mode (as shown in Fig. 3) according to at least one embodiment.
Fig. 5 illustrates the encoding process associated with the residuals of an enhancement layer according to at least one embodiment. Fig. 6 illustrates the decoding process consistent with the encoding process of Fig. 5 according to at least one embodiment.
Fig. 7 illustrates a low-delay temporal coding structure according to the HEVC standard.
Fig. 8 illustrates a random access temporal coding structure according to the HEVC standard.
Fig. 9 illustrates a standard video encoder, compliant with the HEVC standard for video compression.
Fig. 10 illustrates a block diagram of a scalable video encoder, compliant with the HEVC standard in the compression of the base layer.
Fig. 11 illustrates a block diagram of a decoder, compliant with standard HEVC or H.264/AVC and reciprocal to the encoder of Fig.9.
Fig. 12 illustrates a block diagram of a scalable decoder, compliant with standard HEVC or H.264/AVC in the decoding of the base layer, and reciprocal to the encoder of Fig.10.
Fig. 13 illustrates an embodiment of encoder architecture according to an embodiment of the invention.
Fig. 14 illustrates coding units and prediction unit concepts specified in the HEVC standard.
Fig. 15 illustrates prediction modes suitable for the scalable codec architecture, according to an embodiment of the invention.
Fig. 16 illustrates an embodiment of architecture of a scalable video decoder according to an embodiment of the invention.
Fig. 17 illustrates an example of the prediction information up-sampling process according to an embodiment of the invention.
Fig. 18 illustrates the construction of a Base Mode prediction picture.
Fig. 19 illustrates an algorithm according to an embodiment of the invention used to encode an INTER picture.
Fig. 20 illustrates an algorithm according to the invention used to decode an INTER picture, complementary to the encoding algorithm of Fig. 19.
Fig. 1 shows a device 100, in which one or more embodiments of the invention may be implemented, illustrated arranged in cooperation with a digital camera 101, a microphone 124 (shown via a card input/output 122), a telecommunications network 34 and a disc 116, comprising a communication bus 102 to which are connected:
* a central processing unit CPU 103, for example provided in the form of a microprocessor;
* a read only memory (ROM) 104 comprising a program 104A whose execution enables the methods according to an embodiment of the invention; this memory 104 may be a flash memory or EEPROM;
* a random access memory (RAM) 106 which, after powering up of the device 100, contains the executable code of the program 104A necessary for the implementation of an embodiment of the invention; being of random access type, the RAM 106 provides fast access compared to the ROM 104; in addition, the RAM 106 stores the various images and the various blocks of pixels as the processing is carried out on the video sequences (transform, quantization, storage of reference images, etc.);
* a screen 108 for displaying data, in particular video, and/or serving as a graphical interface with the user, who may thus interact with the programs according to an embodiment of the invention, using a keyboard 110 or any other means, e.g. a mouse or pointing device (not shown);
* a hard disk 112 or a storage memory, such as a memory of compact flash type, able to contain the programs of an embodiment of the invention as well as data used or produced on implementation of an embodiment of the invention;
* an optional disc drive 114, or another reader for a removable data carrier, adapted to receive a disc 116 and to read/write thereon data processed, or to be processed, in accordance with an embodiment of the invention;
* a communication interface 118 connected to a telecommunications network 34; and
* a connection to a digital camera 101.
The communication bus 102 permits communication and interoperability between the different elements included in the device 100 or connected to it. The representation of the communication bus 102 given here is not limiting. In particular, the CPU 103 may communicate instructions to any element of the device 100 directly or by means of another element of the device 100.
The disc 116 can be replaced by any information carrier such as a compact disc (CD-ROM), either writable or rewritable, a ZIP disc or a memory card. Generally, an information storage means, which can be read by a micro-computer or microprocessor, which may optionally be integrated in the device 100 for processing a video sequence, is adapted to store one or more programs whose execution permits the implementation of the method according to an embodiment of the invention.
The executable code enabling the coding device to implement an embodiment of the invention may be stored in ROM 104, on the hard disc 112 or on a removable digital medium such as a disc 116.
The CPU 103 controls and directs the execution of the instructions or portions of software code of the program or programs of an embodiment of the invention, the instructions or portions of software code being stored in one of the aforementioned storage means. On powering up of the device 100, the program or programs stored in non-volatile memory, e.g. hard disc 112 or ROM 104, are transferred into the RAM 106, which then contains the executable code of the program or programs of an embodiment of the invention, as well as registers for storing the variables and parameters necessary for implementation of an embodiment of the invention.
It should be noted that the device implementing an embodiment of the invention, or incorporating it, may be implemented in the form of a programmed apparatus. For example, such a device may then contain the code of the computer program or programs in a fixed form in an application specific integrated circuit (ASIC).
The device 100 described here and, particularly, the CPU 103, may implement all or part of the processing operations described below.
Fig. 2 illustrates the structure of a scalable video stream 20, when all pictures are encoded in INTRA mode. As shown, an all-INTRA coding structure consists of a series of pictures which are encoded independently from each other. The base layer 21 of the scalable video stream 20 is illustrated at the bottom of the figure.
In this base layer, each picture is INTRA coded and is usually referred to as an 'I' picture. INTRA coding involves predicting a macroblock or block from its directly neighbouring blocks within a single image or frame.
A spatial enhancement layer 22 is encoded on top of the base layer 21. It is illustrated at the top of Fig. 2. This spatial enhancement layer 22 introduces some spatial refinement information over the base layer. In other words, the decoding of this spatial layer leads to a decoded video sequence that has a higher spatial resolution than the base layer. The higher spatial resolution adds to the quality of the reproduced images.
As illustrated in the figure, each enhancement picture, denoted an 'EI' picture, is intra coded. An enhancement INTRA picture is encoded independently from other enhancement pictures. It is coded in a predictive way, by predicting it only from the temporally coincident picture in the base layer.
Fig. 3 illustrates a particular type of scalable video encoder architecture 30 for all-INTRA mode, here referred to as the INTRA LCC encoder. This coder is dedicated to the encoding of a spatial or SNR (signal to noise) enhancement layer on top of a standard coded base layer. The base layer is compliant with the HEVC or H.264/AVC video compression standard.
The overall architecture of the INTRA encoder 30 is now described. The input full resolution original picture 31 is down-sampled 30A to the base layer resolution level 32 and is encoded 30B with HEVC 33. This produces a base layer bit-stream 34.
The picture 31 is now represented by a base layer which is essentially at a lower resolution than the original. Then the base layer picture 33 is reconstructed 30C to produce a decoded base layer image 35 and up-sampled 30D to the top layer resolution in case of spatial scalability to produce an image 36. Thus information from only one (base) layer of the original picture 31 is now available. This constitutes a decrease in image data available and a lower quality image. The difference 30E with the original picture constitutes the spatial residual picture 37. The residual picture 37 is now subjected to the normal encoding process 30F which comprises transformation, quantisation and entropy operations. The processing is performed sequentially on macroblocks using a DCT (Discrete Cosine Transform) function, to produce a DCT profile over the global image area. Quantisation is performed by fitting the values taken by the DCT coefficients, per DCT channel, with GGD (Generalised Gaussian Distribution) functions. Use of such functions allows flexibility in the quantisation step, with a smaller step being available for more central regions of the curve. An optimal centroid position per quantisation step may also be applied to optimise the quantisation process. Entropy coding is then applied (e.g. using arithmetic coding) to the quantised data. The result is the enhancement layer 38 associated in the coding with the original picture 31. The enhancement layer is also converted into a bit-stream 39 with its associated parameters 39' (39 prime).
For down sampling, H.264/SVC down-sampling filters are used and for up sampling, the DCTIF interpolation filters of quarter-pel motion compensation in HEVC are used.
The resulting residual picture is encoded using DCT and quantization, which will be further elucidated with reference to Fig. 5. The resulting coded enhancement layer 38 consists of coded residual data as well as some parameters used to model DCT channels of the residual picture.
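The layering can be summarised in a short sketch, with nearest-neighbour scaling standing in for the H.264/SVC down-sampling and DCTIF up-sampling filters named above, and a stub base_codec in place of the HEVC encode/reconstruct steps 30B/30C; all names are illustrative.

```python
import numpy as np

def downsample(img: np.ndarray) -> np.ndarray:
    return img[::2, ::2]                         # stand-in for the SVC filters (30A)

def upsample(img: np.ndarray) -> np.ndarray:
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)  # stand-in for DCTIF (30D)

def intra_residual(original: np.ndarray, base_codec) -> np.ndarray:
    reconstructed = base_codec(downsample(original))  # steps 30B/30C, stubbed
    return original - upsample(reconstructed)         # step 30E: residual picture 37
```

With a lossless stub such as `base_codec = lambda b: b`, the residual reduces to the pure up-sampling error.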
As can be seen, this global architecture corresponds to classical scalable INTRA coding, where the spatial intra prediction and coding mode decision steps have been removed. The only prediction mode used in this INTRA scalable coder is the known inter-layer intra prediction mode.
Fig. 4 illustrates a scalable video decoder 40 associated with the type of scalable video encoder architecture 30 for all-INTRA mode (as shown in Fig. 3). The inputs to the decoder 40 are equivalent to the base layer bit-stream 34 and the enhancement layer bit-stream 39, with its associated parameters 39' (39 prime). The input bit-stream to that decoder comprises the HEVC-coded base layer 33, enhancement residual coded data 38, and parameters 39' of the DCT channels in the enhancement residual picture. First, the base layer is decoded 40A, which provides a reconstructed base picture 41. The reconstructed base picture 41 is up-sampled 40B to the enhancement layer resolution to produce an image 42. Then, the enhancement layer 38 is decoded 40C as follows. The residual data decoding process 40C is further described in association with Fig. 6. This process is invoked, which provides successive de-quantized DCT blocks 43. These DCT blocks are then inverse transformed and added to their co-located up-sampled block 40D. The so-reconstructed enhancement picture 44 finally undergoes HEVC post-filtering processes 40E, i.e. de-blocking filter, sample adaptive offset (SAO), and Adaptive Loop Filter (ALF). A filtered reconstructed image 45 is produced.
Fig. 5 illustrates the coding process 50 associated with the residuals of an enhancement layer, an example of which is picture 37 shown in Fig. 3. The coding process comprises transformation by DCT function, quantisation and entropy coding.
This process, embodying an embodiment of the invention, is also referred to as texture encoding. Note that this process applies to a complete residual picture, and does not proceed block by block, as in classical H.264/AVC or HEVC intra coding.
The input to the encoder 37 consists of a set of DCT blocks. Several DCT transform sizes are supported in the transform process: 16, 8 and 4. The transform size is flexible and is decided 50A according to the characteristics of the input data. The input residual picture 37 is first divided into 16x16 macroblocks. The transform size is decided for each macroblock as a function of its activity level in the pixel domain. Then the transform is applied 50B, which provides a frame of DCT blocks 51. The transforms used are the 4x4, 8x8 and 16x16 DCT, as defined in the HEVC standard.
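The per-macroblock decision could look like the following sketch, which measures pixel-domain activity as a simple variance; the two thresholds are invented for illustration, since the text does not specify the activity measure.

```python
import numpy as np

def choose_transform_size(macroblock: np.ndarray,
                          low: float = 25.0, high: float = 400.0) -> int:
    """Pick 16, 8 or 4 for one 16x16 macroblock from its pixel-domain activity."""
    activity = macroblock.var()
    if activity < low:
        return 16          # flat content: one large transform suffices
    if activity < high:
        return 8
    return 4               # busy content: small transforms follow the detail
```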
The next coding step comprises computing, by channel modelling 50C, a statistical model of each DCT channel 52. A DCT channel consists of the set of values taken by samples from all picture blocks at the same DCT coefficient position, for a given transform size. DCT coefficients are modelled by a Generalized Gaussian Distribution (GGD). For such a distribution, each DCT channel is assigned a quantifier. This non-uniform scalar quantifier 53 is defined by a set of quantization intervals and associated de-quantized sample values. A pool of such quantifiers 54 is available both on the encoder and on the decoder side. Various quantifiers are pre-computed off-line, through the Chou-Lookabaugh-Gray rate distortion optimization process.
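One plausible way to fit such a channel model, sketched here with SciPy's gennorm (a readily available GGD parameterisation; the text does not prescribe a fitting method):

```python
import numpy as np
from scipy.stats import gennorm

def model_dct_channel(channel_samples: np.ndarray):
    """Fit a zero-mean GGD to one DCT channel, i.e. the values taken by one
    coefficient position across all blocks of a given transform size.
    Returns the shape (beta) and scale parameters."""
    beta, _loc, scale = gennorm.fit(channel_samples, floc=0.0)
    return beta, scale
```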
The selection of the rate distortion optimal quantifier for a given DCT channel proceeds as follows. Given input coding parameters, a distortion target 55 is determined for the DCT channel under consideration. To do so, a distortion target allocation among various DCT channels, and among various block sizes, is performed.
The distortion allocation ensures that each DCT channel of each block size is encoded at a level that corresponds to an identical rate distortion slope among all coded DCT channels. This rate distortion slope depends on an input quality parameter, given by the user.
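As a worked illustration of the equal-slope rule, take the classical high-rate model D(R) = sigma^2 * 2^(-2R): its slope is dD/dR = -2 ln(2) D, so imposing the same slope lambda on every coded channel gives every channel the same distortion target lambda / (2 ln 2), while channels whose variance already lies below that target are simply left uncoded. This is an illustrative model, not the allocation mandated by the text.

```python
import math

def distortion_targets(channel_variances, rd_slope):
    """Equal-slope allocation under the model D(R) = var * 2**(-2R)."""
    common_target = rd_slope / (2.0 * math.log(2.0))
    return [min(var, common_target) for var in channel_variances]
```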
Once the distortion target 55 is obtained for each DCT channel, the right quantifier 53 to use is chosen 50D. As the rate distortion curve associated to each pre-computed quantifier is known (tabulated), this merely consists in choosing the quantifier that provides the minimal bitrate for a given distortion target.
Then DCT coefficients are quantized 50E to produce quantised DCT XQ values 56, and entropy coded 50F to produce a set of values H(XQ) 57. The entropy coder used consists of a simple, non-contextual, non-adaptive arithmetic coder. The arithmetic coding employs, for each DCT channel, a set of fixed probabilities, respectively associated to each pre-computed quantization interval. Therefore, these probabilities are entirely calculated off-line, together with the rate distortion optimal quantifiers. Probability values are never updated during the encoding or decoding processes, and are fixed for the whole picture being processed. In particular, this ensures the spatial random access feature, and also makes the decoding process highly parallelizable.
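The fixed probabilities can be pre-computed by integrating the fitted channel model over each quantization interval, e.g. as in this sketch (the boundary list is illustrative):

```python
import numpy as np
from scipy.stats import gennorm

def interval_probabilities(beta: float, scale: float, boundaries):
    """boundaries: increasing decision thresholds, e.g. [-inf, -4, -1, 1, 4, inf].
    Returns one fixed probability per quantization interval."""
    cdf = gennorm.cdf(np.asarray(boundaries, dtype=float), beta, loc=0.0, scale=scale)
    return np.diff(cdf)
```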
As a result of the proposed INTRA enhancement coding scheme, the enhancement layer bit-stream is made of the following syntax elements.
* Parameters of each coded DCT channel model 39' (39 prime). Two parameters are needed to fully specify a generalized Gaussian distribution. Therefore, two parameters are sent for each encoded DCT channel. These are sent only once for each picture.
* Chosen block sizes 58 are arithmetic encoded 50F. The probabilities used for their arithmetic coding are computed during the transform size selection, are quantized and fixed-length coded into the output bit-stream. These probabilities are fixed for the whole picture.
* Coded residual data 39 results from the entropy coding of quantized DCT coefficients.
Note that the above syntax elements represent the content of coded slice data in the scalable extension of HEVC. The NAL unit container of HEVC can be used to encapsulate a slice that is coded according to the coding scheme of Figure 5.
Fig. 6 depicts the INTRA decoding process 60 which corresponds to the encoding process illustrated in Fig. 5. The input to the decoder consists in the coded residual data 39 and the parametric model of DCT channels 39' (39 prime), for the input picture 37.
First, following a process similar to that effected in the encoder, the decoder determines the distortion target 55 of each DCT channel, given the parametric model of each coded DCT channel 39' (39 prime). Then, the choice of optimal quantizers (or quantifiers) 50D for each DCT channel is performed exactly in the same way as on the encoder side. Given the chosen quantifiers 53, and thus the probabilities of all quantized DCT symbols, the arithmetic decoder is able to decode the input coded residual data 39. This provides successive quantized DCT blocks, which are then inverse quantized 60A and inverse transformed 60B. The transform size of each DCT block is obtained from the entropy decoding step 60C.
Fig. 7 and Fig. 8 illustrate the video sequence structure in case of INTER coding, in so-called "low delay" and "random access" configurations, respectively.
These are the two coding structures comprised in the common test conditions in the HEVC standardization process.
Fig. 7 shows the low-delay temporal coding structure 70. In this configuration, an input image frame is predicted from several already coded frames.
Therefore, only forward temporal prediction, as indicated by arrows 71, is allowed, which ensures the low delay property. The low delay property means that on the decoder side, the decoder is able to display a decoded picture straight away once this picture is in a decoded format, as represented by arrow 72. Note: the input video sequence is shown as comprised of a base layer 73 and an enhancement layer 74, which are each further comprised of a first image frame I and subsequent image frames B. In addition to temporal prediction, inter-layer prediction between the base 73 and enhancement layer 74 is also illustrated in Fig. 7 and referenced by arrows, including arrow 75. Indeed, the scalable video coding of the enhancement layer 74 aims to exploit the redundancy that exists between the coded base layer 73 and the enhancement layer 74, in order to provide good coding efficiency in the enhancement layer 74.
As a consequence, several prediction modes can be employed in the coding of enhancement pictures. This type of standard HEVC coding can be rendered compatible with the texture coding according to an embodiment of the present invention detailed above.
Fig. 8 illustrates the random access temporal coding structure 80 e.g. as defined in the HEVC standard. The input sequence is broken down into groups of pictures, here indicated by arrows GOP. The random access property means that several access points are enabled in the compressed video stream, i.e. the decoder can start decoding the sequence at a frame which is not necessarily the first frame in the sequence. This takes the form of periodic INTRA picture coding in the stream as illustrated by figure 8.
In addition to INTRA pictures, the random access coding structure allows INTER prediction; both forward 81 and backward 82 (in relation to the display order as represented by arrow 83) predictions can be effected. This is achieved by the use of B pictures, as illustrated. The random access configuration also provides a temporal scalability feature, which takes the form of the hierarchical B pictures, B1 to B3 as illustrated, the organization of which is shown in the figure.
As for the low delay coding structure of Fig. 7, additional prediction tools are used in the coding of enhancement pictures: inter-layer prediction tools. This type of standard HEVC coding can be rendered compatible with the texture coding according to an embodiment of the present invention detailed above.
The goal is to design a temporal and inter-layer prediction scheme that is compliant with the texture codec of Figures 5 and 6, and which is efficient. By efficient, one means that this prediction scheme provides prediction values which are as close to the original picture as possible, in order to favor compression efficiency in the enhancement layer.
To achieve this goal, the prediction process must respect the following property. An embodiment of the invention needs to have a full residual picture to be able to perform a DCT transform and DCT channel modeling over the complete picture area, i.e. globally. Therefore, the prediction process must provide a full prediction picture of an enhancement picture to encode, before starting to transform, quantize and encode this enhancement picture. In other words, when predicting a (e.g. rectangular) block of the enhancement picture, the prediction must not depend on neighboring pixel values of the block. Indeed, in the opposite case it would be necessary to encode and reconstruct those neighboring blocks before computing the prediction of the current block, which is not compliant with the decoding process according to an embodiment of the invention.
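The property can be illustrated by a sketch that assembles the whole prediction picture up front: every block's predictor comes either from the up-sampled base layer or from an already-decoded reference picture, never from neighbouring pixels of the current picture. The data structures are hypothetical.

```python
import numpy as np

def build_prediction_picture(block_modes, motion, references, base_up, size=16):
    """block_modes maps (y, x) -> "INTRA_BL" or "INTER"; motion maps (y, x) to
    (reference index, (dy, dx)). No current-picture pixels are ever read."""
    pred = np.empty_like(base_up, dtype=np.float64)
    for (y, x), mode in block_modes.items():
        if mode == "INTRA_BL":                       # inter-layer intra prediction
            pred[y:y+size, x:x+size] = base_up[y:y+size, x:x+size]
        else:                                        # temporal, motion-compensated
            ref_idx, (dy, dx) = motion[(y, x)]
            ref = references[ref_idx]
            pred[y:y+size, x:x+size] = ref[y+dy:y+dy+size, x+dx:x+dx+size]
    return pred                                      # complete before texture coding
```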
Fig. 9 illustrates a standard video encoding device, of a generic type, conforming to the HEVC or H.264/AVC video compression system. A block diagram 90 of a standard HEVC or H.264/AVC encoder is shown. The input to this non-scalable encoder consists in the original sequence of frame images 91 to compress. The encoder successively performs the following steps to encode a standard video bit-stream. A first picture or frame to be encoded (compressed) is divided into pixel blocks, called coding units in the HEVC standard. The first picture is thus split into blocks or macroblocks 92. Each block first undergoes a motion estimation operation 93, which comprises a search, among the reference pictures stored in a dedicated memory buffer 94, for reference blocks that would provide a good prediction of the block. This motion estimation step provides one or more reference picture indexes which contain the found reference blocks, as well as the corresponding motion vectors. A motion compensation step 95 then applies the estimated motion vectors on the found reference blocks and copies the so-obtained blocks into a temporal prediction picture.
Moreover, an Intra prediction step 96 determines the spatial prediction mode that would provide the best performance to predict the current block and encode it in INTRA mode.
Afterwards, a coding mode selection mechanism 97 chooses the coding mode, among the spatial and temporal predictions, which provides the best rate distortion trade-off in the coding of the current block. The difference between the current block 92 (in its original version) and the so-chosen prediction block (not shown) is calculated. This provides the (temporal or spatial) residual to compress. The residual block then undergoes a transform (DCT) and a quantization 98. Entropy coding 99 of the so-quantized coefficients QTC (and associated motion data MD) is performed. The compressed texture data 100 associated to the coded current block 92 is sent for output.
Finally, the current block is reconstructed by scaling and inverse transform 101. This comprises inverse quantization and inverse transform, followed by a sum between the inverse transformed residual and the prediction block of the current block.
Once the current picture is reconstructed and deblocked 102, it is stored in a memory buffer 94 (the DPB, Decoded Picture Buffer) so that it is available for use as a reference picture to predict any subsequent pictures to be encoded.
Finally, a last entropy coding step is given the coding mode and, in case of an inter block, the motion data, as well as the quantized DCT coefficients previously calculated. This entropy coder encodes each of these data into their binary form and encapsulates the so-encoded block into a container called a NAL unit (Network Abstraction Layer unit). A NAL unit contains all encoded coding units from a given slice. A coded HEVC bit-stream consists in a series of NAL units.
Fig. 10 illustrates a block diagram of a scalable video encoder, which comprises a straightforward extension of the standard video coder of Fig. 9, towards a scalable video coder. This video encoder may comprise a number of subparts or stages; illustrated here are two subparts or stages A10 and B10 producing data corresponding to a base layer 103 and data corresponding to one enhancement layer 104. Each of the subparts A10 and B10 follows the principles of the standard video encoder 90, with the steps of transformation, quantisation and entropy coding being applied in two separate paths, one corresponding to each layer.
The first stage B10 aims at encoding the H.264/AVC or HEVC compliant base layer of the output scalable stream, and hence is identical to the encoder of Fig. 9. Next, the second stage A10 illustrates the coding of an enhancement layer on top of the base layer. This enhancement layer brings a refinement of the spatial resolution to the (down-sampled 107) base layer. As illustrated in Fig. 10, the coding scheme of this enhancement layer is similar to that of the base layer, except that for each coding unit of a current picture 91 being compressed or coded, an additional prediction mode can be chosen by the coding mode selection module 105. This new coding mode corresponds to the inter-layer prediction 106. Inter-layer prediction 106 consists in re-using the data coded in a layer lower than the current refinement or enhancement layer, as prediction data of the current coding unit. The lower layer used is called the reference layer for the inter-layer prediction of the current enhancement layer. In case the reference layer contains a picture that temporally coincides with the current picture, then it is called the base picture of the current picture. The co-located block (at the same spatial position) of the current coding unit that has been coded in the reference layer can be used as a reference to predict the current coding unit. More precisely, the prediction data that can be used in the co-located block corresponds to the coding mode, the block partition, the motion data (if present) and the texture data (temporal residual or reconstructed block). In case of a spatial enhancement layer, some up-sampling 108 operations of the texture and prediction data are performed.
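For the texture part of inter-layer prediction, the co-located predictor is simply the block at the same spatial position in the (up-sampled 108) reconstructed base picture, as in this small illustrative helper:

```python
import numpy as np

def colocated_predictor(base_up: np.ndarray, y: int, x: int, size: int) -> np.ndarray:
    """base_up: base picture reconstruction already up-sampled to the
    enhancement resolution; (y, x): position of the current coding unit."""
    return base_up[y:y+size, x:x+size].copy()
```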
In conclusion, standard video coding approaches as detailed in association with Figs. 9 and 10, proceed on a block basis. In particular, all treatments related to the encoding of one coding unit (or block) are fully done before starting to process and to code a subsequent block in the picture or frame. This means the spatial prediction, temporal prediction, coding mode decision and actual coding are fully processed before considering a next block in the picture. This is mainly due to the dependency that exists between neighbouring blocks in the pictures to be coded. Indeed, in standard video coders like HEVC or H.264/AVC, some causal dependencies exist between spatial neighbouring macroblocks of a coded picture. These dependencies mainly arise from: -Prediction information, i.e. spatial intra prediction parameters (direction used in the spatial prediction) and temporal prediction information (e.g. motion vectors) are encoded in a predictive way from block to block. This creates a dependency between successive coded blocks. (Such predictive coding of prediction information is acceptable and can be maintained in the design of the scalable video codec addressed by an embodiment of this invention).
-Spatial prediction in the pixel domain creates a dependency between neighbouring blocks on the texture level. More precisely, a given block needs to be fully available in its decoded (or reconstructed) version in the pixel domain, before the standard coder starts to process the next block in the picture. This is necessary so that texture spatial prediction from block to block is done in a perfectly synchronized way on the encoder and on the decoder side. Such spatial dependency between neighbouring blocks in the pixel domain is not compliant with the use of the texture coding process of Fig. 5 to encode INTER pictures. One of the problems addressed by an embodiment of this invention is how to design a scalable video coding framework, so that the texture coding process of Fig. 5 can be used to encode INTER pictures of a scalable refinement layer.
Fig. 11 provides a block diagram of a standard HEVC or H.264/AVC decoding system 1100. The decoding process of an H.264 bit-stream 1110 starts with the entropy decoding 1120 of each block (array of pixels) of each coded picture in the bit-stream. This entropy decoding provides the coding mode, the motion data (reference picture indexes, motion vectors of INTER coded macroblocks) and residual data. This residual data consists of quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization (scaling) and inverse transform operations 1130.
The decoded residual is then added to the temporal 1140 or INTRA 1150 prediction macroblock of current macroblock, to provide the reconstructed macroblock.
The choice 1125 between INTRA or INTER prediction depends on the prediction mode information which is provided by the entropy decoding step.
The reconstructed macroblock finally undergoes one or more in-loop post-filtering processes, e.g. deblocking 1160, which aim at reducing the blocking artefacts inherent to any block-based video codec and improving the quality of the decoded picture.
The full post-filtered picture is then stored in the Decoded Picture Buffer (DPB), represented by the frame memory 1170, which stores pictures that will serve as references to predict future pictures to decode. The decoded pictures 1160 are also ready to be displayed on screen.
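By way of illustration only, the per-block reconstruction performed by such a decoder can be sketched in a few lines of Python. This is a minimal sketch, not part of any standard text: the residual and prediction arrays stand for the outputs of steps 1130 and 1140/1150, and the helper name reconstruct_block is hypothetical.

```python
import numpy as np

def reconstruct_block(residual: np.ndarray, prediction: np.ndarray,
                      bit_depth: int = 8) -> np.ndarray:
    """Add the decoded residual (cf. 1130) to the INTER (1140) or INTRA
    (1150) prediction of the current block, clipping to the sample range."""
    max_val = (1 << bit_depth) - 1
    recon = prediction.astype(np.int32) + residual.astype(np.int32)
    return np.clip(recon, 0, max_val).astype(np.uint8)

# Example: a flat 8x8 prediction block plus a small random residual.
prediction = np.full((8, 8), 128, dtype=np.uint8)
residual = np.random.randint(-10, 10, size=(8, 8))
print(reconstruct_block(residual, prediction))
```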
Fig. 12 presents a block diagram of a scalable decoder 1200 which would apply on a scalable bit-stream made of two scalability layers, e.g. comprising a base layer and an enhancement layer. This decoding process is thus the reciprocal of the scalable coding process of Fig. 10. The scalable stream being decoded 1210, as shown in Fig. 12, is made of one base layer and one spatial enhancement layer on top of the base layer, which are demultiplexed 1220 into their respective layers.
The first stage of Fig. 12 concerns the base layer decoding process 312.
As previously explained for the non-scalable case, this decoding process starts by entropy decoding 1120 each coding unit or block of each coded picture in the base layer. This entropy decoding 1120 provides the coding mode, the motion data (reference picture indexes, motion vectors of INTER coded macroblocks) and residual data. This residual data consists of quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization and inverse transform operations 1130. Motion compensation 1140 or Intra prediction 1150 data can be added 12G.
Deblocking 1160 is effected. The so-reconstructed picture data is then stored in the frame buffer 1170.
Next, the decoded motion and temporal residual for INTER blocks, and the reconstructed blocks, are stored into a frame buffer in the first stage of the scalable decoder of Fig. 12. Such frames contain the data that can be used as reference data to predict an upper scalability layer.
Next, the second stage of Fig. 12 performs the decoding of a spatial enhancement layer A12 on top of the base layer decoded by the first stage. This spatial enhancement layer decoding involves the entropy decoding of the second layer 1210, which provides the coding modes, motion information as well as the transformed and quantized residual information of blocks of the second layer.
The next step consists in predicting blocks in the enhancement picture. The choice 1215 between different types of block prediction (INTRA, INTER or inter-layer) depends on the prediction mode obtained from the entropy decoding step 1210.
Concerning INTRA blocks, their treatment depends on the type of INTRA coding unit.
-In case of an inter-layer predicted INTRA block (Intra-BL coding mode), the result of the entropy decoding 1210 undergoes inverse quantization and inverse transform 1211, and then is added 12D to the co-located block of current block in the base picture, in its decoded, post-filtered and up-sampled (in case of spatial scalability) version.
-In case of a non-Intra-BL INTRA block, such a block is fully reconstructed through inverse quantization and inverse transform to obtain the residual data in the spatial domain, followed by INTRA prediction 1230 to obtain the fully reconstructed block 1250.
Concerning INTER blocks, their reconstruction involves their motion compensated 1240 temporal prediction, the residual data decoding and then the addition of their decoded residual information to their temporal predictor. In this INTER block decoding process, inter-layer prediction can be used in two ways. First, the motion vectors associated to the considered block can be decoded in a predictive way, as a refinement of the motion vector of the co-located block in the base picture.
Second, the temporal residual can also be inter-layer predicted from the temporal residual of the co-sited block in the base layer.
Note that in a particular scalable coding mode of the block, all the prediction information of the block (e.g. coding mode, motion vector) may be fully inferred from the co-located block in the base picture. Such a block coding mode is known as the so-called "base mode" in the state of the art.
Fig. 13 illustrates an encoder architecture 1300 according to an embodiment of the current invention. The goal of this scalable codec design is to exploit inter-layer redundancy in an efficient way through inter-layer prediction, while enabling the use of the low-complexity texture encoder of Fig. 5.
The diagram of Fig. 13 illustrates the base layer coding, and the enhancement layer coding process for a given picture of a scalable video, as proposed by an embodiment of the invention.
The first stage of the process corresponds to the processing of the base layer, and is illustrated on the bottom part of the figure, 1300A.
First, the input picture to code 1310 is down-sampled 13A to the spatial resolution of the base layer, giving a raw base layer 1320. Then it is encoded 13B in an HEVC compliant way, which leads to the encoded base layer 1330 and the associated base layer bit-stream 1340.
In the next step, some information is extracted from the coded base layer that will be useful afterwards in the inter-layer prediction of the enhancement picture.
The extracted information comprises at least:
-The reconstructed (decoded) base picture 1350 which is later used for inter-layer texture prediction.
-The prediction information 1370 of the base picture, which is used in several inter-layer prediction tools in the enhancement picture. It comprises, among others, coding unit information, prediction unit partitioning information, prediction modes, motion vectors, reference picture indices, etc.
-Temporal residual data 1360, used for temporal prediction in the base layer, which is also extracted from the base layer and is used next in the prediction of the enhancement picture.
Once all this information has been extracted from the coded base picture, it undergoes an up-sampling process, which aims at adapting this information to the spatial resolution of the enhancement layer. The up-sampling of the extracted base information is effected as described below, for the three types of data listed above.
-With respect to the reconstructed base picture 1350, it is up-sampled to the spatial resolution of the enhancement layer 1380A. In the same way as for the INTRA LCC coder of Fig. 3, an interpolation filter corresponding to the DCTIF 8-tap filter used for motion compensation in HEVC is employed.
-The base prediction information 1370 is transformed, so as to obtain a coding unit representation that is adapted to the spatial resolution of the enhancement layer 1380C. The prediction information up-sampling mechanism is introduced below.
-The temporal residual information 1360 associated to INTER predicted blocks in the base layer is collected into a picture buffer, and is up-sampled 1380B by means of a 2-tap bi-linear interpolation filter. This bi-linear interpolation of residual data is identical to that used in the former H.264/SVC scalable video coding standard. A short illustrative sketch of both filters follows.
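The sketch below performs dyadic (2x) 1-D up-sampling with, on one hand, the 8-tap DCTIF half-sample filter of HEVC (used here for texture) and, on the other hand, the 2-tap bi-linear filter (used for residual data). This is a simplified assumption-laden illustration: the function name is invented, the filter is applied separably to rows then columns in practice, and border handling or non-dyadic ratios are not covered.

```python
import numpy as np

DCTIF_HALF = np.array([-1, 4, -11, 40, 40, -11, 4, -1]) / 64.0  # HEVC 8-tap half-pel
BILINEAR_HALF = np.array([0.5, 0.5])                            # 2-tap residual filter

def upsample_2x_1d(line: np.ndarray, taps: np.ndarray) -> np.ndarray:
    """Dyadic up-sampling of one line: integer positions are copied,
    half positions are interpolated with the given filter (edge-padded)."""
    out = np.empty(2 * len(line))
    out[0::2] = line
    pad = len(taps) // 2
    padded = np.pad(line.astype(float), (pad - 1, pad), mode="edge")
    for i in range(len(line)):
        out[2 * i + 1] = np.dot(taps, padded[i:i + len(taps)])
    return out

# Texture samples interpolated with the DCTIF, residual with the bi-linear filter.
row = np.array([100, 110, 120, 130], dtype=float)
print(upsample_2x_1d(row, DCTIF_HALF))
print(upsample_2x_1d(row, BILINEAR_HALF))
```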
Once all the information extracted from the base layer is available in its up-sampled form, then the encoder is ready to predict 13C the enhancement picture. The prediction process used in the enhancement layer is at the core of an embodiment of the invention and is executed in a strictly identical way on the encoder side and on the decoder side.
The prediction process consists in selecting the enhancement picture organization in a rate distortion optimal way in terms of coding unit (CU) representation, prediction unit (PU) partitioning and prediction mode selection. (These concepts are further defined later in connection with Fig. 14, and form part of the HEVC standard).
Fig. 14 depicts the coding unit and prediction unit concepts specified in the HEVC standard. An HEVC coded picture is made of a series of coding units. A coding unit of an HEVC picture corresponds to a square block of that picture, and can have a size ranging from 8x8 to 64x64 pixels. A coding unit which has the largest size authorized for the considered picture is also called a Largest Coding Unit (LCU) 1410.
For each coding unit of the enhancement picture, the encoder decides how to partition it into one or several prediction units (PU) 1420. Each prediction unit can have a square or rectangular shape and is given a prediction mode (INTRA or INTER) and some prediction information. With respect to INTRA prediction, the associated prediction parameters consist in the angular direction used in the spatial prediction of the considered prediction unit, associated with corresponding spatial residual data. In case of INTER prediction, the prediction information comprises the reference picture indices and the motion vector(s) used to predict the considered prediction unit, and the associated temporal residual texture data. Illustrations 14A to 14H show some of the possible arrangements of partitioning which are available.
Referring again to Fig. 13, the prediction process 13C attempts to construct a whole prediction picture 1391 of current enhancement picture to code. To do so, it determines the best rate distortion trade-off between the quality of that prediction picture and the rate cost of the prediction information to encode.
The outputs of this prediction process are the following:
-A set of coding units with associated sizes, which covers the whole prediction picture.
-For each coding unit, a partitioning of this coding unit into one or several prediction units. Each prediction unit is selected among all the prediction unit shapes allowed by the HEVC standard, which are illustrated at the bottom of Fig. 14.
-For each prediction unit, a prediction mode decided for that prediction unit, together with the prediction parameters associated with that prediction unit.
Therefore, for each candidate coding unit in the enhancement picture, the prediction process of Fig. 13 determines the best prediction unit partitioning and prediction unit parameters in that candidate CU.
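The criterion behind this decision can be written as the classical Lagrangian cost J = D + λ·R. The following Python lines are a purely illustrative sketch of such a selection; the candidate tuples and the lambda value are made-up numbers, and the real encoder measures distortion and rate on actual coded data.

```python
def best_rd_choice(candidates, lam):
    """Return the candidate (distortion, rate, label) minimizing D + lam * R."""
    return min(candidates, key=lambda c: c[0] + lam * c[1])

# Hypothetical costs for one prediction unit (distortion, rate in bits, mode):
candidates = [(1500.0, 120, "INTER"),
              (1700.0,  40, "Intra BL"),
              (1650.0,  55, "Base Mode")]
print(best_rd_choice(candidates, lam=10.0))  # -> (1700.0, 40, 'Intra BL')
```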
In particular, for a given prediction unit partitioning of the CU, the prediction process searches the best prediction type for that prediction unit. In HEVC, each prediction unit is given the INTRA or INTER prediction mode. For each mode, prediction parameters are determined. INTER prediction mode consists in the motion compensated temporal prediction of the prediction unit. This uses two lists of past and future reference pictures, depending on the temporal coding structure used (see Fig. 7 and Fig. 8). This temporal prediction process as specified by HEVC is re-used here.
This corresponds to the prediction mode called "HEVC temporal predictors" 1390 on Fig. 13. Note that in the temporal predictor search, the prediction process searches the best one or two (respectively for uni- and bi-directional prediction) reference blocks to predict a current prediction unit of current picture.
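As a rough illustration of such a reference block search, the sketch below performs an exhaustive integer-pel SAD search in one reference picture. It is a deliberate simplification: the real HEVC search also covers sub-pel positions (via the DCTIF filters), multiple reference pictures and bi-directional combinations.

```python
import numpy as np

def full_search(cur_block, ref_pic, x, y, search_range=8):
    """Exhaustive SAD search around (x, y); returns the best (dx, dy) and its SAD."""
    h, w = cur_block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            yy, xx = y + dy, x + dx
            # Keep the candidate block inside the reference picture.
            if 0 <= yy <= ref_pic.shape[0] - h and 0 <= xx <= ref_pic.shape[1] - w:
                cand = ref_pic[yy:yy + h, xx:xx + w]
                sad = int(np.abs(cur_block.astype(np.int32) - cand).sum())
                if sad < best_sad:
                    best_mv, best_sad = (dx, dy), sad
    return best_mv, best_sad
```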
INTRA prediction in HEVC consists in predicting a prediction unit with the help of neighboring PUs of current prediction unit that are already coded and reconstructed. Such a spatial prediction process cannot be used in the proposed system, because it is not compliant with the use of the texture coding process of Fig. 5.
As a consequence, the spatial prediction process of HEVC has been replaced in the coder of Fig. 13 by two prediction types, called "Intra BL" and "Base Mode". The Intra BL prediction type consists of predicting a prediction unit of the enhancement picture with the spatially corresponding block in the up-sampled decoded base picture. The "Base Mode" prediction mode consists in predicting an enhancement prediction unit from the spatially corresponding block in a so-called "Base Mode prediction picture". This Base Mode prediction picture is constructed with the help of inter-layer prediction tools. The construction of this base mode prediction picture is explained in detail below, with reference to Fig. 18. Briefly, it is constructed by predicting current enhancement picture by means of the up-sampled prediction information and temporal residual data that has previously been extracted from the base layer and re-sampled to the enhancement spatial resolution.
Note that the "Intra BL" and "Base Mode" prediction modes try to exploit the redundancy that exists between the underlying base picture and current enhancement picture. They correspond to the so-called inter-layer prediction tools that we have introduced into the HEVC coding system.
The "rate distortion optimal mode decision" of Figure 13 results in the following elements.
-A set of coding unit representations with associated prediction information for current picture. This is called prediction information 1392 on Fig. 13. All this information then undergoes a prediction information coding step, which constitutes a part of the coded video bit-stream. Note that in this prediction information coding, the two inter-layer prediction modes, i.e. Intra BL and Base Mode, are signaled as particular INTRA prediction modes. As a result, in terms of prediction information coding, the spatial prediction modes of HEVC are all removed and two INTRA prediction modes are used instead.
Note that according to another embodiment, the "Intra BL" and "Base Mode" prediction pictures of Fig. 13 can be inserted into the list of reference pictures used in the temporal prediction of current enhancement picture.
-A picture 1391, which represents the final prediction picture of current enhancement picture to code. This picture is then used to encode the texture data part of current enhancement picture.
The next encoding step illustrated in Fig. 13 consists of computing the difference 1393 between the original picture and the obtained prediction picture. This difference comprises the residual data of current enhancement picture 1394, which is then processed by the texture coding process 130, as described above. The process provides encoded DCT X values 1395, which comprise the enhancement coded texture for output, and decoder information, such as the parameters of the channel model 1397, for output. A further available output is the enhancement coded prediction information 1398 derived from the prediction information 1392.
Fig. 15 summarizes all the prediction modes that can be used in the proposed scalable codec architecture, according to an embodiment of the invention, used to predict a current enhancement picture. Schematic 1510 corresponds to the current enhancement picture to predict. The base picture 1520 corresponds to the base layer decoded picture that temporally coincides with current enhancement picture.
Schematic 1530 corresponds to an example reference picture in the enhancement layer used for the temporal prediction of current picture 1510. Finally, schematic 1540 corresponds to the Base Mode prediction picture introduced above in association with Fig. 13.
As illustrated by Fig. 15, and as explained above, the prediction of current enhancement picture 1510 consists in determining, for each block 1550 in current enhancement picture 1510, the best available prediction mode for that block 1550, considering temporal prediction, Intra BL prediction and Base Mode prediction.
Fig. 15 also illustrates the fact that the prediction information contained in the base layer is extracted, and then is used in two different ways.
First, the prediction information of the base layer is used to construct 1560 the "Base Mode" prediction picture 1540. This construction is discussed below with reference to Fig. 16.
Second, the base layer prediction information is used in the predictive coding 1570 of motion vectors in the enhancement layer. Therefore, the INTER prediction mode illustrated on Fig. 15 makes use of the prediction information contained in the base picture 1520. This allows inter-layer prediction of the motion vectors of the enhancement layer, and hence increases the coding efficiency of the scalable video coding system.
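A minimal sketch of this inter-layer motion vector prediction 1570 is given below, assuming a dyadic spatial ratio: the base-layer vector of the co-located block, once re-scaled, acts as the predictor and only the difference is entropy-coded. The function name and the integer vector representation are illustrative assumptions.

```python
def mv_difference(enh_mv, base_mv, scale=2):
    """Code the enhancement MV predictively from the re-scaled base-layer MV."""
    pred = (base_mv[0] * scale, base_mv[1] * scale)
    return (enh_mv[0] - pred[0], enh_mv[1] - pred[1])  # MVD to entropy-code

# Example: base MV (3, -1) predicts enhancement MV (7, -2), leaving MVD (1, 0).
print(mv_difference((7, -2), (3, -1)))
```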
Fig. 16 depicts an architecture of the scalable video decoder 1600 according to an embodiment of the invention. This decoder architecture performs the reciprocal process of the encoding process of Fig. 13.
The inputs to the decoder illustrated in Fig. 16 are:
-The coded base layer bit-stream 1601.
-The coded enhancement layer bit-stream 1602.
The first stage of the decoding process 16A corresponds to the base layer, starting with the decoding 16A' (prime) of the base layer encoded base picture 1610.
This decoding is then followed by the preparation of all the data useful for the inter-layer prediction of the enhancement layer. The data extracted from the base layer decoding step is of three types:
-The decoded base picture 1611 undergoes a spatial up-sampling step 16C, in order to form the "Intra BL" prediction picture 1612. The up-sampling process 16C used here is identical to that of the encoder (Fig. 13).
-The prediction information contained in the base layer (base motion information 1613) is extracted and re-sampled 16D towards the spatial resolution of the enhancement layer. The prediction information up-sampling process is the same as that used on the encoder side.
-The temporal residual texture data contained in the base layer (base residual 1615) is extracted and up-sampled 16E, in the same way as on the encoder side, to give up-sampled residual information 1614.
Once all the base layer texture and prediction information has been up-sampled, it is used to construct the "Base Mode" prediction picture 1616, exactly in the same way as on the encoder side.
Next, the processing of the enhancement layer 16B is effected as illustrated in the upper part of Fig. 16. This begins with the entropy decoding 16F of the prediction information contained in the enhancement layer bit-stream, to provide decoded prediction information 1630. This, in particular, provides the coding unit organization of the enhancement picture, as well as its partitioning into prediction units, and the prediction mode (coding mode 1631) associated to each prediction unit.
Once the prediction mode of each prediction unit of the enhancement picture is obtained, the decoder 1600 is able to construct the final complete prediction picture 1650 that was used in the encoding of current enhancement picture.
The next decoder steps then consist of decoding 16G the texture data (encoded DCT X 1632) associated to current enhancement picture. This LCC texture decoding process follows the same process as explained above with reference to Fig. 6 and produces decoded residual data XdeQ 1633. The channel model parameters 1634 are also entropy decoded and are used as part of the texture decoding 16G.
Once the entire residual picture 1633 is obtained from the texture decoding process, it is added 16H to the prediction picture 1650 previously constructed. This leads to the decoded current enhancement picture 1635 which, optionally, undergoes some in-loop post-filtering process 16I. Such processing may comprise the HEVC deblocking filter, Sample Adaptive Offset (specified by HEVC) and Adaptive Loop Filtering (also specified by the HEVC standard).
The decoded picture 1660 is ready for display and the individual frames can each be stored as a decoded reference picture 1661, which may be useful for motion compensation 16J in association with the HEVC temporal predictor 1670, as applied for subsequent frames.
Fig. 17 depicts the prediction information up-sampling process, executed both by the encoder and the decoder in order to construct the "Base Mode" prediction picture, e.g. 1540. The prediction information up-sampling step is a useful means to perform inter-layer prediction.
The left side 1710 of Fig. 17 illustrates a part of the base layer picture. In particular, the coding unit representation that has been used to encode the base picture is illustrated, for the first two LCUs (Largest Coding Units) of the picture, 1711 and 1712. The LCUs have a height and a width, represented by arrows 1713 and 1714, respectively, and an identification number 1715, here shown running from zero to two.
The coding unit quad-tree representation of the second LCU 1712 is illustrated, as well as prediction unit (PU) partitions, e.g. partition 1716. Moreover, the motion vector associated to each prediction unit, e.g. vector 1717 associated with prediction unit 1716, is shown.
On the right side of Fig. 17, which shows the enhancement layer sizing 1750 of the base layer 1710, the result of the prediction information up-sampling process can be seen. In this figure, the LCU size (height and width indicated by arrows 1751 and 1752, respectively) is the same in the enhancement picture and in the base picture, i.e. the base picture LCU has been magnified. As can be seen, the up-sampled version of base LCU 1712 results in the enhancement LCUs 2, 3, 6 and 7 (references 1753, 1754, 1755 and 1756, respectively). The individual prediction units exist in a scaling relationship known as a quad-tree. Note that the coding unit quad-tree structure of coding unit 1712 has been re-sampled in 1750 as a function of the scaling ratio that exists between the enhancement picture and the base picture. The prediction unit partitioning is of the same type (i.e. the corresponding prediction units have the same shape) in the enhancement layer and in the base layer. Finally, motion vector coordinates, e.g. 1757, have been re-scaled as a function of the spatial ratio between the two layers.
In other words, three main steps are involved in the prediction information up-sampling process (a short sketch follows this list):
-The coding unit quad-tree representation is first up-sampled. To do so, a depth parameter of the base coding unit is decreased by one in the enhancement layer.
-The coding unit partitioning mode is kept the same in the enhancement layer, compared to the base layer. This leads to prediction units with an up-scaled size in the enhancement layer, which have the same shape as their corresponding prediction unit in the base layer.
-The motion vector is re-sampled to the enhancement layer resolution, simply by multiplying associated x and y coordinates by the appropriate scaling ratio.
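The sketch announced above gathers these three steps for a dyadic (2x) ratio. The dictionary layout of a coding unit is a made-up convenience; only the three operations (depth decrement, partition type kept, vector scaling) reflect the process of Fig. 17.

```python
def upsample_prediction_info(base_cu: dict, scale: int = 2) -> dict:
    """Up-sample one base-layer CU description to the enhancement resolution."""
    return {
        "depth": max(base_cu["depth"] - 1, 0),    # quad-tree: depth decreased by one
        "pu_partition": base_cu["pu_partition"],  # same partitioning mode/shape
        "mode": base_cu["mode"],                  # INTRA / INTER kept unchanged
        "mv": [(mvx * scale, mvy * scale)         # coordinates multiplied by ratio
               for (mvx, mvy) in base_cu["mv"]],
    }

# Example: a depth-2 INTER CU with a 2NxN partition and one motion vector.
base_cu = {"depth": 2, "pu_partition": "2NxN", "mode": "INTER", "mv": [(3, -1)]}
print(upsample_prediction_info(base_cu))  # depth 1, same partition, mv (6, -2)
```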
As a result of the prediction information up-sampling process, some prediction information is available on the encoder and on the decoder side, and can be used in various inter-layer prediction mechanisms in the enhancement layer.
In the current scalable encoder and decoder architectures, this up-scaled prediction information is used in two ways.
-It is used in the construction of the "Base Mode" prediction picture of current enhancement picture, as already discussed with reference to Fig. 13 and Fig. 16.
-The up-sampled prediction information is also used for the inter-layer prediction of motion vectors in the coding of the enhancement picture. Therefore one additional predictor is used compared to HEVC, in the predictive coding of motion vectors.
Fig. 18 illustrates the construction of a Base Mode prediction picture 1800 according to an embodiment of the invention. This picture is referred to as a Base Mode picture because it is predicted by means of the prediction information issued from the base layer 1801. The figure also indicates the magnification 1802 of the base layer 1801 to the dimensions of an associated enhancement layer. The inputs to this process are the following:
-The lists of reference pictures, e.g. 1803, useful in the temporal prediction of current enhancement picture, i.e. the Base Mode prediction picture 1800.
-The prediction information, e.g. temporal prediction 18A, extracted from the base layer 1801 and re-sampled, e.g. temporal prediction 18B, to the enhancement layer 1802 resolution. This corresponds to the prediction information resulting from the process described in association with Fig. 17.
-The temporal residual data issued from the base layer decoding, and re-sampled to the enhancement layer resolution, e.g. inter-layer temporal residual prediction 18C.
-The base layer reconstructed picture 1804.
The Base Mode picture construction process consists in predicting each coding unit e.g. largest coding unit 1805 of the enhancement picture, conforming to the prediction modes and parameters inherited from the base layer.
It proceeds as follows.
-For each largest coding unit 1805 in current enhancement picture:
  -Obtain the up-sampled coding unit representation issued from the base layer (algorithm of Fig. 17).
  -For each CU contained in current LCU:
    -For each PU in current CU:
      -Predict current PU with its prediction information inherited from the base layer.
The prediction unit prediction step proceeds as follows. In case the corresponding base prediction unit was Intra-coded, e.g. base layer intra coded block 1806, then current prediction unit is predicted by the reconstructed base prediction unit, re-sampled to the enhancement layer resolution 1807. This prediction is associated with an inter-layer spatial prediction 1811. In case of an INTER coded base prediction unit 1808, the corresponding prediction unit in the enhancement layer 1809 is also temporally predicted, by using the motion information 18B inherited from the base layer 18A. This means the reference picture(s) in the enhancement layer that correspond to the same temporal position as the reference picture(s) of the base prediction unit are used. A motion compensation step 18B is applied by applying the motion vector inherited 1810 from the base onto these reference pictures. Finally, the up-sampled temporal residual data 18C of the co-located base prediction unit is applied onto the motion compensated enhancement prediction unit, which provides the predicted prediction unit in its final state.
Once this process has been applied on each prediction unit in the enhancement picture, a full "Base Mode" prediction picture is available.
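The construction just described can be summarized by the following Python sketch. All data structures (the PU dictionaries, the up-sampled pictures and the reference list) are invented for illustration; sub-pel motion compensation, boundary clipping and weighted prediction are deliberately left out.

```python
import numpy as np

def build_base_mode_picture(shape, pus, up_base_pic, up_residual, ref_pics):
    """Predict every PU from the information inherited from the base layer."""
    pic = np.zeros(shape, dtype=np.int32)
    for pu in pus:
        y, x, h, w = pu["rect"]
        if pu["mode"] == "INTRA":
            # Inter-layer spatial prediction (1811): copy the co-located area
            # of the up-sampled reconstructed base picture.
            pic[y:y + h, x:x + w] = up_base_pic[y:y + h, x:x + w]
        else:
            # Motion compensation with the inherited, re-scaled MV (18B),
            # then addition of the up-sampled base temporal residual (18C).
            mvx, mvy = pu["mv"]
            ref = ref_pics[pu["ref_idx"]]
            pred = ref[y + mvy:y + mvy + h, x + mvx:x + mvx + w]
            pic[y:y + h, x:x + w] = pred + up_residual[y:y + h, x:x + w]
    return pic
```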
Fig. 19 illustrates an algorithm according to an embodiment of the invention used to encode an INTER picture. The input to the algorithm comprises the original picture to encode, respectively re-sampled to the spatial resolution of each scalability layer to encode.
The overall algorithm consists of a loop over each scalability layer to encode. The current INTER picture is encoded with each scalability layer being successively or sequentially processed through the algorithm. The layers are indexed 1902. For each scalability layer in succession, the algorithm tests 1903 if current layer corresponds to the base layer, the base layer being indexed as layer 0 (zero). If so, then a standard picture encoding process is applied to the current picture. For the case illustrated in Fig. 19, the base picture is HEVC-encoded 1904.
When current layer is not the base layer (e.g. is a first enhancement layer), the algorithm switches to preparing all the prediction data useful to predict current enhancement picture to code, according to an embodiment of the proposed invention.
This data consists of three main parts:
-The decoded base picture of current picture is obtained 1905 and up-sampled 1906 in the pixel domain towards the spatial resolution of current enhancement layer. This provides one prediction picture, called the "Intra BL" prediction picture.
-All the prediction information contained in the coded base layer is extracted from the base picture 1907, and then is up-sampled 1908 towards current enhancement layer, as previously explained with reference to Fig. 17. Next, this up-sampled prediction information is used in the construction of the "Base Mode" prediction picture 1909 of current enhancement picture, as previously explained with reference to Fig. 18.
-The temporal residual data contained in the base picture is extracted from the base layer 1910, and then is up-sampled 1911 towards the spatial resolution of current enhancement layer.
Next, the up-sampled prediction info, together with this up-sampled temporal residual data, are used in the construction of the "Base Mode" prediction picture of current enhancement picture, as previously explained with reference to Fig. 18.
The next step of the algorithm consists in searching the best way to predict current enhancement picture, given the available set of prediction data previously prepared. The algorithm performs best prediction search 1912 based on the obtained three sets of prediction pictures: temporal reference(s), Intra BL, Base Mode. This prediction search step computes the following data.
-For each Largest Coding Unit (LCU) in current picture, the search step decides how to divide the LCU into smaller Coding Units (CUs).
-For each Coding Unit, the search step decides how to partition the coding unit into one or more prediction unit(s), and how to predict each prediction unit.
-The prediction parameters decided for each prediction unit include the prediction mode (INTRA or INTER) together with the prediction parameters associated to this prediction mode. With respect to INTER prediction, the same temporal prediction system as in HEVC is employed. Therefore, the prediction parameters include the indexes of the reference picture(s) used to predict current prediction unit, as well as the associated motion vector(s). Concerning INTRA prediction, two types of INTRA prediction are allowed in an embodiment of the proposed invention: Intra BL prediction and Base Mode prediction. The best INTRA prediction between these two modes is determined.
The best prediction for current prediction unit, among the best INTER prediction for that prediction unit and the best INTRA prediction for that prediction unit, is determined.
Next, for a candidate coding unit, the best candidate prediction unit for that coding unit, is selected. Finally the best coding unit splitting configuration (see Fig. 14) for the considered LCU is selected.
Note that the prediction modes that are evaluated in this prediction search step are such that no texture prediction from one block to another block in the same picture is involved. Therefore a whole prediction picture can be computed before the texture coding process starts processing current picture.
Once the prediction search for current picture is done, a set of prediction information is available for current picture. This prediction information is able to fully describe how current enhancement picture is predicted. Therefore, this prediction information is encoded 1913 and written to the output enhancement bit-stream, in order to indicate to the decoder how to predict current picture.
In addition to the prediction information, the prediction step also provides a full prediction picture for current picture. In the next step of the algorithm of Fig. 19, the so-obtained prediction picture is then subtracted from the original picture to code in current enhancement layer i.e. the residual picture for the current picture is obtained 1914.
The next step then consists in applying the texture coding of Fig. 5 to the residual picture 1915 issued from the previous step. The texture coding process is then performed as described previously with reference to Fig. 5.
Once current picture is encoded at current scalability level, the algorithm checks whether current layer is the last scalability layer to encode 1916. If yes, then the algorithm ends 1917. If no, the algorithm moves on to process the next scalability layer, i.e. it increments current layer index 1918 and returns to the testing step 1903 described above.
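Gathering the steps of Fig. 19, the layer loop can be sketched as follows. The codec object and every method on it are hypothetical stand-ins for the numbered steps; this is an orchestration outline under those assumptions, not a reference implementation.

```python
def encode_inter_picture(layer_pictures, codec):
    """Loop of Fig. 19 over the scalability layers of one INTER picture."""
    streams, base = [], None
    for idx, picture in enumerate(layer_pictures):               # layer index 1902
        if idx == 0:                                             # test 1903
            base = codec.hevc_encode(picture)                    # step 1904
            streams.append(base.bitstream)
            continue
        intra_bl = codec.upsample_texture(base.recon)            # steps 1905-1906
        info_up = codec.upsample_prediction_info(base.info)      # steps 1907-1908
        resid_up = codec.upsample_residual(base.residual)        # steps 1910-1911
        base_mode = codec.build_base_mode(info_up, resid_up)     # step 1909
        pred_info, pred_pic = codec.best_prediction(             # search 1912
            picture, intra_bl, base_mode, codec.ref_pics(idx))
        streams.append(codec.encode_prediction_info(pred_info))  # step 1913
        residual = picture - pred_pic                            # step 1914
        streams.append(codec.texture_encode(residual))           # step 1915
    return streams                                               # end test 1916-1918
```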
Fig. 20 depicts the overall algorithm used to decode an INTER picture, according to an embodiment of the proposed invention. The input to this algorithm consists of the compressed representations of the input picture, comprising a plurality of scalability layers to be decoded, indexed as 2002.
Similar to the coding algorithm of Fig. 19, this decoding algorithm comprises a main loop over the scalability layers that constitute the scalable input bit-stream to process.
Each layer is considered sequentially, and the following is applied. The algorithm tests 2003 if a current layer corresponds to the lowest layer of the stream, the base layer normally being assigned the value 0 (zero). If so, then a standard, e.g. HEVC, decoding process is applied 2004 to current picture.
If not, then the algorithm prepares all the prediction data useful to construct the prediction picture of current enhancement picture. Thus the same base layer data extraction and processing as on the encoder side is performed (1905 to 1911). This leads to restoration of the set of three prediction data schemes used to construct the prediction picture of current enhancement picture. This is facilitated by computation of the same Intra BL and Base Mode prediction pictures.
The next step of the algorithm comprises decoding the prediction information for the current picture from the input bit-stream 2005. This provides information on how to construct the current prediction picture 2006, given the Intra BL, Base Mode and temporal reference pictures available.
The decoded prediction data thus indicates how each largest coding unit (LCU) is decomposed into coding units (CU) and prediction units (PU), and how each prediction unit is predicted. The decoder is then able to construct the full prediction picture of current enhancement picture being decoded. At this stage of the decoder, exactly the same prediction picture as on the encoder side is available.
The next step comprises the texture decoding of the input coded texture data, giving the current residual picture 2007, for the entire enhancement picture. The same decoding algorithm is applied as described with reference to Fig. 6.
Once the decoded residual picture is available, the obtained residual picture is added to the prediction picture previously computed 2008, which provides the reconstructed version of current enhancement picture.
Additionally, it is possible to follow this with post-processing of current picture (not shown), e.g. a deblocking filter, sample adaptive offset and adaptive loop filtering.
Finally, the algorithm tests if current scalability layer is the last layer to decode 2009. If so, the algorithm of Fig. 20 ends 2010. If not, the algorithm increments the layer 2011 and returns to the testing step 2003, which checks if the current layer is the base layer.
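For symmetry, the decoder loop of Fig. 20 can be outlined in the same hypothetical style; the codec object again bundles invented per-step operations, and the base-layer preparation mirrors steps 1905 to 1911 of the encoder.

```python
def decode_inter_picture(layer_streams, codec):
    """Loop of Fig. 20 over the scalability layers of one INTER picture."""
    pictures, base = [], None
    for idx, stream in enumerate(layer_streams):              # layer index 2002
        if idx == 0:                                          # test 2003
            base = codec.hevc_decode(stream)                  # step 2004
            pictures.append(base.recon)
            continue
        intra_bl = codec.upsample_texture(base.recon)         # as encoder 1905-1911
        info_up = codec.upsample_prediction_info(base.info)
        resid_up = codec.upsample_residual(base.residual)
        base_mode = codec.build_base_mode(info_up, resid_up)
        pred_info = codec.decode_prediction_info(stream)      # step 2005
        pred_pic = codec.build_prediction(pred_info, intra_bl,
                                          base_mode, codec.ref_pics(idx))  # 2006
        residual = codec.texture_decode(stream)               # step 2007
        pictures.append(pred_pic + residual)                  # addition 2008
    return pictures                                           # end test 2009-2011
```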

Claims (31)

1. A method for encoding at least one image of scalable video data comprising pixels, the method comprising the steps of:
-subtracting a prediction image from an original image to obtain a residual image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image;
-transforming pixel values of the residual image to obtain transformed coefficients;
-quantizing at least one of the transformed coefficients to obtain quantized symbols;
-encoding the quantized symbols into encoded data.

2. A method as claimed in claim 1, wherein the obtained transformed coefficients each have a coefficient type.

3. A method as claimed in claim 1 or 2, wherein the step of transforming further comprises:
-determining transform units corresponding to transforms to be applied to pixel values of the residual image to obtain the transformed coefficients.

4. A method as claimed in claim 3, wherein the determination of transform units takes into account the pixel values of the residual image.

5. A method as claimed in claim 3 or 4, wherein the transform units are embedded in the pixel area corresponding to the prediction units.

6. A method as claimed in any of the preceding claims, wherein the prediction mode using information from a corresponding base layer image comprises intra base layer prediction and/or base mode prediction.

7. A method as claimed in any one of the preceding claims, further comprising the step of:
-signaling the prediction mode used.

8. A method as claimed in claim 6 or 7, wherein the intra base layer prediction mode comprises predicting a prediction unit of an enhancement image with a spatially corresponding unit in an up-sampled corresponding base layer image.

9. A method as claimed in claim 6 or 7, wherein base mode prediction comprises predicting an enhancement prediction unit from a spatially corresponding unit of a base mode prediction image.

10. A method as claimed in claim 6 or 7, wherein the intra base layer prediction and/or base mode prediction are substituted for spatial prediction modes comprised in the HEVC standard.

11. A method as claimed in any of claims 6 to 10, comprising the further steps of:
-inserting the prediction image resulting from intra base layer prediction and/or base mode prediction into a list of reference pictures;
-using the prediction image resulting from intra base layer prediction and/or base mode prediction in the temporal prediction of an enhancement picture.

12. A method as claimed in claim 11, comprising the further step of:
-signaling the use of intra base layer prediction and/or base mode prediction by means of a dedicated reference picture index.

13. A method as claimed in any of the preceding claims, wherein the step of quantizing further comprises the step of:
-quantizing at least one of the transformed coefficients based on a quantizer that depends on a given probabilistic model.

14. A method as claimed in any of the preceding claims, wherein the quantizing step is performed by a quantization method comprising: determining, for at least one coefficient type, a probabilistic model based on statistical information computed on the residual image for that coefficient type; determining, for the at least one coefficient type, an average distortion target for the encoding of coefficients with the at least one coefficient type; and selecting, for the at least one coefficient type, a quantizer from a pre-defined set of quantizers.

15. A method as claimed in any of claims 1 to 13, wherein the quantizing step is performed by a quantization method comprising: determining, for at least one coefficient type, a probabilistic model based on statistical information computed on the residual image for that coefficient type; determining, for the at least one coefficient type, an average distortion target for the encoding of coefficients with the at least one coefficient type; and selecting, for the at least one coefficient type, a quantizer from a pre-defined set of quantizers according to the determined probabilistic model and the determined distortion target.

16. A method as claimed in claim 14 or 15, wherein symbols issued from the quantizing step undergo an entropy coding step, which takes into account pre-computed probability values associated to each interval of a selected rate-distortion optimal quantizer.

17. A method of decoding at least one image comprising pixels, the method comprising the steps of:
-adding a prediction image to a decoded enhancement layer residual image to obtain an image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image (Intra BL, Base Mode);
-said enhancement layer residual image comprising quantized symbols, the decoding of the image comprising inverse quantizing these quantized symbols to obtain transformed coefficients.

18. A method as claimed in claim 17, wherein the prediction modes using information from a corresponding base layer encoded picture comprise intra base layer prediction and/or base mode prediction.

19. A method as claimed in claim 17 or 18, wherein the transformed coefficients are derived from transform units corresponding to transforms applied to pixel values of a residual image.

20. A method as claimed in claim 17 or 18, wherein the determination of transform units takes into account the pixel values of the residual image.

21. A method as claimed in any of claims 17 to 20, wherein the transform units are embedded in the pixel area corresponding to the prediction units.

22. A method as claimed in any of claims 17 to 21, wherein:
-each transformed coefficient comprises a coefficient type; and
-the inverse quantization step comprises determining a set of quantizers, comprising selection of an optimal quantizer for at least one coefficient type by use of a probabilistic model, parameters representative of said probabilistic model for at least one coefficient type being obtained before applying the decoding process.

23. A method as claimed in any of claims 17 to 22, wherein the quantized symbols have undergone an entropy decoding step, which takes into account pre-computed probability values associated to each interval of a selected rate-distortion optimal quantizer.

24. A computer program comprising instructions for carrying out each step of the method according to any one of the preceding claims when the program is loaded and executed by a programmable means.

25. An information storage means readable by a computer or a microprocessor storing instructions of a computer program, wherein it makes it possible to implement a method according to any of claims 1 to 23.

26. A device for encoding at least one image comprising pixels, the device comprising:
-means for subtracting a prediction image from an original image to obtain a residual image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image;
-means for transforming pixel values of the residual image to obtain transformed coefficients;
-means for quantizing at least one of the transformed coefficients to obtain quantized symbols;
-means for encoding the quantized symbols into encoded data.

27. A device for decoding at least one image comprising pixels, the device comprising:
-means for adding a prediction image to a decoded enhancement layer residual image to obtain an image, said prediction image comprising a set of prediction units, each prediction unit having a prediction mode selected, according to a given criterion, from a plurality of prediction modes including a motion estimation prediction mode and at least one prediction mode using information from a corresponding base layer image;
-said enhancement layer residual image comprising quantized symbols, the decoding of this image comprising inverse quantizing these quantized symbols.

28. A method for encoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 3, 5, 9, 13, 15, 18 or 19 of the accompanying drawings.

29. A method for decoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 4, 6, 16 or 20 of the accompanying drawings.

30. A device for encoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 3, 5, 9, 13, 15, 18 or 19 of the accompanying drawings.

31. A device for decoding at least one image comprising pixels, substantially as hereinbefore described with reference to, and shown in, Figs. 4, 6, 16 or 20 of the accompanying drawings.
GB1207316.9A 2012-04-27 2012-04-27 Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units Withdrawn GB2501518A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1207316.9A GB2501518A (en) 2012-04-27 2012-04-27 Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units
PCT/EP2013/058350 WO2013160277A1 (en) 2012-04-27 2013-04-23 A method, device, computer program, and information storage means for encoding and decoding an image comprising blocks of pixels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1207316.9A GB2501518A (en) 2012-04-27 2012-04-27 Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units

Publications (2)

Publication Number Publication Date
GB201207316D0 GB201207316D0 (en) 2012-06-13
GB2501518A true GB2501518A (en) 2013-10-30

Family

ID=46330394

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1207316.9A Withdrawn GB2501518A (en) 2012-04-27 2012-04-27 Encoding and Decoding an Image Comprising Blocks of Pixels Including Selecting the Prediction Mode of Prediction Units

Country Status (2)

Country Link
GB (1) GB2501518A (en)
WO (1) WO2013160277A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150056679A (en) * 2013-11-15 2015-05-27 광운대학교 산학협력단 Apparatus and method for construction of inter-layer reference picture in multi-layer video coding
FI20175006A1 (en) * 2017-01-03 2019-02-15 Nokia Technologies Oy Video and image coding with wide-angle intra prediction
GB201817784D0 (en) * 2018-10-31 2018-12-19 V Nova Int Ltd Methods,apparatuses, computer programs and computer-readable media
CN116456082A (en) 2018-12-25 2023-07-18 Oppo广东移动通信有限公司 Coding prediction method, device and computer storage medium
CN111416977B (en) * 2019-01-07 2024-02-09 浙江大学 Video encoder, video decoder and corresponding methods


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060165171A1 (en) * 2005-01-25 2006-07-27 Samsung Electronics Co., Ltd. Method of effectively predicting multi-layer based video frame, and video coding method and apparatus using the same

Also Published As

Publication number Publication date
GB201207316D0 (en) 2012-06-13
WO2013160277A1 (en) 2013-10-31

Similar Documents

Publication Publication Date Title
RU2701455C2 (en) Encoding and decoding of video
KR101326610B1 (en) Method and apparatus for macroblock adaptive inter-layer intra texture prediction
KR100772882B1 (en) Deblocking filtering method considering intra BL mode, and video encoder/decoder based on multi-layer using the method
TWI634783B (en) Methods and apparatuses of candidate set determination for binary-tree splitting blocks
US8818114B2 (en) Method and apparatus for image encoding/decoding
KR100679035B1 (en) Deblocking filtering method considering intra BL mode, and video encoder/decoder based on multi-layer using the method
KR101608426B1 (en) Method for predictive intra coding/decoding for video and apparatus for same
US20120250769A1 (en) Hybrid video coding
KR20100081974A (en) A method and an apparatus for processing a video signal
JP5194119B2 (en) Image processing method and corresponding electronic device
CN107277512B (en) Method and apparatus for spatially varying residual coding, decoding
US20140064373A1 (en) Method and device for processing prediction information for encoding or decoding at least part of an image
US20140198842A1 (en) Hybrid Encoding and Decoding Methods for Single and Multiple Layered Video Coding Systems
KR20140005296A (en) Method and apparatus of scalable video coding
US11206418B2 (en) Method of image encoding and facility for the implementation of the method
US20130215960A1 (en) Device and method for competition-based intra prediction encoding/decoding using multiple prediction filters
US9532070B2 (en) Method and device for processing a video sequence
WO2014106608A1 (en) Encoding and decoding methods and devices, and corresponding computer programs and computer readable media
GB2509704A (en) Processing prediction information for encoding or decoding of an enhancement layer of video data
KR101375667B1 (en) Method and apparatus for Video encoding and decoding
WO2013160277A1 (en) A method, device, computer program, and information storage means for encoding and decoding an image comprising blocks of pixels
GB2492396A (en) Decoding a Scalable Video Bit-Stream
US20110206116A1 (en) Method of processing a video sequence and associated device
GB2506592A (en) Motion Vector Prediction in Scalable Video Encoder and Decoder
RU2575411C2 (en) Method and apparatus for scalable video coding

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)