GB2509702A - Scalable Image Encoding Including Inter-Layer Prediction - Google Patents


Info

Publication number
GB2509702A
GB2509702A (application GB201300144A)
Authority
GB
United Kingdom
Prior art keywords
mode
predictor
modes
inter
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB201300144A
Other versions
GB201300144D0 (en)
GB2509702B (en)
Inventor
Christophe Gisquet
Patrice Onno
Guillaume Laroche
Edouard Francois
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB201300144A priority Critical patent/GB2509702B/en
Publication of GB201300144D0 publication Critical patent/GB201300144D0/en
Publication of GB2509702A publication Critical patent/GB2509702A/en
Application granted granted Critical
Publication of GB2509702B publication Critical patent/GB2509702B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/33 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103 Selection of coding mode or of prediction mode
    • H04N19/109 Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/187 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a scalable video layer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 Embedding additional information in the video signal during the compression process
    • H04N19/463 Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/523 Motion estimation or motion compensation with sub-pixel accuracy
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding

Abstract

A scalable encoding scheme has at least one enhancement layer and at least one reference layer. For a unit of data of an enhancement layer to be encoded, a plurality of inter coding modes in which a residual is obtained from a temporal predictor are evaluated. In a first of these modes, the residual is predicted using information from a reference (base) layer, constituting an Inter-Layer mode. Fewer candidate predictor evaluations are performed for the first (inter layer) mode than for the other inter coding modes. Alternatively, a selection is made between a plurality of coding modes, including at least one basic coding mode in which a residual is obtained from a predictor obtained in the enhancement layer, and at least one first interlayer encoding mode in which said residual is predicted using information from the reference layer, wherein evaluation of said first coding mode is adaptively controlled according to the results of the basic mode evaluation. Coding parameters such as block partitioning, prediction mode, coding sub-mode, motion values, direction, and predictor values are determined for the basic mode(s) and used to adaptively control evaluation of the interlayer mode(s).

Description

VIDEO CODING
The present invention relates to scalable video coding and decoding. More precisely, the present invention is directed to the generation and/or evaluation of Inter-Layer prediction modes. In particular embodiments, the invention reduces the complexity of such modes known as Inter Diff, Intra Diff or Generalised Inter-Layer Prediction (GRILP).
In scalable video codecs, there typically exist at least 2 representations of the same image, existing in what are commonly called layers. The first is the Base Layer, which represents the lowest quality of the video and is used as a reference layer to improve the coding efficiency for the other layers, called enhancement layers. A family of Inter-Layer prediction modes can perform temporal prediction by adding the temporal residue from the base layer to the temporal prediction generated in the enhancement layer.
The temporal prediction is often performed by defining at least one Motion Vector, which is the rational offset between an area to predict and a corresponding area in a reference image. The above-mentioned modes build on this basic premise by using the Motion Vector(s) in the Enhancement Layer to generate a first part of the prediction, and using its equivalent in the Base Layer to generate a second part.
This implies performing at least two Motion Compensations per area, generally twice as many as classical temporal prediction methods. This is associated with a significant increase in complexity, particularly on the encoder side, since several such Motion Compensations are actually performed per area.
According to a first aspect of the invention there is provided a method for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the method comprising:
for a unit of data of an enhancement layer to be encoded, evaluating a plurality of inter coding modes in which a residual is obtained from a temporal predictor, including at least a first encoding mode in which said residual is predicted using information from a reference layer, wherein fewer candidate predictor evaluations are performed for said first encoding mode than for other inter coding modes.
Said first encoding mode, which uses information from a reference layer (and typically also information from the enhancement layer), is therefore a so-called Inter-Layer mode, and the information used is typically temporal residual information from the reference layer. By using fewer predictor evaluations for the first, inter-layer, coding mode than for other inter modes, which do not use inter-layer residual prediction but which still require evaluation of candidate predictors in an equivalent fashion, it has been found that the evaluation of such a mode can be simplified, and therefore made less computationally expensive.
In one embodiment, candidate predictor evaluations for the one or more other inter coding modes are reused for evaluations of said first coding mode.
In some embodiments therefore, candidate predictors are considered in turn, and for each candidate said first mode and said other modes are evaluated together, thereby sharing common data across multiple separate mode evaluations. In an example where a rate-distortion cost is calculated, the distortion resulting from use of a particular motion and frame reference can be used for both the normal inter mode and the inter-layer mode evaluation.
Alternatives to rate-distortion calculations are possible, including the sum of absolute differences between a predictor block and the original block, and an estimate of the rate of the motion coding information, for example; different types of shared or reused data are therefore contemplated accordingly.
In one embodiment, candidate predictor evaluations are performed in a series of sub-processes of increasing refinement, and the resultant predictor output from at least one sub-process from evaluation of one of the other inter modes is used as the corresponding predictor for said first mode.
In a variation on such embodiments, in which candidate predictor evaluations are again performed in a series of sub-processes of increasing refinement, the resultant predictor output from at least one sub-process from evaluation of one of the other inter modes is used as an initial predictor, to define a block matching search window for said first mode.
For example, candidate predictor evaluations may involve separate searches at integer-pixel and sub-pixel levels, or equivalently an integer-pixel search and a sub-pixel refinement. Additionally or alternatively, a positional refinement may be performed for bi-directional prediction, on top of one or more uni-directional search or refinement steps.
According to embodiments then, the integer-pixel search result, or the sub-pixel refinement (which may come from one or more uni-directional determinations or reference lists), returned from the evaluation of a 'normal' inter mode is also used as the corresponding integer or sub-pixel result for the first, inter-layer, mode.
The result may be used directly, or it may serve as an initial value or estimate, about which further searching can be performed.
Embodiments have been proposed which make use of data or calculations generated in the course of evaluating 'normal' inter modes, to reduce the complexity of evaluation of an inter-layer mode.
In other embodiments a block matching search window is used for determining candidate predictors for the first mode which is smaller than the equivalent search window used for other inter modes. The actual size of the search window used may be defined as a fraction of the equivalent search window for the other inter modes (e.g. 1/4 or 1/16th of the size of the window for the corresponding other inter modes), or may be defined by an absolute pixel size (such as 1 or 2 pixels, for example).
In some embodiments, the reduced search window size can be used in conjunction with an initial value or estimate, derived according to embodiments described above, allowing positioning of the search window, as sketched below.
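Purely as an illustration, such a reduced and re-centred window could be derived as follows; the function name and the default fraction below are our own choices, not terms from the patent:

```python
# Illustrative sketch only: derive a reduced block-matching window for the
# inter-layer mode, centred on the best 'normal' inter motion vector.
def interlayer_search_window(inter_best_mv, inter_window_radius,
                             fraction=0.25, absolute=None):
    if absolute is not None:
        radius = absolute                       # e.g. 1 or 2 pixels
    else:
        radius = max(1, int(inter_window_radius * fraction))  # e.g. 1/4 window
    cx, cy = inter_best_mv                      # centre on the inter result
    return (cx - radius, cy - radius, cx + radius, cy + radius)

# Example: a 16-pixel inter window reduced to a 4-pixel inter-layer window
# around the inter result (3, -2).
print(interlayer_search_window((3, -2), 16))    # (-1, -6, 7, 2)
```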
In addition to using results of evaluation of normal inter modes directly as part of the evaluation of an inter-layer mode, such results can additionally or alternatively be used adaptively to control the evaluation of an inter-layer mode.
Thus embodiments may be provided in which candidate predictor evaluations performed for said first mode are varied according to one or more parameters determined from the evaluation of other modes.
The parameters may include whether or not an inter skip submode has been selected from said evaluation of said other modes and/or the type of partition selected from evaluation of said other inter modes.
The concept of adaptively controlling evaluation of interlayer modes according to the results of basic mode evaluation may also be applied to evaluation of basic (i.e. in-layer) spatial, as well as temporal, coding modes. Accordingly, in a second aspect the invention provides a method for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the method comprising, for a unit of data of the enhancement layer to be encoded, selecting between a plurality of coding modes, including at least one basic coding mode in which a residual is obtained from a predictor obtained in the enhancement layer, and at least one first interlayer encoding mode in which said residual is predicted using information from the reference layer, wherein intermediate coding information is determined as a result of evaluating candidate predictors for said basic mode, and wherein evaluation of said first coding mode is adaptively controlled according to said determined intermediate coding information.
The intermediate coding information determined for the basic mode or modes may include one or more of block partitioning, prediction mode, coding sub-mode, motion values, direction, and predictor values, as well as the rate-distortion performance of a predictor and the properties of the residual (if any), for example.
In a case where the coding scheme makes provision for a plurality of possible interlayer modes, only a subset of said interlayer modes may be evaluated according to said determined parameters. In one embodiment, no interlayer modes are evaluated if a skip sub-mode is determined on evaluation of said basic mode.
In certain embodiments, when the skip mode provides a distortion that is below a threshold, then other modes are not evaluated, and the determination of a skip mode can in itself be considered to provide the intermediate coding information, even though a residual has not strictly been calculated. In other embodiments, "basic" inter modes with residual are also subsequently tested and compared to the skip. Indeed a basic inter mode could provide a better rate-distortion trade-off than a skip, even if it is associated with a residual and a motion vector.
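A minimal sketch of this gating logic follows, assuming a dictionary-like summary of the basic mode evaluation; the field and function names are illustrative, not taken from the patent:

```python
# Hedged sketch: skip all interlayer mode evaluations when the basic
# evaluation selected an "inter skip" sub-mode with low enough distortion.
def select_interlayer_modes(basic_result, interlayer_modes,
                            skip_distortion_threshold):
    if (basic_result["sub_mode"] == "skip"
            and basic_result["distortion"] <= skip_distortion_threshold):
        return []                       # no interlayer mode is evaluated
    return list(interlayer_modes)       # otherwise keep the full candidate set

# Example usage with illustrative values.
result = {"sub_mode": "skip", "distortion": 12.0}
print(select_interlayer_modes(result, ["GRILP", "InterDiff"], 20.0))  # []
```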
The invention also provides a computer program and a computer program product for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein, and a computer readable medium having stored thereon a program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The invention extends to methods, apparatus and/or use substantially as herein described with reference to the accompanying drawings. Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, features of method aspects may be applied to apparatus aspects, and vice versa. Furthermore, features implemented in hardware may generally be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.
Preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings in which:
Figure 1 illustrates an example of a motion estimation algorithm in an encoder;
Figure 2 illustrates the architecture of a video encoder;
Figure 3 illustrates the architecture of a video decoder;
Figure 4 illustrates the architecture of a scalable video encoder;
Figure 5 illustrates the architecture of a scalable video decoder;
Figure 6 shows a flow chart for GRILP motion compensation;
Figure 7 illustrates a first way of implementing GRILP mode;
Figure 8 illustrates an alternative way of implementing GRILP mode;
Figure 9 illustrates a way of implementing Inter Diff mode;
Figure 10 illustrates motion estimation for the inter-layer prediction modes;
Figure 11 illustrates loop processing for each coding parameter in turn;
Figure 12 illustrates evaluation of coding modes in a serial dependent way;
Figure 13 illustrates intra coding.
A known video codec exploits both spatial and temporal correlations between pixels thanks to the Intra and Inter modes. Of particular interest here are the Inter modes, which exploit temporal correlation between pixels of the current frame and previously encoded/decoded frames; however, Intra modes will also be considered.
In typical coding schemes such as HEVC or AVC, the Inter prediction can be uni-directional or bi-directional. The uni-directional case uses one predictor block to predict the current block. This predictor block is defined by a list index, a reference frame index and a motion vector. The list index corresponds to a list of reference frames. We can consider that two lists are used: L0 and L1. A list contains at least one reference frame, and a reference can be included in both lists. The motion vector typically has two components: horizontal and vertical.
This corresponds to the spatial displacement, in terms of pixels, between the current block and the temporal predictor block in the reference frame. So, the block predictor for the uni-directional prediction is the block from the reference frame (ref index) of the list, pointed to by the motion vector.
For Bi-directional Inter prediction, two block predictors are considered, one for each list (L0 and L1). Consequently, 2 reference frame indexes are considered, as well as 2 motion vectors. The Inter block predictor for bi-prediction is a weighted average, pixel by pixel, of the two blocks pointed to by these two motion vectors. The motion information dedicated to the Inter block predictor can be defined by the following parameters:
* One direction type: uni or bi;
* One list (uni-direction) or two lists (bi-direction): L0, L1, or (L0, L1);
* One (uni-direction) or two (bi-direction) reference frame indexes: RefL0, RefL1, or (RefL0, RefL1);
* One (uni-direction) or two (bi-direction) motion vectors, each motion vector having two components (horizontal mvx and vertical mvy).
It is noted that the bi-directional Inter predictor could be used for the B slice type.
The Inter prediction in B slices can be uni- or bi-directional. For P slices, the Inter prediction is only uni-directional. Both P and B slices, and both uni- and bi-directional Inter predictions, are considered here in relation to the present invention.
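By way of illustration, the motion information enumerated above could be gathered in a structure such as the following; this is a sketch with our own names, not terminology mandated by the patent:

```python
# Illustrative container for the Inter motion information listed above.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class InterMotionInfo:
    direction: str                                    # "uni" or "bi"
    lists: Tuple[str, ...]                            # ("L0",), ("L1",) or ("L0", "L1")
    ref_indexes: Tuple[int, ...]                      # RefL0 and/or RefL1
    motion_vectors: Tuple[Tuple[float, float], ...]   # (mvx, mvy) per list

# A bi-directional predictor: one reference and one motion vector per list.
bi_pred = InterMotionInfo(
    direction="bi",
    lists=("L0", "L1"),
    ref_indexes=(0, 1),
    motion_vectors=((1.5, -0.25), (2.0, 0.75)),       # sub-pixel components
)
```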
Figure 1 illustrates the concept of Motion Estimation and Motion Compensation, used for temporal prediction in video codecs. Let us consider the image 102 to encode, and a reference image 101 (therefore of index RefL0 or RefL1). 2 types of identically-sized blocks can then be defined: the block 105 to encode within image 102, and candidate blocks 103 and 104 in reference image 101, whose relative positions are described by motion vectors 106 and 107 respectively.
Motion Compensation consists in computing block 103 or 104 given motion vector 106 or 107 respectively. Because 106 and 107 can take fractional values, interpolation filters can be used to compute the value of 103 or 104 from actual values of image 101. The result of this operation is called the block predictor. When the Motion Vector does not have integer values, the Motion Compensation is said to be of sub-pixel precision.
Motion Estimation consists, given a means to perform motion compensation, in determining the Motion Vector from block 105 to a candidate block whose corresponding block predictor minimizes a particular criterion. The criterion usually refers to the following information:
* The distortion: a measure of similarity between block 105 and a block predictor; typical metrics for this distortion are the Sum of Absolute Differences between pixels of 105 and their corresponding pixels in the block predictor, or the Sum of Squared Differences;
* The rate: the amount of binary data used to encode the quantized transform coefficients corresponding to the prediction residual and to signal the use of a motion vector such as 106 or 107;
* A combination of the rate and the distortion, called the rate/distortion cost.
The classical minimization problem is thus to minimize distortion under the constraint of rate. Classical methods such as the Lagrangian methods can be used.
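As a concrete sketch (helper names are ours; blocks are assumed to be NumPy arrays), the distortion metrics and the Lagrangian cost J = D + lambda * R mentioned above could be written as:

```python
# Illustrative distortion metrics and Lagrangian rate/distortion cost.
import numpy as np

def sad(block: np.ndarray, predictor: np.ndarray) -> int:
    """Sum of Absolute Differences between a block and its predictor."""
    return int(np.abs(block.astype(np.int64) - predictor.astype(np.int64)).sum())

def ssd(block: np.ndarray, predictor: np.ndarray) -> int:
    """Sum of Squared Differences."""
    d = block.astype(np.int64) - predictor.astype(np.int64)
    return int((d * d).sum())

def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Lagrangian cost J = D + lambda * R, minimized over candidates."""
    return distortion + lam * rate_bits
```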
Motion Estimation is usually controlled by the number of candidates to evaluate, and several algorithms exist to try to reduce the complexity of the search.
As described previously, sub-pixel precision requires computation of the block predictor through interpolation filters, which is a costly process. Motion Estimation can therefore be split into a list of sub-processes, illustrated in the bottom part of Figure 1:
* Unidirectional estimation, split into integer accuracy and sub-pixel accuracy;
* Bidirectional search;
* Motion selection.
Operations 110 to 112 are respectively equivalent to operations 120 to 122, except for the reference list they are applied to, and various parameters that may control the number of candidates to evaluate. We will therefore only describe operations 110 to 112 for L0, operations 120 to 122 being easily deduced.
Firstly, the search window for a given reference picture in list L0 is determined.
Because motion estimation has already been applied to other blocks in the neighbourhood, a given ideal position can be estimated, and a small area around it searched. However, there are restrictions to the area: Motion Vectors are limited both in their length and by the existence of data (as they could point outside of the reference image). Also, depending on various parameters (quantization, position in the list, ...), the area to search can be adapted. Finally, the position evaluation for steps 111/121, 112/122 or 130 has different complexity, and the trade-off between coding efficiency and complexity results in different amounts of search for each type of step. Typically, the window used for 111/121 is bigger than that used for 112/122, which in turn is equal to or bigger than that used for 130. All of this is taken into account during step 110 to determine the areas to be searched in each step.
Step 111 applies some Motion Estimation algorithm known from the prior art, from full search to diamond search, Unsymmetrical-cross Multi-Hexagon-grid Search, Enhanced Predictive Zonal Search and so on, to find the predictor whose Motion Vector coordinates are integer and which minimizes the previously described criterion. A Motion Vector of integer precision is thus determined. Sub-pixel refinement 112 can then occur around that vector, requiring interpolation of the block predictor. In addition, the criterion may differ between 111 and 112. This yields the best position with sub-pixel accuracy. Using the candidate MVs from L0 and L1, a further refinement can occur in the case of bidirectional prediction: having the 2 best MVs for list L0 and list L1 respectively, it is known that the 2 MVs for a bidirectional prediction are usually very close. A set of MVs for bidirectional prediction is thus determined in step 130.
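For illustration only, the integer-pixel stage could be realized as a naive full search over a small window, as below; real encoders replace the exhaustive loop with the faster patterns named above, and the function name is ours:

```python
# Naive integer-pixel full search; SAD is used as the criterion.
import numpy as np

def integer_search(cur, ref, x, y, bs, win):
    """Best integer MV for the bs x bs block at (x, y) of `cur` in `ref`."""
    block = cur[y:y + bs, x:x + bs].astype(np.int64)
    best_mv, best_cost = (0, 0), None
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            rx, ry = x + dx, y + dy
            # reject candidates whose data falls outside the reference image
            if rx < 0 or ry < 0 or ry + bs > ref.shape[0] or rx + bs > ref.shape[1]:
                continue
            cost = int(np.abs(block - ref[ry:ry + bs, rx:rx + bs].astype(np.int64)).sum())
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```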
Finally, the various motion hypotheses from L0, L1 (unidirectional) and L0+L1 (bidirectional) are compared, using again a criterion, and the best is kept as the final motion during step 140. This step can actually contain further processing, which can be (for example in the case of HEVC) the evaluation of particular motion subcases such as the merge or skip modes.
In addition, it is possible that a block can be further split into sub-blocks, those sub-blocks each having their own motion information. For instance, the 2 vertical parts of 108, or the 2 horizontal parts of 109, will be subject to the same processing as described previously.
Figure 2 illustrates a standard video encoding device, of a generic type, conforming to the HEVC or H.264/AVC video compression system. A block diagram of a standard HEVC or H.264/AVC encoder is shown. The input to this non-scalable encoder consists in the original sequence of frame images 201 to compress. The encoder successively performs the following steps to encode a standard video bit-stream. A first picture or frame to be encoded (compressed) is divided into pixel blocks, referred to as coding units in the HEVC standard. The first picture is thus split into blocks or macroblocks 202. Each block first undergoes a motion estimation operation 203, which comprises a search, among the reference pictures stored in a dedicated memory buffer 204, for reference blocks that would provide a good prediction of the block. This motion estimation step provides one or more reference picture indexes which contain the found reference blocks, as well as the corresponding motion vectors. A motion compensation step 205 then applies the estimated motion vectors on the found reference blocks and copies the so-obtained blocks into a temporal prediction picture. Moreover, an Intra prediction step 206 determines the spatial prediction mode that would provide the best performance to predict the current block and encode it in INTRA mode.
Afterwards, a coding mode selection mechanism 207 chooses the coding mode, among the spatial and temporal predictions, which provides the best rate-distortion trade-off in the coding of the current block. The difference between the current block 202 (in its original version) and the so-chosen prediction block (not shown) is calculated. This provides the (temporal or spatial) residual to compress. The residual block then undergoes a transform (DCT) and a quantization 208. Entropy coding 209 of the so-quantized coefficients QTC (and associated motion data MD) is performed. The compressed texture data 200 associated with the coded current block 202 is sent for output.
The current block is then reconstructed by inverse quantization and inverse transform (211). This step is followed (if need be) by a sum between the inverse transformed residual and the prediction block of the current block in order to form the reconstructed block. The reconstructed blocks are added to the buffer in order to form the reconstructed frame. Then this reconstructed frame is filtered.
The current HEVC standard includes 2 post-filterings: the deblocking filter (212) followed by the sample adaptive offset (SAO) (213). The reconstructed frame after these 2 post-filters is stored in a memory buffer 204 (the DPB, Decoded Picture Buffer) so that it is available for use as a reference picture to predict any subsequent pictures to be encoded. Note that the loop filtering can be applied block by block or LCU by LCU in the HEVC standard, but the post-filtered pixels of an LCU are not used as reference pixels for Intra prediction.
Finally, a last entropy coding step is passed the coding mode and, in case of an inter block, the motion data, as well as the quantized DCT coefficients previously calculated. This entropy coder encodes each of these data into their binary form and encapsulates the so-encoded block into a container called a NAL unit (Network Abstraction Layer unit). A NAL unit contains all encoded coding units from a given slice.
A coded HEVC bit-stream consists in a series of NAL units.
Figure 3 provides a block diagram of a standard HEVC or H.264/AVC decoding system. This decoding process of a bit-stream 301 starts by entropy decoding 302 of each block (array of pixels) of each coded picture in the bit-stream. This entropy decoding provides the coding mode, the motion data (reference pictures indexes, motion vectors of INTER coded blocks), residual data and SAO filter parameters. The residual data consists in quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization and inverse transform operations 303.
The decoded residual is then added to the temporal (Inter) (304) or spatial (Intra) (305) prediction block of the current block, to provide the reconstructed block.
The prediction mode information, extracted from the bitstream by the entropy decoding step, indicates whether the current block is Intra or Inter coded (309).
The reconstructed block finally undergoes one or more in-loop post-filtering processes, e.g. deblocking (306) and SAO (310), which aim at reducing the blocking artifacts inherent to any block-based video codec (deblocking) and at improving the quality of the decoded picture. The full post-filtered picture is then stored in the Decoded Picture Buffer (DPB), represented by the frame memory (307), which stores pictures that will serve as references to predict future pictures to decode. The decoded pictures (308) are also ready to be displayed on screen.
Figure 4 illustrates a block diagram of a scalable video encoder, which comprises a straightforward extension of the standard video encoder of Figure 2 towards a scalable video coder. This video encoder may comprise a number of subparts or stages; illustrated here are two subparts or stages A10 and B10 producing data corresponding to a base layer 413 and data corresponding to one enhancement layer 414. Each of the subparts A10 and B10 follows the principles of a standard video encoder such as HEVC or AVC, with the steps of transformation, quantization and entropy coding being applied in two separate paths, one corresponding to each layer.
The first stage B10 aims at encoding the H.264/AVC or HEVC compliant base layer of the output scalable stream, and hence is (or can be) identical to the encoder of Figure 2. Next, the second stage A10 illustrates the coding of an enhancement layer on top of the base layer. This enhancement layer brings a refinement of the spatial resolution to the (down-sampled 417) base layer. As illustrated in Figure 4, the coding scheme of this enhancement layer is similar to that of the base layer, except that for each coding unit of a current picture 401 being compressed or coded, an additional prediction mode can be chosen by the coding mode selection module 425. This new coding mode corresponds to inter-layer prediction 416. Inter-layer prediction 416 consists in using the data coded in a layer lower than the current refinement or enhancement layer as prediction data of the current coding unit. The lower layer used is called the reference layer for the inter-layer prediction of the current enhancement layer. In case the reference layer contains a picture that temporally coincides with the current picture, then it is called the base picture of the current picture. The co-located block (at the same spatial position) of the current coding unit that has been coded in the reference layer can be used as a reference to predict the current coding unit.
More precisely, the prediction data that can be used in the co-located block corresponds to the coding mode, the block partition, the motion data (if present) and the texture data (temporal residual or reconstructed block). In case of a spatial enhancement layer, some up-sampling 418 operations of the texture and prediction data are performed.
Figure 5 presents a block diagram of a scalable decoder 500 which applies to a scalable bit-stream made of two scalability layers, e.g. comprising a base layer and an enhancement layer. This decoding process is thus the reciprocal of the scalable coding process of Figure 4. The scalable stream being decoded 510 is made of one base layer and one spatial enhancement layer on top of the base layer, which are demultiplexed 520 into their respective layers.
The first stage of Figure 5 concerns the base layer decoding process B12. As previously explained for the non-scalable case, this decoding process starts by entropy decoding 530 of each coding unit or block of each coded picture in the base layer. This entropy decoding 530 provides the coding mode, the motion data (reference picture indexes, motion vectors of INTER coded macroblocks) and residual data. This residual data consists of quantized and transformed DCT coefficients. Next, these quantized DCT coefficients undergo inverse quantization and inverse transform operations 531. Motion compensation 532 or Intra prediction 533 data can be added to the inverse quantized residual through operation 534. Deblocking 535 is applied. The so-reconstructed data is then stored in the frame buffer 540.
Next, the decoded motion and temporal residual for INTER blocks, and the reconstructed blocks, are stored into a frame buffer in the first stage of the scalable decoder of Figure 5. Such frames contain the data that can be used as reference data to predict an upper scalability layer.
Next, the second stage of Figure 5 performs the decoding of a spatial enhancement layer A12 on top of the base layer decoded by the first stage. This spatial enhancement layer decoding involves the entropy decoding of the second layer 550, which provides the coding modes, motion information as well as the transformed and quantized residual information of blocks of the second layer.
The next step consists in predicting blocks in the enhancement picture. The choice 551 between different types of block prediction (INTRA, INTER or inter-layer) depends on the prediction mode obtained from the entropy decoding step 550.
Concerning INTRA blocks, their treatment depends on the type of INTRA coding unit.
- In case of an inter-layer predicted INTRA block (Intra-BL coding mode), the result of the entropy decoding 550 undergoes inverse quantization and inverse transform 552, and is then added by operation 553 to the co-located block of the current block in the base picture, in its decoded, post-filtered and up-sampled (in case of spatial scalability) version.
- In case of a non-Intra-BL INTRA block, such a block is fully reconstructed, through inverse quantization and inverse transform to obtain the residual data in the spatial domain, followed by INTRA prediction 554 to obtain the fully reconstructed block.
Concerning INTER blocks, their reconstruction involves their motion-compensated 555 temporal prediction computed from frame memory 517, the residual data decoding and then the addition of their decoded residual information to their temporal predictor. In this INTER block decoding process, inter-layer prediction can be used in two ways. First, the motion vectors associated to the considered block can be decoded in a predictive way, as a refinement of the motion vector of the co-located block in the base picture.
Second, the temporal residual can also be inter-layer predicted from the temporal residual of the co-sited block in the base layer.
As in the base layer, post-filtering 556 can be applied to the images output from 553, and stored in frame memory 560 before being returned by the decoder as fully decoded frames 570. Note that in the particular scalable coding mode targeted by the invention, motion compensation 555 uses data from the Enhancement Layer image buffer 560 and from the Base Layer image buffer 540.
As described in the previous paragraphs, the enhancement layer in scalable video coding can use data from the base layer for classical Intra and Inter coding. The modes which use data from the base layer are known as Inter-layer modes. In the state of the art, several Inter-layer modes, or hybrid Inter-layer and Intra or Inter coding modes, have been defined. The family of Inter-Layer prediction modes which can compete with the modes described above is now described. These modes are sometimes referred to as Generalised Inter Layer Prediction (GRILP).
The decoding of such modes is illustrated by Figure 6. The bit stream comprises the means to locate the predictor and the second order residual. In a first step 6.1, the location of the predictor used for the prediction of the coding unit and the associated residual are obtained from the bit stream. This residual corresponds to the second order residual obtained at encoding. In a step 6.2, the co-located predictor is determined; it is the location in the reference layer of the pixels corresponding to the predictor obtained from the bit stream. In a step 6.3, the co-located residual is determined; it is defined by the difference between the co-located coding unit and the co-located predictor in the reference layer. In a step 6.4, the first order residual block is reconstructed by adding the co-located residual to the residual obtained from the bit stream (the second order residual). Once the first order residual block has been reconstructed, it is used, together with the predictor whose location has been obtained from the bit stream, to reconstruct the coding unit in a step 6.5.
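Schematically (function and argument names are ours; all blocks are assumed already extracted as integer NumPy arrays, with entropy decoding and predictor location abstracted away), steps 6.3 to 6.5 amount to:

```python
# Decoder-side GRILP-style reconstruction following steps 6.3 to 6.5.
import numpy as np

def grilp_reconstruct(predictor_el: np.ndarray,
                      second_order_residual: np.ndarray,
                      colocated_cu_rl: np.ndarray,
                      colocated_predictor_rl: np.ndarray) -> np.ndarray:
    # Step 6.3: co-located residual in the reference layer
    colocated_residual = colocated_cu_rl - colocated_predictor_rl
    # Step 6.4: first order residual = second order residual + co-located residual
    first_order_residual = second_order_residual + colocated_residual
    # Step 6.5: coding unit = enhancement layer predictor + first order residual
    return predictor_el + first_order_residual
```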
The following equation schematically describes the process to generate the EL prediction signal PRED_EL:

PRED_EL = MC1[ REF_EL, MV_EL ] + { UPS[ REC_BL ] - MC2[ UPS[ REF_BL ], MV_EL ] }

In this equation:
* PRED_EL corresponds to the prediction of the EL coding unit being processed;
* REC_BL is the co-located block from the reconstructed BL picture corresponding to the current EL picture;
* MV_EL is the motion vector used for the temporal prediction in the EL;
* REF_EL is the reference EL picture;
* REF_BL is the reference BL picture;
* UPS[x] is the upsampling operator performing the upsampling of samples from picture x; it applies to the BL samples;
* MC1[x, y] is the EL operator performing the motion compensated prediction from the picture x using the motion vector y;
* MC2[x, y] is the BL operator performing the motion compensated prediction from the picture x using the motion vector y.

This is illustrated in Figure 7. Let us consider that the final block in the EL picture is of size H lines x W columns. Its corresponding block in the BL picture is of size h lines x w columns. W/w and H/h correspond to the inter-layer spatial resolution ratios. The block 708 (of size HxW) is obtained by motion compensation MC1 of a block 706 (of size HxW) from the reference EL picture REF_EL 701 using the motion vector MV_EL 707. The block 709 (of size HxW) is obtained by motion compensation MC2 of a block 710 (of size HxW) of the upsampled reference BL picture 702 using the same motion vector MV_EL 707. The block 710 has been derived by upsampling the block 711 (of size hxw) from the BL reference picture REF_BL 703. The block 712 (of size HxW), in the upsampled BL picture 704, is the upsampled version of the block 713 (of size hxw) from the current BL picture REC_BL 705. Samples of block 709 are subtracted from samples of block 712 to generate the second order residual, which is added to the block 708 to generate the final EL prediction block 714.
In a specific implementation, which is advantageous in terms of memory saving, the first order residual block in the reference layer may be computed between reconstructed pictures which are not up-sampled, thus are stored in memory at the spatial resolution of the reference layer. The computation of the first order residual block in the reference layer then includes a down-sampling of the motion vector considered in the enhancement layer, towards the spatial resolution of the reference layer. The motion compensation is then performed at reduced resolution level in the reference layer, which provides a first order residual block predictor at reduced resolution.
The last inter-layer residual prediction step then consists in up-sampling the so-obtained first order residual block predictor, through a bi-linear interpolation filtering for instance. Any spatial interpolation filtering could be considered at this step of the process (examples: 8-tap DCT-IF, 6-tap DCT-IF, 4-tap SVC filter, bi-linear). This last embodiment may lead to slightly reduced coding efficiency in the overall scalable video coding process, but does not need additional reference picture storage compared to standard approaches that do not implement the present embodiment. This corresponds to the following equation:

PRED_EL = MC1[ REF_EL, MV_EL ] + UPS[ REC_BL - MC4[ REF_BL, MV_EL x R ] ]

This equation is referred to as eq1 in the following description, with R related to the ratio between the reference and enhancement layer dimensions, e.g. in case of spatial scalability of ratio 2.0, R = 1/2.
This is illustrated in Figure 8. The block 808 (of size HxW) is obtained by motion compensation MC1 of a block 804 (of size HxW) of the reference EL picture REF_EL 801 using the motion vector MV_EL 806. The block 809 (of size hxw) is obtained by motion compensation MC4 of a block 805 (of size hxw) of the reference BL picture REF_BL 802 using the downsampled motion vector MV_EL 807.
This block 809 is subtracted from the BL block 810 (of size hxw) of the BL current picture REC_BL 803, collocated with the current EL block, to generate the BL residual block 811 (of size hxw). This BL residual block 811 is then upsampled to obtain the upsampled residual block 812 (of size HxW). The upsampled residual block 812 is finally added to the motion compensated block 808 to generate the prediction PRED_EL 813.
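The reduced-resolution variant (eq1) can be sketched as follows; `mc` stands for a motion-compensated block extraction and `ups` for the spatial interpolation filter, both assumed rather than specified here (a nearest-neighbour stand-in is used for brevity, where a real implementation would use e.g. a DCT-IF filter):

```python
# Sketch of eq1: PRED_EL = MC1[REF_EL, MV_EL] + UPS[REC_BL - MC4[REF_BL, MV_EL x R]]
import numpy as np

def ups(block: np.ndarray, factor: int = 2) -> np.ndarray:
    """Nearest-neighbour stand-in for a real interpolation filter."""
    return np.kron(block, np.ones((factor, factor), dtype=block.dtype))

def pred_el_eq1(ref_el, mv_el, rec_bl, ref_bl, mc, R=0.5):
    mv_bl = (mv_el[0] * R, mv_el[1] * R)         # downsampled EL motion vector
    residual_bl = rec_bl - mc(ref_bl, mv_bl)     # BL first order residual (h x w)
    return mc(ref_el, mv_el) + ups(residual_bl)  # MC1 term + upsampled residual
```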
Another alternative to generate the block predictor is to weight each part of the linear combination given in the previous equation. Consequently, the generic equation becomes:

PRED_EL = MC1[ REF_EL, MV_EL ] + a * UPS[ REC_BL - MC4[ REF_BL, MV_EL/2 ] ]

This equation is referred to as eq2 in the following description.
Typical values for a are 0 (no residual), 1 (known from prior art) and 1/2. It is noted that another way to produce candidates is to combine such candidates. For instance, we can create the predictor ( PRED_EL|a=1 + UPS[ REC_BL ] ) / 2, which has the advantage of being computable on the fly, having pre-computed PRED_EL|a=1 and UPS[ REC_BL ]. In such a case we can further rewrite the above formula:

PRED_EL = lambda * MC1[ REF_EL, MV_EL ] + UPS[ beta * REC_BL - gamma * MC4[ REF_BL, MV_EL/2 ] ]

The previous example uses the above formula with values lambda = 1/2, beta = 1, gamma = 1/2.
This equation is referred to as eq2 bis in the following description.
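Under the same assumptions as the previous sketch (and reusing its `ups` helper), the weighted variants eq2 and eq2 bis differ only in where the weights are applied; this is an illustration, not the patent's implementation:

```python
# eq2:     PRED_EL = MC1[...] + a * UPS[REC_BL - MC4[...]]
# eq2 bis: PRED_EL = lam * MC1[...] + UPS[beta * REC_BL - gamma * MC4[...]]
def pred_el_eq2(ref_el, mv_el, rec_bl, ref_bl, mc, a=0.5, R=0.5):
    mv_bl = (mv_el[0] * R, mv_el[1] * R)
    return mc(ref_el, mv_el) + a * ups(rec_bl - mc(ref_bl, mv_bl))

def pred_el_eq2bis(ref_el, mv_el, rec_bl, ref_bl, mc,
                   lam=0.5, beta=1.0, gamma=0.5, R=0.5):
    mv_bl = (mv_el[0] * R, mv_el[1] * R)
    return lam * mc(ref_el, mv_el) + ups(beta * rec_bl - gamma * mc(ref_bl, mv_bl))
```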
It is important to note that, in addition to the upsampling and motion compensation processes mentioned above, some filtering operations may be applied to the intermediate generated blocks. For instance, a filtering operator FILTx (x taking several possible values for different filters) can be applied right after the motion compensation, or right after the upsampling, or right after the second order residual prediction block generation. Some examples are provided in the following equations:

PRED_EL = MC1[ REF_EL, MV_EL ] + { UPS[ REC_BL ] - FILT1( MC2[ UPS[ REF_BL ], MV_EL ] ) }
PRED_EL = UPS[ REC_BL ] + FILT1( MC3[ REF_EL - UPS[ REF_BL ], MV_EL ] )
PRED_EL = MC1[ REF_EL, MV_EL ] + FILT1( UPS[ REC_BL - MC4[ REF_BL, MV_EL/2 ] ] )
PRED_EL = FILT2( MC1[ REF_EL, MV_EL ] ) + { UPS[ REC_BL ] - FILT1( MC2[ UPS[ REF_BL ], MV_EL ] ) }
PRED_EL = FILT2( UPS[ REC_BL ] ) + FILT1( MC3[ REF_EL - UPS[ REF_BL ], MV_EL ] )
PRED_EL = FILT2( MC1[ REF_EL, MV_EL ] ) + FILT1( UPS[ REC_BL - MC4[ REF_BL, MV_EL/2 ] ] )

The different processes involved in the prediction process, that is, upsampling, motion compensation and possibly filtering, are achieved using linear filters applied using convolution operators.
Finally, whether a block uses a normal inter mode or one of the above can be signalled at various levels of the syntax. For instance, in HEVC, this may be signalled at the CU or PU level.
The so-called Inter Diff mode is another example of an inter-layer prediction mode using temporal prediction. The underlying premise here is to predict the residual between the enhancement layer and the base layer. An example is provided with reference to Figure 9.
At the encoder, a decoded image 910 from the base layer is upsampled to generate an upsampled base-layer image 911 so as to match the enhancement layer properties (component sizes and phases). An "original" residual image 912 is then generated by subtracting this image 911 from original image 930. Of course this process can occur on a block basis, so for a given block 931 to encode, the collocated block 920 in image 910 is determined, then upsampled to produce upsampled block 921, which is then subtracted from original block 931 to produce differential block 922. Given samples of dynamics [0, N], the subtraction can produce images whose samples are in the range [-N, N].
However, as 930 and 911 are expected to be rather similar, the samples are more probably centred around 0, and adding N/2 to the residual will generate data in the range [-N/2, 3N/2] that can safely be clipped to [0, N].
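Numerically, with 8-bit samples (N = 255) the centring trick reads as follows; the function name is an illustrative choice of ours:

```python
# Differential image with the N/2 centring trick described above.
import numpy as np

def differential_image(original: np.ndarray, upsampled_bl: np.ndarray,
                       N: int = 255) -> np.ndarray:
    diff = original.astype(np.int32) - upsampled_bl.astype(np.int32)  # in [-N, N]
    # centre around N/2, then clip the (rare) outliers back into [0, N]
    return np.clip(diff + N // 2, 0, N).astype(np.uint8)
```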
For the reference data, given reference images 900 in the base layer and 901 in the enhancement layer, a reference differential image 903 can be generated by first generating upsampled image 902 from image 900, then applying the same subtraction between 902 and 901. The mode then consists in applying the process described in Figure 1 to find the best candidate differential block 930 for differential block 922, yielding motion information 940 (here a simple MV, but the same concept applies to bidirectional prediction). Block 930 can then be added in reverse fashion to 921 to generate the prediction 915 for 922.
This is very similar to GRILP, but with a different trade-off between complexity and memory use. Indeed, Inter Diff mode requires having both the normal and differential reference images available, effectively doubling the memory for reference images, while GRILP performs twice as much motion compensation.
Figure 13 illustrates the use of the reference layer in the context of intra coding.
In intra coding the current image is the only one used for encoding. It is composed of a picture representation 13.5 in the reference layer and a picture representation 13.1 in the enhancement layer. The bottom part of Figure 13 illustrates the basic concept: the predictor is obtained from the left and above border pixels 13.11 of the coding unit 13.12 and mode information, for instance the directions 13.13 or 13.14. Many methods are known in the state of the art, of which the most common modes are:
* DC prediction: the average of the sample values over a neighbourhood is computed, and used as the predictor value for all samples of the target area;
* Angular prediction: a list of directions exists, and each direction determines an interpolation process;
* Planar prediction: rather than simple propagation, figurative pixels are computed from neighbouring sample values.
Consider that the coding unit to be encoded is the block 13.2. The interlayer spatial modes (known under the moniker Intra Diff) work according to the same principle as GRILP or Inter Diff, except that temporal predictors are replaced by spatial predictors. Predictors are blocks of the same size as the coding unit to encode, obtained with reference to a set of neighbouring pixels. The prediction in the enhancement layer, taking into account a spatial GRILP prediction, has determined predictor pixels 13.3 and a prediction mode 13.4. The prediction direction plays the role of the motion vector in inter coding; both constitute means to locate the predictor. The encoding of the reference layer has determined, for the co-located coding unit 13.6, pixel predictors 13.7 and a prediction direction 13.8. The co-located predictor 13.9 is determined in the reference layer with the corresponding prediction direction 13.10. Similarly to inter coding, the prediction direction and the predictor obtained in the different layers may or may not be correlated. For the sake of clarity, Figure 13 illustrates a case where both the predictor and the prediction direction are clearly different.
Similarly to the method described for inter coding, the co-located residual is computed in the reference layer as the difference between the co-located coding unit 13.6 and the predictor obtained from the co-located border pixels 13.9 using the prediction direction 13.4 determined in the enhancement layer. This co-located residual is used as a predictor for the first order residual block obtained in the enhancement layer. This prediction of the first order residual block leads to a second order residual, which is embedded in the stream as the result of the encoding of the coding unit.
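The spatial analogue of the co-located residual computation can be sketched as follows, with `intra_predict` an assumed helper applying a given prediction direction to border pixels (all names are ours, for illustration only):

```python
# Sketch: co-located residual for the spatial (Intra Diff) case. The
# EL-determined direction is applied to the co-located border pixels.
def spatial_colocated_residual(colocated_cu, colocated_border,
                               direction, intra_predict):
    colocated_predictor = intra_predict(colocated_border, direction,
                                        colocated_cu.shape)
    # residual used as predictor for the EL first order residual block
    return colocated_cu - colocated_predictor
```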
A first embodiment of the present invention can be described with reference to Figure 1; however, for clarity's sake, the relevant portion is reproduced as Figure 10. In this first embodiment, the inter-layer modes are evaluated at the same time as the normal inter mode. This may happen during any of the sub-processes:
* The integer position searches 1011 and/or 1021;
* The subpixel position refinement 1012 and/or 1022;
* The subpixel position refinement 1030.
The latter two cases can be more important because the interpolation required is costly, and reusing the resulting interpolated candidate block saves computation. For each of the inter-layer and inter modes considered, a separate best candidate is stored.
Figure 11 describes the first embodiment in greater detail. This process can be inserted in a loop evaluating, for each possible coding parameter, the performance of each coding mode: here the coding modes are the normal (or basic) inter mode and the inter-layer modes. The coding parameters include:
* The target area / partition size: indeed, the area predicted may not be a square block, as shown in AVC or HEVC, where Nx2N, 2NxN, etc. partition types exist (wherein N is a unitary size for a coding unit);
* The reference list and the reference frame: as per a classical video codec;
* The quantization parameter.
The process starts with step 1100 with the initialization to a set of coding parameters. Step 1101 determines the size of the area of candidates (either using systematically the same size, or a size depending on the parameters). A search algorithm is then defined, for instance a full search or a fast search (Enhanced Predictive Zonal Search, hexagon-based, etc.). This has a bearing firstly on step 1102, where the initial position for the candidate is evaluated. Step 1103 is also affected by this: this step determines, based on a criterion linked to the type of search (number of iterations, improvement over previous iteration, etc.), whether there are positions left to evaluate. If this is not the case, the whole evaluation process ends with step 1110, which yields the best interlayer mode and position for the current parameters. It can then be followed by either the final choice or any of the steps in Figure 10.
In the case where the last position has not been reached (negative determination in step 1103), normal inter mode evaluation starts. This is equivalent to computing a predictor as described for classical motion compensation above.
However, because this step is generally also present for the interlayer temporal modes, this result can be saved for later reuse. Therefore, to evaluate those interlayer modes, the only part left is to generate their corresponding residuals (e.g. 812 in Figure 8) in step 1105, which can then be added to the normal inter predictor to generate the actual interlayer mode predictors in step 1106.
Now that all predictors are known, a typical evaluation will continue with step 1107, which evaluates the distortion, or difference between the predictors and the original values of the samples in the target area. This may use a variety of similarity metrics, examples of which include the Sum of Squared Errors (SSE), the Sum of Absolute Differences (SAD), the Sum of Absolute Transformed Differences (SATD) or the Structural Similarity (SSIM). For a full evaluation, the difference between original and predictor is transformed, quantized (which results in quantized coefficients, which will be reused in 1108), unquantized and inverse transformed to generate the real decoded difference (or an equivalent, but that does not impact the invention). In that case, the same type of distortion can be measured, but on the decoded difference instead.
Step 1108 performs the final evaluation of the normal and interlayer modes. In the preferred embodiment, a Lagrangian cost is used to minimize distortion under a constraint of rate. This can be a simple evaluation of the cost of the encoding of the parameters of the predictor (partition size, ...) and its motion, in which case it is similar for all modes at this stage, but will differ for other values of these parameters. Therefore, this calculation can be factorized. For a full estimation, this also involves evaluating the rate used to encode the quantized coefficients found in step 1107.
The position evaluation loop then proceeds to step 1109 which sets the next position to be checked according to the type of search.
It is also important to note that the figurative steps of generating predictors can be understood as applicable to cases where evaluation is made "on the fly", i.e. the predictor does not exist per se, but instead its pixels are iteratively computed and used to generate the distortion/cost values. In that case, a given pixel value of the normal inter mode predictor can be saved to also compute the pixel values of the other interlayer mode predictors.
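Schematically, the sharing described in steps 1104 to 1107 could be expressed as below; the names are ours, `mc` and the residual generators are assumed helpers, and SAD stands in for whichever metric is chosen:

```python
# One motion compensation per candidate position (step 1104) serves both
# the normal inter mode and every interlayer mode layered on top of it.
import numpy as np

def evaluate_candidate(target, ref_el, mv, interlayer_residuals, mc):
    inter_pred = mc(ref_el, mv)                  # computed once, then reused
    def sad(p):
        return int(np.abs(target.astype(np.int64) - p.astype(np.int64)).sum())
    costs = {"inter": sad(inter_pred)}           # normal inter mode (step 1104)
    for name, residual in interlayer_residuals.items():
        # steps 1105/1106: add each interlayer residual to the shared predictor
        costs[name] = sad(inter_pred + residual)
    return costs
```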
In a further embodiment the evaluation of the inter-layer mode is restricted at each stage of the search:
* It can be completely avoided on steps 1011 and 1021, in which case the following steps are initialized with the motion information found for inter on steps 1011 and 1021;
* It can also be avoided on steps 1012 and 1022 by simply reusing the positions found for the inter mode;
* It can also be avoided on step 1030 by simply using the result of 1030 for the inter mode.
In the case of step 1030, as bidirectional prediction is more complex than unidirectional prediction, it may even be optional to use the inter-layer prediction modes in the case of bidirectional prediction. This is the case for a further embodiment, where:
* A high-level syntax element (the profile, or a flag in the Sequence or Picture Parameter Sets, or in the image or slice header) signals whether such modes are active on bidirectional predictions;
* In addition, this high-level syntax element allows the signalling of the inter-layer modes to be adapted by a frame area (for instance PU, CU, LCU) level syntax element: for instance, in an embodiment, writing of the frame area level syntax element is simply skipped when the high-level syntax element indicates no inter-layer prediction.
In the last case, if we imagine the following values for the signalling of the inter-layer mode:
* Bit sequence "0" of length 1 for normal inter mode;
* Bit sequence "10" of length 2 for a first interlayer mode (e.g. GRILP with a=1 in eq2);
* Bit sequence "11" of length 2 for a second interlayer mode (e.g. GRILP with a=1/2 in eq2);
then we can imagine attributing at least 2 contexts for the first bit depending on whether the block is unidirectional, merge or bidirectional. Equivalently, the second bit can have several contexts defined to take this into account.
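The variable-length signalling imagined above can be illustrated in a few lines; this is a toy sketch with our own mode names, and a real codec would arithmetic-code each bit with the contexts just discussed:

```python
# Toy codeword table for the inter-layer mode signalling example above.
MODE_CODEWORDS = {
    "inter": "0",        # normal inter mode, 1 bit
    "grilp_a1": "10",    # first interlayer mode (GRILP, a = 1), 2 bits
    "grilp_a05": "11",   # second interlayer mode (GRILP, a = 1/2), 2 bits
}

def signal_mode(mode: str) -> str:
    """Return the bit string signalling the chosen mode."""
    return MODE_CODEWORDS[mode]
```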
A still further embodiment restricts the evaluation of the inter-layer modes to a specific area around the motion found at either type of search (integer, sub-pixel, or sub-pixel bidirectional):
* The refinement can start after steps 1012 and/or 1022;
* The refinement may only happen with the result of 1030;
* The refinement may happen at steps 1012, 1022 and 1030, with the best candidate then being evaluated.
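For illustration, such a restricted refinement could look as follows; the cost function is abstract, and the default one-pixel radius is an assumption of this sketch.

def refine_around(best_mv, cost_fn, radius=1):
    """Evaluate an inter-layer mode only inside a small window centred
    on the motion vector already found for the normal inter mode."""
    bx, by = best_mv
    best = (cost_fn((bx, by)), (bx, by))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            candidate = (bx + dx, by + dy)
            best = min(best, (cost_fn(candidate), candidate))
    return best  # (cost, motion vector)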
A yet further embodiment applies any of the techniques of the previous embodiments according to motion properties identified during previous evaluations:
* The results of the inter mode for a current set of coding parameters may have produced a good match (distortion below a threshold, size of the data needed to represent the residual below a threshold, or the "inter skip" sub-mode appearing to provide the best rate/distortion cost); in such cases evaluation of the inter-layer mode can be omitted;
* The type of partition evaluated appears to provide a good rate/distortion cost, indicating that testing of further partitioning of the inter-layer mode can be omitted;
* The result of any prediction mode previously evaluated on a partition containing the current one may be taken into account (for instance, the "inter skip" sub-mode may have already been selected for another partitioning of the current CU).
Any of the previous strategies to reduce the number of candidate evaluations for said at least one first mode can then be adaptively applied according to these properties, e.g.:
* The window size for these evaluations can be shortened;
* Whether those evaluations occur starting at 1011/1021, 1012/1022 or 1030 can be decided.
Figure 12 illustrates an embodiment of the invention whereby coding modes are evaluated in a serial, dependent way, and in which the resulting candidate of a given evaluation may be reused for, or may constrain, the evaluations of an interlayer mode.
Step 1200 performs a search and evaluation of a candidate for a normal (non-interlayer) mode, be it temporal or spatial. This may comprise:
* For a temporal mode, any of the steps 1011/1021, 1012/1022 and/or 1030 from Figure 10;
* Evaluating several parameters for the mode at a time:
o For a temporal mode, the various ways of partitioning the target area (i.e. selecting for a 2Nx2N block whether it is 2Nx2N, NxN, etc.);
o For a spatial mode, all the normal spatial modes (angular, DC, planar, ...) are evaluated, and the best normal mode is found.
This step may yield or initialize the common parameters for the mode, such as the area/partition size, the candidate motion (one or more motion vectors), etc., as well as provide information related to the evaluation, such as the value of the cost, the similarity metric, and coding results such as a sub-mode (e.g. inter skip/merge) or indications of the residuals (e.g. the coded block pattern or flags in various standards).
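These outputs can be pictured as a small record handed from step 1200 to the later evaluations; the field names below are assumptions of this sketch, not terms of the disclosure.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CandidateResult:
    """Outcome of step 1200 consumed by the inter-layer evaluations."""
    partition: str                        # e.g. "2Nx2N", "NxN"
    motion: List[Tuple[int, int]]         # one or two motion vectors
    cost: float                           # Lagrangian cost of the best candidate
    distortion: int                       # value of the similarity metric
    sub_mode: str = "inter"               # e.g. "inter", "inter_skip", "merge"
    coded_block_flags: List[bool] = field(default_factory=list)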
Step 1201 then determines the interlayer modes that will be tested. This step takes into account the results of 1200 and, based on such results, can omit all or part of these modes (an example of such pruning is sketched below). Now that the list of interlayer modes to evaluate is available, a first one is selected during step 1202 to initialize the iterative evaluation. Step 1201 may determine that no interlayer mode should be tested, so step 1203 evaluates whether there are modes left to be evaluated. If not, then all modes have been evaluated, and evaluation stops at step 1204, leading (for temporal modes) to any of steps 1012/1022, 1030 or 1040 for instance.
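The pruning of step 1201 might, for instance, look like the following sketch; the thresholds and the decision to keep at most one mode for a good match are illustrative assumptions only.

def interlayer_modes_to_test(sub_mode, distortion, residual_bits, il_modes,
                             dist_threshold=64, bits_threshold=16):
    """Decide which inter-layer modes are worth evaluating, given the
    outcome of the normal mode evaluation of step 1200."""
    if sub_mode == "inter_skip":
        return []                  # skip already wins: test no inter-layer mode
    if distortion < dist_threshold or residual_bits < bits_threshold:
        return list(il_modes[:1])  # already a good match: keep at most one mode
    return list(il_modes)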
Otherwise, the current mode is to be evaluated, and the processing continues with step 1205. This step employs the results of step 1200 to control the evaluations, for example:
* For temporal modes, the search may be altered: given that step 1200 offers the best candidate motion, a smaller search (e.g. a smaller window) can be performed;
* In any case, as this is a refinement, the evaluation can use more precise means (e.g. using SATD instead of SAD as the similarity metric);
* If several "best" candidates are saved, then only their equivalents in the interlayer case need be investigated; this is particularly true of the spatial modes.
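As a sketch of the last point, only the saved "best" candidates are re-scored for the inter-layer mode, with a more precise metric than the one used in the wide search; the helper names are assumptions of this sketch (the metric could be the satd4x4 sketched earlier).

def rescore_saved_candidates(orig, make_predictor, saved_mvs, metric):
    """Re-evaluate only the candidates retained at step 1200 for the
    inter-layer mode, e.g. scoring with SATD instead of SAD."""
    scored = [(metric(orig, make_predictor(mv)), mv) for mv in saved_mvs]
    return min(scored)  # (cost, motion vector) of the winner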
Now that the search parameters are all determined, the search for the current mode can be performed, yielding evaluation results that can be further used, e.g. to determine whether it is a better coding mode, or whether it is worth evaluating the remaining interlayer modes. To this end, step 1207 selects the next interlayer mode to be evaluated. Based on this iteration's result, it may be decided that the best match has already been found, or that, overall, the maximum number of candidate evaluations across all modes has been reached. The process then loops back to step 1203 to check whether there is any interlayer mode left to evaluate.
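The serial, dependent evaluation of Figure 12 can thus be summarised by a driver loop of the following shape; the step numbers appear as comments, while the function interfaces and the evaluation budget are assumptions of this sketch.

def evaluate_unit(evaluate_normal, select_il_modes, evaluate_il, budget=32):
    """Serial mode evaluation in the spirit of Figure 12: the normal
    mode is searched first, and its result prunes and constrains the
    inter-layer evaluations."""
    best = evaluate_normal()               # step 1200
    evaluations = 0
    for mode in select_il_modes(best):     # steps 1201/1202/1203/1207
        result = evaluate_il(mode, best)   # step 1205: constrained refinement
        evaluations += 1
        if result.cost < best.cost:
            best = result                  # the inter-layer mode wins so far
        if evaluations >= budget:          # overall cap on candidate evaluations
            break
    return best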
It will be understood that the present invention has been described above purely by way of example, and modification of detail can be made within the scope of the invention.
Each feature disclosed in the description, and (where appropriate) the claims and drawings may be provided independently or in any appropriate combination.

Claims (27)

  1. A method for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the method comprising, for a unit of data of an enhancement layer to be encoded, evaluating a plurality of inter coding modes in which a residual is obtained from a temporal predictor, including at least a first encoding mode in which said residual is predicted using information from a reference layer; wherein fewer candidate predictor evaluations are performed for said first encoding mode than for other inter coding modes.
  2. A method according to Claim 1, wherein candidate predictor evaluations for said other inter coding modes are reused for evaluations of said first coding mode.
  3. A method according to Claim 1 or Claim 2, wherein for each candidate predictor, said first mode and said other modes are evaluated at substantially the same time.
  4. A method according to Claim 2 or Claim 3, wherein pixel values of a predictor for said other modes are iteratively computed, and wherein such pixel values are stored and used to compute equivalent values for said first interlayer mode.
  5. A method according to any preceding claim, wherein candidate predictor evaluations are performed in a series of sub-processes of increasing refinement, and wherein the resultant predictor output from at least one sub-process from evaluation of said other inter modes is used as the corresponding predictor for said first mode.
  6. A method according to any preceding claim, wherein candidate predictor evaluations are performed in a series of sub-processes of increasing refinement, and wherein the resultant predictor output from at least one sub-process from evaluation of said other inter modes is used as an initial predictor, to define a search window for said first mode.
  7. A method according to Claim 5 or Claim 6, wherein said sub-processes include integer pixel search and sub-pixel search.
  8. A method according to any one of Claims 5 to 7, wherein said sub-processes include unidirectional search and bi-directional search.
  9. A method according to any preceding claim, wherein a search window is used for determining candidate predictors, and wherein the search window for said first mode is smaller than the equivalent search window used for other inter modes.
  10. A method according to Claim 9, wherein the search window used for said first mode is a specified fraction of the size of the equivalent search window used for the other inter modes.
  11. A method according to any preceding claim, wherein evaluation of each mode comprises calculating a rate distortion cost.
  12. A method according to any preceding claim, wherein the candidate predictor evaluations performed for said first mode are controlled according to one or more parameters determined from the evaluation of other modes.
  13. A method according to Claim 12, wherein said parameters include at least one of distortion, block partitioning type, prediction mode or sub-mode, and motion values.
  14. A method according to Claim 12 or Claim 13, wherein said parameters include whether or not an inter skip sub-mode has been selected from said evaluation of said other inter modes.
  15. A method according to any one of Claims 12 to 14, wherein said parameters include whether or not a distortion value or residual size determined for said other modes exceeds a threshold.
  16. A method according to any one of Claims 12 to 15, wherein controlling candidate predictor evaluations for said first mode comprises omitting said evaluations.
  17. A method for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the method comprising, for a unit of data of the enhancement layer to be encoded, selecting between a plurality of coding modes, including at least one basic coding mode in which a residual is obtained from a predictor obtained in the enhancement layer, and at least one first interlayer encoding mode in which said residual is predicted using information from the reference layer, wherein intermediate coding information is determined as a result of evaluating a mode other than said first mode, and wherein evaluation of said first coding mode is adaptively controlled according to said intermediate coding information.
  18. A method according to Claim 17, wherein said intermediate coding information includes one or more of: block partitioning type, prediction mode or sub-mode, motion values, direction, predictor values, and residual values.
  19. A method according to Claim 17 or Claim 18, wherein a plurality of first interlayer modes are selectable according to said scheme, and wherein only a subset of said interlayer modes are evaluated according to said determined intermediate coding information.
  20. A method according to Claim 17 or Claim 18, wherein a plurality of first interlayer modes are selectable according to said scheme, and no interlayer modes are evaluated if a skip sub-mode is determined on evaluation of said other mode.
  21. A method according to any one of Claims 17 to 20, wherein candidate predictor evaluations are performed in a series of sub-processes of increasing refinement, and wherein the resultant predictor output from at least one sub-process from evaluation of said other mode is used as the corresponding predictor for said first interlayer mode.
  22. A method according to any one of Claims 17 to 20, wherein candidate predictor evaluations are performed in a series of sub-processes of increasing refinement, and wherein the resultant predictor output from at least one sub-process from evaluation of said other mode is used as an initial predictor, to define a search window for said first interlayer mode.
  23. A method according to any one of Claims 17 to 21, wherein said intermediate coding information comprises distortion or residual size, and wherein evaluation of said first coding mode is adaptively controlled according to whether or not said distortion or residual size exceeds a threshold.
  24. A method according to Claim 23, wherein evaluation of said first coding mode is omitted if said distortion or residual size does not exceed a threshold.
  25. Apparatus for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the apparatus comprising, for a unit of data of an enhancement layer to be encoded: means adapted to evaluate a plurality of inter coding modes in which a residual is obtained from a temporal predictor, including at least a first encoding mode in which said residual is predicted using information from a reference layer; wherein said means is adapted to perform fewer candidate predictor evaluations for said first encoding mode than for other inter coding modes.
  26. Apparatus for encoding image data according to a scalable encoding scheme having at least one enhancement layer and at least one reference layer, the apparatus comprising, for a unit of data of the enhancement layer to be encoded, means adapted to select between a plurality of coding modes, including at least one basic coding mode in which a residual is obtained from a predictor obtained in the enhancement layer, and at least one first interlayer encoding mode in which said residual is predicted using information from the reference layer, means for determining intermediate coding information as a result of evaluating candidate predictors for a mode other than said first mode, and means for controlling the evaluation of said first coding mode according to said determined intermediate coding information.
  27. Method or apparatus as hereinbefore described, and with reference to the accompanying drawings.
GB201300144A 2013-01-04 2013-01-04 Video coding Active GB2509702B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB201300144A GB2509702B (en) 2013-01-04 2013-01-04 Video coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB201300144A GB2509702B (en) 2013-01-04 2013-01-04 Video coding

Publications (3)

Publication Number Publication Date
GB201300144D0 GB201300144D0 (en) 2013-02-20
GB2509702A true GB2509702A (en) 2014-07-16
GB2509702B GB2509702B (en) 2015-04-22

Family

ID=47747987

Family Applications (1)

Application Number Title Priority Date Filing Date
GB201300144A Active GB2509702B (en) 2013-01-04 2013-01-04 Video coding

Country Status (1)

Country Link
GB (1) GB2509702B (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2512829B (en) * 2013-04-05 2015-05-27 Canon Kk Method and apparatus for encoding or decoding an image with inter layer motion information prediction according to motion information compression scheme
EP3306928A1 (en) * 2016-10-07 2018-04-11 Thomson Licensing Method and apparatus for encoding a video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020188271A1 (en) * 2019-03-20 2020-09-24 V-Nova International Limited Temporal signalling for video coding technology
WO2020188273A1 (en) * 2019-03-20 2020-09-24 V-Nova International Limited Low complexity enhancement video coding
GB2599507A (en) * 2019-03-20 2022-04-06 V Nova Int Ltd Low complexity enhancement video coding
GB2599805A (en) * 2019-03-20 2022-04-13 V Nova Int Ltd Temporal signalling for video coding technology
GB2599507B (en) * 2019-03-20 2023-04-26 V Nova Int Ltd Low complexity enhancement video coding
GB2599805B (en) * 2019-03-20 2023-09-06 V Nova Int Ltd Temporal signalling for video coding technology
US11792440B2 (en) 2019-03-20 2023-10-17 V-Nova International Limited Temporal signalling for video coding technology
GB2617783A (en) * 2019-03-20 2023-10-18 V Nova Int Ltd Temporal signalling for video coding technology
GB2617783B (en) * 2019-03-20 2024-02-07 V Nova Int Ltd Temporal signalling for video coding technology

Also Published As

Publication number Publication date
GB201300144D0 (en) 2013-02-20
GB2509702B (en) 2015-04-22

Similar Documents

Publication Publication Date Title
US11943465B2 (en) Video encoding and decoding
US10554996B2 (en) Video encoding and decoding
US10200715B2 (en) Methods and devices for encoding and decoding video pictures
US8208564B2 (en) Method and apparatus for video encoding and decoding using adaptive interpolation
KR101403343B1 (en) Method and apparatus for inter prediction encoding/decoding using sub-pixel motion estimation
US9402087B2 (en) Picture encoding method, picture decoding method, picture encoding apparatus, picture decoding apparatus, picture encoding program, and picture decoding program
CN107277546B (en) Encoding device and method, decoding device and method, and storage medium
US20140286433A1 (en) Hierarchical motion estimation for video compression and motion analysis
CN115002454A (en) Method and apparatus relating to predictive refinement using optical flow
US20150103899A1 (en) Scalable encoding and decoding
GB2509702A (en) Scalable Image Encoding Including Inter-Layer Prediction
Suzuki et al. Block-based reduced resolution inter frame coding with template matching prediction
CN112075078A (en) Synthetic prediction and restrictive merging
GB2506592A (en) Motion Vector Prediction in Scalable Video Encoder and Decoder
GB2511288A (en) Method, device, and computer program for motion vector prediction in scalable video encoder and decoder
Kim et al. Multilevel Residual Motion Compensation for High Efficiency Video Coding
CN117242479A (en) Method, apparatus and medium for video processing