GB2516223A - Method and apparatus for video coding and decoding - Google Patents

Method and apparatus for video coding and decoding

Info

Publication number
GB2516223A
GB2516223A
Authority
GB
United Kingdom
Prior art keywords
ranging information
indicator
picture
unit
views
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1312320.3A
Other versions
GB201312320D0 (en)
Inventor
Miska Matias Hannuksela
Payman Aflaki Beni
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Oyj
Original Assignee
Nokia Oyj
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Oyj
Priority to GB1312320.3A
Publication of GB201312320D0
Publication of GB2516223A

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 - Processing image signals
    • H04N13/128 - Adjusting depth or disparity
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 - Processing image signals
    • H04N13/161 - Encoding, multiplexing or demultiplexing different image signal components

Abstract

An apparatus, for stereoscopic (3DTV) video encoding, comprises: a depth data generator determining ranging information for a set of views; means to determine a unit (e.g. a subset of views, a set of pictures within each view or a spatial region within a picture) associated with the set of views; a depth data validity determiner determining the validity of ranging information for the unit; and an indicator inserter to generate at least one indicator based on ranging information validity. Also claimed is a method comprising receiving a signal comprising at least one indicator based on ranging information validity, and controlling ranging information utilisation based on this indicator. Determining the validity of ranging information may comprise: determining ranging information is not output from decoding; ranging information being marked unavailable or invalid; hole filling employment; ranging information not to be used for depth-image based rendering; determining that ranging information can be replaced with a depth estimation algorithm output from decoded texture views; ranging information being partially unavailable / invalid with respect to a first or second portion of views; ranging information being (un)available / (in)valid for first or second processing operations; the quality of ranging information being below a threshold, etc.

Description

METHOD AND APPARATUS FOR VIDEO CODING AND DECODING
TECHNICAL FIELD
The present application relates generally to an apparatus, a method and a computer program for video coding and decoding.
BACKGROUND
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued.
Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
Scalable video coding refers to a coding structure where one bitstream can contain multiple representations of the content at different bitrates, resolutions, frame rates and/or other types of scalability. A scalable bitstream may consist of a base layer providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers. In order to improve coding efficiency for the enhancement layers, the coded representation of that layer may depend on the lower layers. Each layer together with all its dependent layers is one representation of the video signal at a certain spatial resolution, temporal resolution, quality level, and/or operation point of other types of scalability.
Various technologies for providing three-dimensional (3D) video content are currently investigated and developed. Especially, intense studies have been focused on various multiview applications wherein a viewer is able to see only one pair of stereo video from a specific viewpoint and another pair of stereo video from a different viewpoint. One of the most feasible approaches for such multiview applications has turned out to be such wherein only a limited number of input views, e.g. a mono or a stereo video plus some supplementary data, is provided to a decoder side and all required views are then rendered (i.e. synthesized) locally by the decoder to be displayed on a display.
In the encoding of 3D video content, video compression systems, such as the Advanced Video Coding standard H.264/AVC, the Multiview Video Coding (MVC) extension of H.264/AVC or scalable extensions of HEVC can be used.
SUMMARY
According to a first aspect there is provided a method comprising: determining ranging information for a set of views; determining a unit associated with the set of views; determining the validity of the ranging information for the unit; and generating at least one indicator based on the validity of the ranging information.
Determining the unit associated with the set of views may comprise determining at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
Determining the validity of the ranging information for the unit further may comprise at least one of: determining the ranging information for the associated unit is not output from a decoding process; determining the ranging information for the associated unit is marked unavailable or invalid; determining a hole filling process is to be employed for the ranging information for the associated unit; determining the ranging information for the associated unit is not to be used for depth-image based rendering; determining the ranging information for the associated unit can be replaced with an output of a depth estimation algorithm from decoded texture views in a decoder section; determining the ranging information for the associated unit is unavailable or invalid; determining the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; determining the first portion of the views for which ranging information for the associated unit is partially unavailable or invalid; determining the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; determining the second portion of the views for which ranging information for the associated unit is partially available or valid; determining the ranging information for the associated unit is unavailable or invalid for a first at least one processing operation; determining the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; determining the ranging information for the associated unit is available or valid for a second at least one processing operation; determining the second at least one processing operation for which the ranging information for the associated unit is available or valid; and determining the quality of the ranging information for the associated unit is below a determined threshold.
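By way of a purely illustrative, non-normative sketch of the above, an encoder-side validity check and indicator generation could look as follows in Python; the names (RangingUnit, RangingValidity, the quality threshold) are assumptions introduced only for this illustration and do not appear in the original text.

    # Illustrative sketch only: deciding the validity of ranging (depth) information
    # for a unit and generating a corresponding indicator. All names and the
    # quality threshold are assumptions.
    from dataclasses import dataclass
    from enum import Enum
    from typing import List, Optional, Tuple

    class RangingValidity(Enum):
        VALID = 0        # usable, e.g. for depth-image based rendering
        INVALID = 1      # marked unavailable or invalid
        REPLACEABLE = 2  # may be replaced by decoder-side depth estimation from texture views
        NOT_OUTPUT = 3   # decoded but not output from the decoding process

    @dataclass
    class RangingUnit:
        # A unit associated with the set of views: a subset of views, a set of
        # pictures within each view, and/or a spatial region within a picture.
        view_ids: List[int]
        picture_range: Tuple[int, int]                        # (first, last) picture order count
        region: Optional[Tuple[int, int, int, int]] = None    # (x, y, width, height)

    def determine_validity(unit: RangingUnit, quality: float,
                           estimable_from_texture: bool,
                           quality_threshold: float = 0.5) -> RangingValidity:
        # Quality below a determined threshold: treat the ranging information as
        # invalid, or as replaceable if a depth estimation algorithm applied to the
        # decoded texture views could substitute for it.
        if quality < quality_threshold:
            return (RangingValidity.REPLACEABLE if estimable_from_texture
                    else RangingValidity.INVALID)
        return RangingValidity.VALID

    def generate_indicator(unit: RangingUnit, validity: RangingValidity) -> dict:
        # The indicator could subsequently be written into e.g. a sequence or picture
        # parameter set, an SEI message, or a slice header.
        return {"unit": unit, "validity": validity.value}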
Generating at least one indicator based on the validity of the ranging information may comprise at least one of: generating at least one indicator based on the validity of the ranging information within a sequence parameter set; generating at least one indicator based on the validity of the ranging information within a picture parameter set; generating at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and generating at least one indicator based on the validity of the ranging information within a slice header.
Generating at least one indicator based on the validity of the ranging information within a sequence parameter set may comprise generating a sequence parameter such as

    seq_parameter_set_3davc_extension( ) {                      C    Descriptor
        ...
        if( NumDepthViews > ... )
            depth_output_flag                                        u(1)
        ...
    }

wherein the depth_output_flag is the at least one indicator.
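As a non-normative sketch of how such a one-bit depth_output_flag could be written and parsed with a u(1) descriptor, conditioned on the presence of depth views, the following Python fragment may be considered; the bit writer/reader and the exact condition are assumptions made only for this illustration.

    # Illustrative sketch only: writing and reading a one-bit depth_output_flag in the
    # spirit of the seq_parameter_set_3davc_extension( ) table above. The condition on
    # the number of depth views is an assumption.
    class BitWriter:
        def __init__(self):
            self.bits = []

        def u(self, n, value):
            # unsigned integer using n bits, most significant bit written first
            self.bits.extend(((value >> (n - 1 - i)) & 1) for i in range(n))

    class BitReader:
        def __init__(self, bits):
            self.bits = list(bits)
            self.pos = 0

        def u(self, n):
            value = 0
            for _ in range(n):
                value = (value << 1) | self.bits[self.pos]
                self.pos += 1
            return value

    def write_sps_3davc_extension(bw, num_depth_views, depth_output_flag):
        if num_depth_views > 0:                    # assumed condition
            bw.u(1, 1 if depth_output_flag else 0)

    def parse_sps_3davc_extension(br, num_depth_views):
        depth_output_flag = 1                      # assumed value when the flag is not present
        if num_depth_views > 0:
            depth_output_flag = br.u(1)
        return depth_output_flag

In such a scheme a decoder finding depth_output_flag equal to 0 could, for example, still decode the depth views for prediction purposes but refrain from outputting them from the decoding process.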
Generating at least one indicator based on the validity of the ranging information may comprise at least one of: generating the at least one indicator based on a scope of the validity of the ranging information; and generating the at least one indicator based on a scope of the persistence of the ranging information.
Generating at least one indicator based on the validity of the ranging information may comprise generating a map of indicators defining which ranging information areas are available and which areas are not.
Generating a map of indicators may comprise generating a quadtree structure map of indicators.
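A quadtree of such indicators could, for instance, signal a single availability indicator for a whole quadrant of a depth picture and subdivide only where availability is mixed. The following Python sketch, which assumes a square, power-of-two sized boolean availability mask, is purely illustrative.

    # Illustrative sketch only: building a quadtree map of indicators from a per-sample
    # availability mask of the ranging information. A node carries one indicator when
    # the area is uniformly available or unavailable, and subdivides otherwise.
    def build_quadtree(mask, x, y, size, min_size=8):
        samples = [mask[y + j][x + i] for j in range(size) for i in range(size)]
        if len(set(samples)) == 1 or size <= min_size:
            # Leaf node: one indicator for the whole area (majority vote at min_size).
            return {"x": x, "y": y, "size": size,
                    "available": sum(samples) * 2 >= len(samples)}
        half = size // 2
        return {"x": x, "y": y, "size": size, "children": [
            build_quadtree(mask, x,        y,        half, min_size),
            build_quadtree(mask, x + half, y,        half, min_size),
            build_quadtree(mask, x,        y + half, half, min_size),
            build_quadtree(mask, x + half, y + half, half, min_size),
        ]}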
The method may further comprise controlling the encoding of the ranging information based on the validity of the ranging information.
Generating at least one indicator based on the validity of the ranging information may comprise generating the at least one indicator associated with the controlling of the encoding of the ranging information based on the validity of the ranging information.
According to a second aspect there is provided a method comprising: receiving a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and controlling a utilization of ranging information based on the at least one indicator.
The signal comprises the ranging information for a set of views and the method may further comprise: determining or obtaining a unit associated with the set of views; and determining the ranging information validity for the unit based on the at least one indicator.
Controlling the utilization of ranging information may comprise at least one of: omitting the output of the ranging information for the unit from a decoding process; outputting the ranging information for the unit from the decoding process; omitting a first at least one processing operation using the ranging information for the unit; and performing a second at least one processing operation using the ranging information for the unit.
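On the decoder side, such an indicator could steer whether the decoded ranging information is output at all and which processing operations make use of it. A purely illustrative Python sketch follows, reusing the validity values assumed in the earlier encoder-side sketch; the callables passed in are placeholders.

    # Illustrative sketch only: controlling the utilization of ranging information for a
    # unit based on a received indicator (values as in the earlier encoder-side sketch).
    def control_ranging_utilization(indicator, decoded_depth, decoded_textures,
                                    estimate_depth_from_textures, render_view):
        validity = indicator["validity"]
        if validity == 0:            # VALID: use the decoded ranging information
            depth_for_rendering = decoded_depth
        elif validity == 2:          # REPLACEABLE: use decoder-side depth estimation instead
            depth_for_rendering = estimate_depth_from_textures(decoded_textures)
        else:                        # INVALID or NOT_OUTPUT: omit the output and skip
            depth_for_rendering = None   # depth-image based rendering for this unit
        if depth_for_rendering is not None:
            return render_view(decoded_textures, depth_for_rendering)
        return None                  # e.g. fall back to displaying the coded views only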
The ranging information for a set of views may comprise invalid or unavailable ranging information, and wherein the method may further comprise replacing the invalid or unavailable ranging information based on the at least one indicator.
The at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; at least one indicator of the first portion for which ranging information for the associated unit is partially unavailable or invalid; at least one indicator that the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; and at least one indicator of the second portion for which ranging information for the associated unit is partially available or valid.
Determining or obtaining the unit associated with the set of views may comprise determining at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
Controlling a utilization of ranging information based on the at least one indicator may comprise performing at least one processing operation on the at least partial ranging information for a set of views for at least one processing operation based on the at least one indicator, wherein the at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is unavailable or invalid for at least one processing operation; at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator that the ranging information for the associated unit is available or valid for at least one processing operation; and at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is available or valid.
The at least one indicator may comprise at least one of: at least one indicator indicating the ranging information is not output from a decoding process; at least one indicator indicating the ranging information is marked unavailable or invalid; at least one indicator indicating a hole filling process is to be employed for the ranging information; at least one indicator indicating the ranging information is not to be used for depth-image based rendering; at least one indicator indicating the ranging information can be replaced with an output of a depth estimation algorithm from decoded texture views; at least one indicator indicating the ranging information is unavailable or invalid; at least one indicator indicating the ranging information is partially unavailable or invalid with respect to a first portion of the views; at least one indicator indicating the first portion for which ranging information is partially unavailable or invalid; at least one indicator indicating the ranging information is partially available or valid with respect to a second portion of the views; at least one indicator indicating the second portion for which ranging information for the associated unit is partially available or valid; at least one indicator indicating the ranging information is unavailable or invalid for a first at least one processing operation; at least one indicator indicating the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator indicating the ranging information is available or valid for a second at least one processing operation; at least one indicator indicating the second at least one processing operation for which the ranging information for the associated unit is available or valid; and at least one indicator indicating the quality of the ranging information is below a determined threshold.
Determining a signal comprising at least one indicator may comprise determining at least one of: determining at least one indicator based on the validity of the ranging information within a sequence parameter set; determining at least one indicator based on the validity of the ranging information within a picture parameter set; determining at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and determining at least one indicator based on the validity of the ranging information within a slice header.
Controlling a utilization of ranging information based on the at least one indicator may comprise at least one of: controlling a utilization of ranging information based on a scope of the validity of the ranging information; and controlling a utilization of ranging information based on a scope of the persistence of the ranging information.
Determining a signal comprising at least one indicator may comprise determining a map of indicators defining which ranging information areas are available and which areas are not.
According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to: determine ranging information for a set of views; determine a unit associated with the set of views; determine the validity of the ranging information for the unit; and generate at least one indicator based on the validity of the ranging information.
Determining the unit associated with the set of views may cause the apparatus to determine at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
Determining the validity of the ranging information for the unit may further cause the apparatus to determine at least one of: the ranging information for the associated unit is not output from a decoding process; the ranging information for the associated unit is marked unavailable or invalid; a hole filling process is to be employed for the ranging information for the associated unit; the ranging information for the associated unit is not to be used for depth-image based rendering; the ranging information for the associated unit can be replaced with an output of a depth estimation algorithm from decoded texture views in a decoder section; the ranging information for the associated unit is unavailable or invalid; the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; the first portion of the views for which ranging information for the associated unit is partially unavailable or invalid; the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; the second portion of the views for which ranging information for the associated unit is partially available or valid; the ranging information for the associated unit is unavailable or invalid for a first at least one processing operation; the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; the ranging information for the associated unit is available or valid for a second at least one processing operation; the second at least one processing operation for which the ranging information for the associated unit is available or valid; and the quality of the ranging information for the associated unit is below a determined threshold.
Generating at least one indicator based on the validity of the ranging information may cause the apparatus to generate at least one of: at least one indicator based on the validity of the ranging information within a sequence parameter set; at least one indicator based on the validity of the ranging information within a picture parameter set; at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and at least one indicator based on the validity of the ranging information within a slice header.
Generating at least one indicator based on the validity of the ranging information within a sequence parameter set may cause the apparatus to generate a sequence parameter such as

    seq_parameter_set_3davc_extension( ) {                      C    Descriptor
        ...
        if( NumDepthViews > ... )
            depth_output_flag                                        u(1)
        ...
    }
wherein the depth_output_flag is the at least one indicator.
Generating at least one indicator based on the validity of the ranging information may cause the apparatus to generate at least one of: at least one indicator based on a scope of the validity of the ranging information; and at least one indicator based on a scope of the persistence of the ranging information.
Generating at least one indicator based on the validity of the ranging information may cause the apparatus to generate a map of indicators defining which ranging information areas are available and which areas are not.
Generating a map of indicators may cause the apparatus to generate a quadtree structure map of indicators.
The apparatus may be further caused to control the encoding of the ranging information based on the validity of the ranging information.
Generating at least one indicator based on the validity of the ranging information may cause the apparatus to generate the at least one indicator associated with the controlling of the encoding of the ranging information based on the validity of the ranging information.
According to a fourth aspect there is provided an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to: receive a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and control a utilization of ranging information based on the at least one indicator.
The signal comprises the ranging information for a set of views and the apparatus may be further caused to: determine or obtain a unit associated with the set of views; and determine the ranging information validity for the unit based on the at least one indicator.
Controlling the utilization of ranging information may cause the apparatus to perform at least one of: omit the output of the ranging information for the unit from a decoding process; output the ranging information for the unit from the decoding process; omit a first at least one processing operation using the ranging information for the unit; and perform a second at least one processing operation using the ranging information for the unit.
The ranging information for a set of views may comprise invalid or unavailable ranging information, and wherein the apparatus may be further caused to replace the invalid or unavailable ranging information based on the at least one indicator.
The at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; at least one indicator of the first portion for which ranging information for the associated unit is partially unavailable or invalid; at least one indicator that the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; and at least one indicator of the second portion for which ranging information for the associated unit is partially available or valid.
Determining or obtaining a unit associated with the set of views may cause the apparatus to determine at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
Controlling a utilization of ranging information based on the at least one indicator may cause the apparatus to perform at least one processing operation on the at least partial ranging information for a set of views for at least one processing operation based on the at least one indicator, wherein the at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is unavailable or invalid for at least one processing operation; at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator that the ranging information for the associated unit is available or valid for at least one processing operation; and at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is available or valid.
The at least one indicator may comprise at least one of: at least one indicator indicating the ranging information is not output from a decoding process; at least one indicator indicating the ranging information is marked unavailable or invalid; at least one indicator indicating a hole filling process is to be employed for the ranging information; at least one indicator indicating the ranging information is not to be used for depth-image based rendering; at least one indicator indicating the ranging information can be replaced with an output of a depth estimation algorithm from decoded texture views; at least one indicator indicating the ranging information is unavailable or invalid; at least one indicator indicating the ranging information is partially unavailable or invalid with respect to a first portion of the views; at least one indicator indicating the first portion for which ranging information is partially unavailable or invalid; at least one indicator indicating the ranging information is partially available or valid with respect to a second portion of the views; at least one indicator indicating the second portion for which ranging information for the associated unit is partially available or valid; at least one indicator indicating the ranging information is unavailable or invalid for a first at least one processing operation; at least one indicator indicating the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator indicating the ranging information is available or valid for a second at least one processing operation; at least one indicator indicating the second at least one processing operation for which the ranging information for the associated unit is available or valid; and at least one indicator indicating the quality of the ranging information is below a determined threshold.
Determining a signal comprising at least one indicator may cause the apparatus to determine at least one of: at least one indicator based on the validity of the ranging information within a sequence parameter set; at least one indicator based on the validity of the ranging information within a picture parameter set; at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and at least one indicator based on the validity of the ranging information within a slice header.
Controlling a utilization of ranging information based on the at least one indicator may cause the apparatus to control at least one of: a utilization of ranging information based on a scope of the validity of the ranging information; and a utilization of ranging information based on a scope of the persistence of the ranging information.
Determining a signal comprising at least one indicator may cause the apparatus to determine a map of indicators defining which ranging information areas are available and which areas are not.
According to a fifth aspect there is provided an apparatus comprising: means for determining ranging information for a set of views; means for determining a unit associated with the set of views; means for determining the validity of the ranging information for the unit; and means for generating at least one indicator based on the validity of the ranging information.
The means for determining the unit associated with the set of views may comprise at least one of: means for determining a subset of views as the unit; means for determining a set of pictures within each view in the set of views as the unit; and means for determining a spatial region within at least one picture within the set of pictures as the unit.
The means for determining the validity of the ranging information for the unit may further comprise at least one of: means for determining the ranging information for the associated unit is not output from a decoding process; means for determining the ranging information for the associated unit is marked unavailable or invalid; means for determining a hole filling process is to be employed for the ranging information for the associated unit; means for determining the ranging information for the associated unit is not to be used for depth-image based rendering; means for determining the ranging information for the associated unit can be replaced with an output of a depth estimation algorithm from decoded texture views in a decoder section; means for determining the ranging information for the associated unit is unavailable or invalid; means for determining the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; means for determining the first portion of the views for which ranging information for the associated unit is partially unavailable or invalid; means for determining the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; means for determining the second portion of the views for which ranging information for the associated unit is partially available or valid; means for determining the ranging information for the associated unit is unavailable or invalid for a first at least one processing operation; means for determining the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; means for determining the ranging information for the associated unit is available or valid for a second at least one processing operation; means for determining the second at least one processing operation for which the ranging information for the associated unit is available or valid; and means for determining the quality of the ranging information for the associated unit is below a determined threshold.
The means for generating at least one indicator based on the validity of the ranging information may comprise at least one of: means for generating at least one indicator based on the validity of the ranging information within a sequence parameter set; means for generating at least one indicator based on the validity of the ranging information within a picture parameter set; means for generating at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and means for generating at least one indicator based on the validity of the ranging information within a slice header.
The means for generating at least one indicator based on the validity of the ranging information within a sequence parameter set may comprise means for generating a sequence parameter such as

    seq_parameter_set_3davc_extension( ) {                      C    Descriptor
        ...
        if( NumDepthViews > ... )
            depth_output_flag                                        u(1)
        ...
    }

wherein the depth_output_flag is the at least one indicator.
The means for generating at least one indicator based on the validity of the ranging information may comprise at least one of: means for generating at least one indicator based on a scope of the validity of the ranging information; and means for generating at least one indicator based on a scope of the persistence of the ranging information.
The means for generating at least one indicator based on the validity of the ranging information may comprise means for generating a map of indicators defining which ranging information areas are available and which areas are not.
The means for generating a map of indicators may comprise means for generating a quadtree structure map of indicators.
The apparatus may further comprise means for controlling the encoding of the ranging information based on the validity of the ranging information.
The means for generating at least one indicator based on the validity of the ranging information may comprise means for generating the at least one indicator associated with the controlling of the encoding of the ranging information based on the validity of the ranging information.
According to a sixth aspect there is provided an apparatus comprising: means for receiving a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and means for controlling a utilization of ranging information based on the at least one indicator.
The signal comprises the ranging information for a set of views and the apparatus may further comprise: means for determining or obtaining a unit associated with the set of views; and means for determining the ranging information validity for the unit based on the at least one indicator.
The means for controlling the utilization of ranging information may comprise at least one of: means for omitting the output of the ranging information for the unit from a decoding process; means for outputting the ranging information for the unit from the decoding process; means for omitting a first at least one processing operation using the ranging information for the unit; and means for performing a second at least one processing operation using the ranging information for the unit.
The ranging information for a set of views may comprise invalid or unavailable ranging information, and wherein the apparatus may further comprise means for replacing the invalid or unavailable ranging information based on the at least one indicator.
The at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; at least one indicator of the first portion for which ranging information for the associated unit is partially unavailable or invalid; at least one indicator that the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; and at least one indicator of the second portion for which ranging information for the associated unit is partially available or valid.
The means for determining or obtaining a unit associated with the set of views may comprise at least one of: means for determining a subset of views; means for determining a set of pictures within each view in the set of views; and means for determining a spatial region within at least one picture within the set of pictures as the unit.
The means for controlling a utilization of ranging information based on the at least one indicator may comprise means for performing at least one processing operation on the at least partial ranging information for a set of views for at least one processing operation based on the at least one indicator, wherein the at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is unavailable or invalid for at least one processing operation; at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator that the ranging information for the associated unit is available or valid for at least one processing operation; and at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is available or valid.
The indicator may comprise at least one of: at least one indicator indicating the ranging information is not output from a decoding process; at least one indicator indicating the ranging information is marked unavailable or invalid; at least one indicator indicating a hole filling process is to be employed for the ranging information; at least one indicator indicating the ranging information is not to be used for depth-image based rendering; at least one indicator indicating the ranging information can be replaced with an output of a depth estimation algorithm from decoded texture views; at least one indicator indicating the ranging information is unavailable or invalid; at least one indicator indicating the ranging information is partially unavailable or invalid with respect to a first portion of the views; at least one indicator indicating the first portion for which ranging information is partially unavailable or invalid; at least one indicator indicating the ranging information is partially available or valid with respect to a second portion of the views; at least one indicator indicating the second portion for which ranging information for the associated unit is partially available or valid; at least one indicator indicating the ranging information is unavailable or invalid for a first at least one processing operation; at least one indicator indicating the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator indicating the ranging information is available or valid for a second at least one processing operation; at least one indicator indicating the second at least one processing operation for which the ranging information for the associated unit is available or valid; and at least one indicator indicating the quality of the ranging information is below a determined threshold.
The means for determining a signal comprising at least one indicator may comprise at least one of: means for determining at least one indicator based on the validity of the ranging information within a sequence parameter set; means for determining at least one indicator based on the validity of the ranging information within a picture parameter set; means for determining at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and means for determining at least one indicator based on the validity of the ranging information within a slice header.
The means for controlling a utilization of ranging information based on the at least one indicator may comprise at least one of: means for controlling a utilization of ranging information based on a scope of the validity of the ranging information; and means for controlling a utilization of ranging information based on a scope of the persistence of the ranging information.
The means for determining a signal comprising at least one indicator may comprise means for determining a map of indicators defining which ranging information areas are available and which areas are not.
According to a seventh aspect there is provided an apparatus comprising: a depth data generator configured to determine ranging information for a set of views; a unit determiner configured to determine a unit associated with the set of views; a depth data validity determiner configured to determine the validity of the ranging information for the unit; and an indicator inserter configured to generate at least one indicator based on the validity of the ranging information.
The unit determiner may be configured to determine at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
The depth map generator may further be configured to determine at least one of: the ranging information for the associated unit is not output from a decoding process; the ranging information for the associated unit is marked unavailable or invalid; a hole filling process is to be employed for the ranging information for the associated unit; the ranging information for the associated unit is not to be used for depth-image based rendering; the ranging information for the associated unit can be replaced with an output of a depth estimation algorithm from decoded texture views in a decoder section; the ranging information for the associated unit is unavailable or invalid; the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; the first portion of the views for which ranging information for the associated unit is partially unavailable or invalid; the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; the second portion of the views for which ranging information for the associated unit is partially available or valid; the ranging information for the associated unit is unavailable or invalid for a first at least one processing operation; the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; the ranging information for the associated unit is available or valid for a second at least one processing operation; the second at least one processing operation for which the ranging information for the associated unit is available or valid; and the quality of the ranging information for the associated unit is below a determined threshold.
The indicator inserter may be configured to generate at least one of: at least one indicator based on the validity of the ranging information within a sequence parameter set; at least one indicator based on the validity of the ranging information within a picture parameter set; at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and at least one indicator based on the validity of the ranging information within a slice header.
The indicator inserter may be configured to generate a sequence parameter such as

    seq_parameter_set_3davc_extension( ) {                      C    Descriptor
        ...
        if( NumDepthViews > ... )
            depth_output_flag                                        u(1)
        ...
    }

wherein the depth_output_flag is the at least one indicator.
The indicator inserter may be configured to generate at least one of: at least one indicator based on a scope of the validity of the ranging information; and at least one indicator based on a scope of the persistence of the ranging information.
The indicator inserter may be configured to generate a map of indicators defining which ranging information areas are available and which areas are not.
The indicator inserter configured to generate a map of indicators may be configured to generate a quadtree structure map of indicators.
The apparatus may further comprise a controller configured to control the encoding of the ranging information based on the validity of the ranging information.
The indicator inserter may be configured to generate the at least one indicator associated with the controlling of the encoding of the ranging information based on the validity of the ranging information.
According to an eighth aspect there is provided an apparatus comprising: an indicator detector configured to receive a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and a decoder controller configured to control a utilization of ranging information based on the at least one indicator.
The signal comprises the ranging information for a set of views and the apparatus may further comprise: a unit determiner configured to determine or obtain a unit associated with the set of views; and an indicator determiner configured to determine the ranging information validity for the unit based on the at least one indicator.
The decoder controller may be configured to perform at least one of: omit the output of the ranging information for the unit from a decoding process; output the ranging information for the unit from the decoding process; omit a first at least one processing operation using the ranging information for the unit; and perform a second at least one processing operation using the ranging information for the unit.
The ranging information for a set of views may comprise invalid or unavailable ranging information, and wherein the apparatus may further comprise a processor configured to replace the invalid or unavailable ranging information based on the at least one indicator.
The at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; at least one indicator of the first portion for which ranging information for the associated unit is partially unavailable or invalid; at least one indicator that the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; and at least one indicator of the second portion for which ranging information for the associated unit is partially available or valid.
The unit determiner may be configured to determine at least one of: a subset of views; a set of pictures within each view in the set of views; and a spatial region within at least one picture within the set of pictures as the unit.
The decoder controller may be configured to control at least one processing operation on the at least partial ranging information for a set of views based on the at least one indicator, wherein the at least one indicator may comprise at least one of: at least one indicator that the ranging information for the associated unit is unavailable or invalid for at least one processing operation; at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator that the ranging information for the associated unit is available or valid for at least one processing operation; and at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is available or valid.
The indicator may comprise at least one of: at least one indicator indicating the ranging information is not output from a decoding process; at least one indicator indicating the ranging information is marked unavailable or invalid; at least one indicator indicating a hole filling process is to be employed for the ranging information; at least one indicator indicating the ranging information is not to be used for depth-image based rendering; at least one indicator indicating the ranging information can be replaced with an output of a depth estimation algorithm from decoded texture views; at least one indicator indicating the ranging information is unavailable or invalid; at least one indicator indicating the ranging information is partially unavailable or invalid with respect to a first portion of the views; at least one indicator indicating the first portion for which ranging information is partially unavailable or invalid; at least one indicator indicating the ranging information is partially available or valid with respect to a second portion of the views; at least one indicator indicating the second portion for which ranging information for the associated unit is partially available or valid; at least one indicator indicating the ranging information is unavailable or invalid for a first at least one processing operation; at least one indicator indicating the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator indicating the ranging information is available or valid for a second at least one processing operation; at least one indicator indicating the second at least one processing operation for which the ranging information for the associated unit is available or valid; and at least one indicator indicating the quality of the ranging information is below a determined threshold.
The indicator detector may be configured to determine at least one of: at least one indicator based on the validity of the ranging information within a sequence parameter set; at least one indicator based on the validity of the ranging information within a picture parameter set; at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and at least one indicator based on the validity of the ranging information within a slice header.
The decoder controller may control at least one of: a utilization of ranging information based on a scope of the validity of the ranging information; and a utilization of ranging information based on a scope of the persistence of the ranging information.
The indicator detector may be configured to determine a map of indicators defining which ranging information areas are available and which areas are not.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more complete understanding of example embodiments of the present invention, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
Figure 1 shows schematically an electronic device employing some embodiments of the invention;
Figure 2 shows schematically a user equipment suitable for employing some embodiments of the invention;
Figure 3 further shows schematically electronic devices employing embodiments of the invention connected using wireless and/or wired network connections;
Figure 4a shows schematically an embodiment of an encoder;
Figure 4b shows schematically an embodiment of a depth data validity apparatus of the encoder according to some embodiments;
Figure 5a shows schematically an embodiment of a decoder;
Figure 5b shows schematically an embodiment of a depth data validity decoding apparatus according to some embodiments;
Figure 6a illustrates an example of spatial and temporal prediction of a prediction unit;
Figure 6b illustrates another example of spatial and temporal prediction of a prediction unit;
Figure 6c depicts an example for direct-mode motion vector inference;
Figure 7 shows an example of a picture consisting of two tiles;
Figure 8 shows a simplified model of a DIBR-based 3DV system;
Figure 9 shows a simplified 2D model of a stereoscopic camera setup;
Figure 10 depicts an example of a current block and five spatial neighbours usable as motion prediction candidates;
Figure 11a illustrates operation of the HEVC merge mode for multiview video;
Figure 11b illustrates operation of the HEVC merge mode for multiview video utilizing an additional reference index;
Figure 12 depicts some examples of asymmetric stereoscopic video coding types;
Figure 13 shows an example of mapping of a depth map into another view;
Figure 14 shows an example of generation of an initial depth map estimate after coding a first dependent view of a random access unit;
Figure 15 shows an example of derivation of a depth map estimate for the current picture using motion parameters of an already coded view of the same access unit;
Figure 16 shows an example of updating of a depth map estimate for a dependent view based on coded motion and disparity vectors;
Figure 17 shows an example of locations of five spatial neighbouring blocks of a current block;
Figure 18 shows an example of locations of temporal neighbouring blocks in temporal candidate pictures;
Figure 19 shows an example where for a second prediction unit a certain block is not used for disparity vector derivation;
Figure 20 shows an example where a BR block is not considered when it is located below a lower coding tree unit row of a current coding tree unit;
Figure 21 shows an example of an inter-view predicted motion vector of a motion compensated prediction coded block;
Figure 22 shows an example of derivation of a disparity from DV-MCP neighbouring blocks of a current prediction unit where blocks above a current block are not used since they are not located within a current coding tree unit;
Figure 23 shows an example where a maximum depth value is converted to disparity;
Figure 24 shows a flow diagram of the operation of the depth data validity apparatus within the encoder as shown in Figure 4b;
Figure 25 shows a flow diagram of the operation of the depth data validity apparatus within the decoder as shown in Figure 5b; and
Figure 26 illustrates an example processing flow for depth map coding within an encoder.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
In the following, several embodiments of the invention will be described in the context of one video coding arrangement. It is to be noted, however, that the invention is not limited to this particular arrangement. In fact, the different embodiments have applications widely in any environment where improvement of reference picture handling is required. For example, the invention may be applicable to video coding systems like streaming systems, DVD players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.
In the following, several embodiments are described using the convention of referring to (de)coding, which indicates that the embodiments may apply to decoding and/or encoding.
The H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organisation for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC). There have been multiple versions of the H.264/AVC standard, each integrating new extensions or features to the specification. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
The High Efficiency Video Coding (HEVC, a.k.a. H.265/HEVC) standard was developed by the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and MPEG.
Currently, the H.265/HEVC standard is undergoing the final approval ballots in ISO/IEC and ITU-T. The standard will be published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). There are currently ongoing standardization projects to develop extensions to H.265/HEVC, including scalable, multiview, three-dimensional, and fidelity range extensions.
When describing H.264/AVC and HEVC as well as in example embodiments, common notation for arithmetic operators, logical operators, relational operators, bitwise operators, assignment operators, and range notation e.g. as specified in H.264/AVC or a draft HEVC may be used. Furthermore, common mathematical functions e.g. as specified in H.264/AVC or a draft HEVC may be used and a common order of precedence and execution order (from left to right or from right to left) of operators e.g. as specified in H.264/AVC or a draft HEVC may be used.
When describing H.264/AVC and HEVC as well as in example embodiments, the following descriptors may be used to specify the parsing process of each syntax element.
- b(8): byte having any pattern of bit string (8 bits).
- se(v): signed integer Exp-Golomb-coded syntax element with the left bit first.
- u(n): unsigned integer using n bits. When n is "v" in the syntax table, the number of bits varies in a manner dependent on the value of other syntax elements. The parsing process for this descriptor is specified by n next bits from the bitstream interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- ue(v): unsigned integer Exp-Golomb-coded syntax element with the left bit first.
An Exp-Golomb bit string may be converted to a code number (codeNum) for example using the following table:

bit string    codeNum
1             0
010           1
011           2
00100         3
00101         4
00110         5
00111         6
0001000       7
0001001       8
0001010       9
...           ...

A code number corresponding to an Exp-Golomb bit string may be converted to se(v) for example using the following table:

codeNum    syntax element value
0          0
1          1
2          -1
3          2
4          -2
5          3
6          -3
...        ...
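As an illustration of the parsing process implied by the ue(v) and se(v) descriptors and the two tables above, a minimal sketch in Python is given below; the function names and the bit-iterator input are chosen for the example only and are not part of any standard.

    # Illustrative sketch: decoding ue(v) and se(v) Exp-Golomb values from a
    # sequence of bits, assuming "bits" is an iterator yielding 0/1 values.
    def decode_ue(bits):
        # Count leading zero bits, then read the same number of bits after the first 1.
        leading_zeros = 0
        while next(bits) == 0:
            leading_zeros += 1
        value = 0
        for _ in range(leading_zeros):
            value = (value << 1) | next(bits)
        return (1 << leading_zeros) - 1 + value      # codeNum

    def decode_se(bits):
        # Map codeNum to a signed value: 0, 1, -1, 2, -2, 3, -3, ...
        code_num = decode_ue(bits)
        return (code_num + 1) // 2 if code_num % 2 == 1 else -(code_num // 2)

    print(decode_ue(iter([0, 0, 1, 1, 1])))          # bit string 00111 -> codeNum 6
    print(decode_se(iter([0, 0, 1, 1, 1])))          # codeNum 6 -> se(v) value -3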
When describing H.264/AVC and HEVC as well as in example embodiments, syntax structures, semantics of syntax elements, and decoding process may be specified as follows. Syntax elements in the bitstream are represented in bold type. Each syntax element is described by its name (all lower case letters with underscore characters), optionally its one or two syntax categories, and one or two descriptors for its method of coded representation. The decoding process behaves according to the value of the syntax element and to the values of previously decoded syntax elements. When a value of a syntax element is used in the syntax tables or the text, it appears in regular (i.e., not bold) type. In some cases the syntax tables may use the values of other variables derived from syntax element values. Such variables appear in the syntax tables, or text, named by a mixture of lower case and upper case letters and without any underscore characters. Variables starting with an upper case letter are derived for the decoding of the current syntax structure and all depending syntax structures. Variables starting with an upper case letter may be used in the decoding process for later syntax structures without mentioning the originating syntax structure of the variable. Variables starting with a lower case letter are only used within the context in which they are derived. In some cases, "mnemonic" names for syntax element values or variable values are used interchangeably with their numerical values. Sometimes "mnemonic" names are used without any associated numerical values. The association of values and names is specified in the text. The names are constructed from one or more groups of letters separated by an underscore character. Each group starts with an upper case letter and may contain more upper case letters.
When describing H.264/AVC and HEVC as well as in example embodiments, a syntax structure may be specified using the following. A group of statements enclosed in curly brackets is a compound statement and is treated functionally as a single statement. A "while" structure specifies a test of whether a condition is true, and if true, specifies evaluation of a statement (or compound statement) repeatedly until the condition is no longer true. A "do ... while" structure specifies evaluation of a statement once, followed by a test of whether a condition is true, and if true, specifies repeated evaluation of the statement until the condition is no longer true. An "if ... else" structure specifies a test of whether a condition is true, and if the condition is true, specifies evaluation of a primary statement, otherwise, specifies evaluation of an alternative statement. The "else" part of the structure and the associated alternative statement is omitted if no alternative statement evaluation is needed. A "for" structure specifies evaluation of an initial statement, followed by a test of a condition, and if the condition is true, specifies repeated evaluation of a primary statement followed by a subsequent statement until the condition is no longer true.
Some key definitions, bitstream and coding structures, and concepts of H.264/AVC and HEVC are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented. Some of the key definitions, bitstream and coding structures, and concepts of H.264/AVC are the same as in a draft HEVC standard; hence, they are described below jointly. The aspects of the invention are not limited to H.264/AVC or HEVC, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized.
Similarly to many earlier video coding standards, the bitstream syntax and semantics as well as the decoding process for error-free bitstreams are specified in H.264/AVC and HEVC. The encoding process is not specified, but encoders must generate conforming bitstreams. Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD). The standards contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding is optional and no decoding process has been specified for erroneous bitstreams.
The elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.
The source and decoded pictures may each be comprised of one or more sample arrays, such as one of the following sets of sample arrays:
- Luma (Y) only (monochrome).
- Luma and two chroma (YCbCr or YCgCo).
- Green, Blue and Red (GBR, also known as RGB).
- Arrays representing other unspecified monochrome or tristimulus color samplings (for example, YZX, also known as XYZ).
In the following, these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr, regardless of the actual color representation method in use. The actual color representation method in use may be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC. A component may be defined as an array or a single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that compose a picture in monochrome format.
In H.264/AVC and HEVC, a picture may either be a frame or a field. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or may be subsampled when compared to luma sample arrays. Some chroma formats may be summarized as follows: In monochrome sampling there is only one sample array, which may be nominally considered the luma array.
In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
In 4:4:4 sampling when no separate color planes are in use, each of the two chroma arrays has the same height and width as the luma array.
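As a simple illustration of the chroma formats listed above, the following sketch computes the chroma array dimensions from the luma dimensions; the function name is illustrative only and assumes the nominal subsampling ratios described above.

    # Illustrative sketch: chroma array dimensions for a given chroma format,
    # derived from the luma width and height.
    def chroma_dimensions(luma_width, luma_height, chroma_format):
        if chroma_format == "monochrome":
            return (0, 0)                              # no chroma arrays
        if chroma_format == "4:2:0":
            return (luma_width // 2, luma_height // 2)
        if chroma_format == "4:2:2":
            return (luma_width // 2, luma_height)
        if chroma_format == "4:4:4":
            return (luma_width, luma_height)
        raise ValueError("unknown chroma format")

    print(chroma_dimensions(1920, 1080, "4:2:0"))      # (960, 540)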
In H.264/AVC and HEVC, it is possible to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream. When separate color planes are in use, each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
When chroma subsampling is in use (e.g. 4:2:0 or 4:2:2 chroma sampling), the location of chroma samples with respect to luma samples may be determined in the encoder side (e.g. as a preprocessing step or as part of encoding). The chroma sample positions with respect to luma sample positions may be predefined for example in a coding standard, such as H.264/AVC or HEVC, or may be indicated in the bitstream for example as part of VUI of H.264/AVC or HEVC.
A partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets. A picture partitioning may be defined as a division of a picture into smaller non-overlapping units. A block partitioning may be defined as a division of a block into smaller non-overlapping units, such as sub-blocks.
In some cases the term block partitioning may be considered to cover multiple levels of partitioning, for example partitioning of a picture into slices, and partitioning of each slice into smaller units, such as macroblocks of H.264/AVC. It is noted that the same unit, such as a picture, may have more than one partitioning. For example, a coding unit of a draft HEVC standard may be partitioned into prediction units and separately by another quadtree into transform units.
In H.264/AVC, a macroblock is a 16x16 block of luma samples and the corresponding blocks of chroma samples. For example, in the 4:2:0 sampling pattern, a macroblock contains one 8x8 block of chroma samples per each chroma component. In H.264/AVC, a picture is partitioned into one or more slice groups, and a slice group contains one or more slices. In H.264/AVC, a slice consists of an integer number of macroblocks ordered consecutively in the raster scan within a particular slice group.
During the course of HEVC standardization the terminology for example on picture partitioning units has evolved. In the next paragraphs, some non-limiting examples of HEVC terminology are provided.
In one draft version of the HEVC standard, pictures are divided into coding units (CU) covering the area of the picture. A CU consists of one or more prediction units (PU) defining the prediction process for the samples within the CU and one or more transform units (TU) defining the prediction error coding process for the samples in the CU. Typically, a CU consists of a square block of samples with a size selectable from a predefined set of possible CU sizes. A CU with the maximum allowed size is typically named as LCU (largest coding unit) and the video picture is divided into non-overlapping LCUs. An LCU can further be split into a combination of smaller CUs, e.g. by recursively splitting the LCU and resultant CUs. Each resulting CU may have at least one PU and at least one TU associated with it. Each PU and TU can further be split into smaller PUs and TUs in order to increase granularity of the prediction and prediction error coding processes, respectively. Each PU may have prediction information associated with it defining what kind of a prediction is to be applied for the pixels within that PU (e.g. motion vector information for inter predicted PUs and intra prediction directionality information for intra predicted PUs). Similarly, each TU may be associated with information describing the prediction error decoding process for the samples within the TU (including e.g. DCT coefficient information). It may be signalled at CU level whether prediction error coding is applied or not for each CU. In the case there is no prediction error residual associated with the CU, it can be considered there are no TUs for the CU. In some embodiments the PU splitting can be realized by splitting the CU into four equal size square PUs or splitting the CU into two rectangle PUs vertically or horizontally in a symmetric or asymmetric way. The division of the image into CUs, and division of CUs into PUs and TUs may be signalled in the bitstream allowing the decoder to reproduce the intended structure of these units.
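The recursive LCU-to-CU splitting described above may be illustrated with the following sketch; the splitting decision function is a placeholder standing in for an encoder decision (e.g. rate-distortion optimization) and is not specified by the standard text.

    # Illustrative sketch: recursive quadtree splitting of an LCU into CUs.
    # "should_split" is a placeholder for an encoder-side decision.
    def split_into_cus(x, y, size, min_cu_size, should_split):
        if size > min_cu_size and should_split(x, y, size):
            half = size // 2
            cus = []
            for dy in (0, half):
                for dx in (0, half):
                    cus += split_into_cus(x + dx, y + dy, half, min_cu_size, should_split)
            return cus
        return [(x, y, size)]      # a leaf CU covering a square block of samples

    # Example: split a 64x64 LCU whenever the block is larger than 32x32.
    print(split_into_cus(0, 0, 64, 8, lambda x, y, s: s > 32))   # four 32x32 CUs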
The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming frames in the video sequence.
In a draft HEVC standard, a picture can be partitioned into tiles, which are rectangular and contain an integer number of LCUs. In a draft HEVC standard, the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum. In a draft HEVC, a slice consists of an integer number of CUs. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
A basic coding unit in a HEVC working draft 5 is a treeblock. A treeblock is an NxN block of luma samples and two corresponding blocks of chroma samples of a picture that has three sample arrays, or an NxN block of samples of a monochrome picture or a picture that is coded using three separate colour planes. A treeblock may be partitioned for different coding and decoding processes. A treeblock partition is a block of luma samples and two corresponding blocks of chroma samples resulting from a partitioning of a treeblock for a picture that has three sample arrays or a block of luma samples resulting from a partitioning of a treeblock for a monochrome picture or a picture that is coded using three separate colour planes. Each treeblock is assigned a partition signalling to identify the block sizes for intra or inter prediction and for transform coding.
The partitioning is a recursive quadtree partitioning. The root of the quadtree is associated with the treeblock. The quadtree is split until a leaf is reached, which is referred to as the coding node. The coding node is the root node of two trees, the prediction tree and the transform tree. The prediction tree specifies the position and size of prediction blocks. The prediction tree and associated prediction data are referred to as a prediction unit. The transform tree specifies the position and size of transform blocks. The transform tree and associated transform data are referred to as a transform unit. The splitting information for luma and chroma is identical for the prediction tree and may or may not be identical for the transform tree. The coding node and the associated prediction and transform units form together a coding unit.
In a HEVC WD5, pictures are divided into slices and tiles. A slice may be a sequence of treeblocks but (when referring to a so-called fine granular slice) may also have its boundary within a treeblock at a location where a transform unit and prediction unit coincide. Treeblocks within a slice are coded and decoded in a raster scan order. For the primary coded picture, the division of each picture into slices is a partitioning.
In a HEVC WD5, a tile is defined as an integer number of treeblocks co-occurring in one column and one row, ordered consecutively in the raster scan within the tile. For the primary coded picture, the division of each picture into tiles is a partitioning. Tiles are ordered consecutively in the raster scan within the picture. Although a slice contains treeblocks that are consecutive in the raster scan within a tile, these treeblocks are not necessarily consecutive in the raster scan within the picture. Slices and tiles need not contain the same sequence of treeblocks. A tile may comprise treeblocks contained in more than one slice. Similarly, a slice may comprise treeblocks contained in several tiles.
A distinction between coding units and coding treeblocks may be defined for example as follows. A slice may be defined as a sequence of one or more coding tree units (CTU) in raster-scan order within a tile or within a picture if tiles are not in use. Each CTU may comprise one luma coding treeblock (CTB) and possibly (depending on the chroma format being used) two chroma CTBs. A CTU may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to code the samples. The division of a slice into coding tree units may be regarded as a partitioning. A CTB may be defined as an NxN block of samples for some value of N. The division of one of the arrays that compose a picture that has three sample arrays or of the array that compose a picture in monochrome format or a picture that is coded using three separate colour planes into coding tree blocks may be regarded as a partitioning. A coding block may be defined as an NxN block of samples for some value of N. The division of a coding tree block into coding blocks may be regarded as a partitioning.
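As an illustration of dividing a picture into coding tree blocks of size NxN, the following sketch computes the size of the CTB grid, rounding up at the right and bottom picture boundaries; the function name is chosen for the example only.

    # Illustrative sketch: number of CTBs needed to cover a picture for a CTB size N.
    import math

    def ctb_grid(pic_width, pic_height, ctb_size):
        width_in_ctbs = math.ceil(pic_width / ctb_size)
        height_in_ctbs = math.ceil(pic_height / ctb_size)
        return width_in_ctbs, height_in_ctbs, width_in_ctbs * height_in_ctbs

    print(ctb_grid(1920, 1080, 64))     # (30, 17, 510)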
Figure 7 shows an example of a picture consisting of two tiles partitioned into square coding units (solid lines) which have further been partitioned into rectangular prediction units (dashed lines).
In H.264/AVC and HEVC, in-picture prediction may be disabled across slice boundaries.
Thus, slices can be regarded as a way to split a coded picture into independently decodable pieces, and slices are therefore often regarded as elementary units for transmission. In many cases, encoders may indicate in the bitstream which types of in-picture prediction are turned off across slice boundaries, and the decoder operation takes this information into account for example when concluding which prediction sources are available. For example, samples from a neighbouring macroblock or CU may be regarded as unavailable for intra prediction, if the neighbouring macroblock or CU resides in a different slice.
A syntax element may be defined as an element of data represented in the bitstream. A syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
The elementary unit for the output of an H.264/AVC or HEVC encoder and the input of an H.264/AVC or HEVC decoder, respectively, is a Network Abstraction Layer (NAL) unit. For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures. A bytestream format has been specified in H.264/AVC and HEVC for transmission or storage environments that do not provide framing structures. The bytestream format separates NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders run a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to, for example, enable straightforward gateway operation between packet- and stream-oriented systems, start code emulation prevention may always be performed regardless of whether the bytestream format is in use or not. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
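The byte-oriented start code emulation prevention described above may be illustrated with the following sketch, which inserts an emulation prevention byte (0x03) whenever two zero bytes would otherwise be followed by a byte value small enough to emulate a start code prefix; this is a simplified sketch, not a normative implementation.

    # Illustrative sketch: insert emulation prevention bytes into an RBSP so that the
    # resulting NAL unit payload cannot contain a start code prefix pattern.
    def add_emulation_prevention(rbsp: bytes) -> bytes:
        out = bytearray()
        zero_run = 0
        for b in rbsp:
            if zero_run >= 2 and b <= 0x03:
                out.append(0x03)                     # emulation prevention byte
                zero_run = 0
            out.append(b)
            zero_run = zero_run + 1 if b == 0x00 else 0
        return bytes(out)

    print(add_emulation_prevention(bytes([0x00, 0x00, 0x01, 0x25])).hex())   # 0000030125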
NAL units consist of a header and payload. In H.264/AVC and HEVC, the NAL unit header indicates the type of the NAL unit and whether a coded slice contained in the NAL unit is a part of a reference picture or a non-reference picture.
The H.264/AVC NAL unit header includes a 2-bit nal_ref_idc syntax element, which when equal to 0 indicates that a coded slice contained in the NAL unit is a part of a non-reference picture and when greater than 0 indicates that a coded slice contained in the NAL unit is a part of a reference picture. The header for SVC and MVC NAL units may additionally contain various indications related to the scalability and multiview hierarchy.
In a draft HEVC standard, a two-byte NAL unit header is used for all specified NAL unit types. The first byte of the NAL unit header contains one reserved bit, a one-bit indication nal_ref_flag primarily indicating whether the picture carried in this access unit is a reference picture or a non-reference picture, and a six-bit NAL unit type indication. The second byte of the NAL unit header includes a three-bit temporal_id indication for temporal level and a five-bit reserved field (called reserved_one_5bits) required to have a value equal to 1 in a draft HEVC standard. The temporal_id syntax element may be regarded as a temporal identifier for the NAL unit and the TemporalId variable may be defined to be equal to the value of temporal_id. The five-bit reserved field is expected to be used by extensions such as a future scalable and 3D video extension. It is expected that these five bits would carry information on the scalability hierarchy, such as quality_id or similar, dependency_id or similar, any other type of layer identifier, view order index or similar, view identifier, or an identifier similar to priority_id of SVC indicating a valid sub-bitstream extraction if all NAL units greater than a specific identifier value are removed from the bitstream. Without loss of generality, in some example embodiments a variable LayerId is derived from the value of reserved_one_5bits for example as follows: LayerId = reserved_one_5bits - 1.
In a later draft HEVC standard, a two-byte NAL unit header is used for all specified NAL unit types. The NAL unit header contains one reserved bit, a six-bit NAL unit type indication, a six-bit reserved field (called reserved_zero_6bits) and a three-bit temporal_id_plus1 indication for temporal level. The temporal_id_plus1 syntax element may be regarded as a temporal identifier for the NAL unit, and a zero-based TemporalId variable may be derived as follows: TemporalId = temporal_id_plus1 - 1. TemporalId equal to 0 corresponds to the lowest temporal level. The value of temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes. Without loss of generality, in some example embodiments a variable LayerId is derived from the value of reserved_zero_6bits for example as follows: LayerId = reserved_zero_6bits. In some designs for scalable extensions of HEVC, reserved_zero_6bits are replaced by a layer identifier field e.g. referred to as nuh_layer_id. In the following, LayerId, nuh_layer_id and layer_id are used interchangeably unless otherwise indicated.
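The two-byte NAL unit header layout of the later draft described above may be illustrated with the following parsing sketch; the function name and the returned dictionary are illustrative only.

    # Illustrative sketch: parse the two-byte NAL unit header of the later draft
    # (one reserved bit, six-bit nal_unit_type, six-bit reserved_zero_6bits /
    # layer identifier, three-bit temporal_id_plus1).
    def parse_nal_header(byte0: int, byte1: int):
        nal_unit_type = (byte0 >> 1) & 0x3F
        reserved_zero_6bits = ((byte0 & 0x01) << 5) | ((byte1 >> 3) & 0x1F)
        temporal_id_plus1 = byte1 & 0x07
        return {
            "nal_unit_type": nal_unit_type,
            "LayerId": reserved_zero_6bits,           # LayerId = reserved_zero_6bits
            "TemporalId": temporal_id_plus1 - 1,      # TemporalId = temporal_id_plus1 - 1
        }

    print(parse_nal_header(0x40, 0x01))               # nal_unit_type 32, LayerId 0, TemporalId 0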
It is expected that reserved_one_5bits, reserved_zero_6bits and/or similar syntax elements in the NAL unit header would carry information on the scalability hierarchy. For example, the LayerId value derived from reserved_one_5bits, reserved_zero_6bits and/or similar syntax elements may be mapped to values of variables or syntax elements describing different scalability dimensions, such as quality_id or similar, dependency_id or similar, any other type of layer identifier, view order index or similar, view identifier, an indication whether the NAL unit concerns depth or texture i.e. depth_flag or similar, or an identifier similar to priority_id of SVC indicating a valid sub-bitstream extraction if all NAL units greater than a specific identifier value are removed from the bitstream. reserved_one_5bits, reserved_zero_6bits and/or similar syntax elements may be partitioned into one or more syntax elements indicating scalability properties. For example, a certain number of bits among reserved_one_5bits, reserved_zero_6bits and/or similar syntax elements may be used for dependency_id or similar, while another certain number of bits among reserved_one_5bits, reserved_zero_6bits and/or similar syntax elements may be used for quality_id or similar. Alternatively, a mapping of LayerId values or similar to values of variables or syntax elements describing different scalability dimensions may be provided for example in a Video Parameter Set, a Sequence Parameter Set or another syntax structure. NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL NAL units are typically coded slice NAL units. In H.264/AVC, coded slice NAL units contain syntax elements representing one or more coded macroblocks, each of which corresponds to a block of samples in the uncompressed picture. In a draft HEVC standard, coded slice NAL units contain syntax elements representing one or more CUs.
In H.264/AVC, a coded slice NAL unit can be indicated to be a coded slice in an Instantaneous Decoding Refresh (IDR) picture or a coded slice in a non-IDR picture.
In a draft HEVC standard, a coded slice NAL unit can be indicated to be one of the following types.
nal_unit_type    Name of nal_unit_type                       Content of NAL unit and RBSP syntax structure
0, 1             TRAIL_N, TRAIL_R                            Coded slice segment of a non-TSA, non-STSA trailing picture; slice_segment_layer_rbsp( )
2, 3             TSA_N, TSA_R                                Coded slice segment of a TSA picture; slice_segment_layer_rbsp( )
4, 5             STSA_N, STSA_R                              Coded slice segment of an STSA picture; slice_segment_layer_rbsp( )
6, 7             RADL_N, RADL_R                              Coded slice segment of a RADL picture; slice_segment_layer_rbsp( )
8, 9             RASL_N, RASL_R                              Coded slice segment of a RASL picture; slice_segment_layer_rbsp( )
10, 12, 14       RSV_VCL_N10, RSV_VCL_N12, RSV_VCL_N14       Reserved // reserved non-RAP non-reference VCL NAL unit types
11, 13, 15       RSV_VCL_R11, RSV_VCL_R13, RSV_VCL_R15       Reserved // reserved non-RAP reference VCL NAL unit types
16, 17, 18       BLA_W_LP, BLA_W_DLP, BLA_N_LP               Coded slice segment of a BLA picture; slice_segment_layer_rbsp( )
19, 20           IDR_W_DLP, IDR_N_LP                         Coded slice segment of an IDR picture; slice_segment_layer_rbsp( )
21               CRA_NUT                                     Coded slice segment of a CRA picture; slice_segment_layer_rbsp( )
22, 23           RSV_RAP_VCL22, RSV_RAP_VCL23                Reserved // reserved RAP VCL NAL unit types
24..31           RSV_VCL24..RSV_VCL31                        Reserved // reserved non-RAP VCL NAL unit types

In a draft HEVC standard, abbreviations for picture types may be defined as follows: trailing (TRAIL) picture, Temporal Sub-layer Access (TSA), Step-wise Temporal Sub-layer Access (STSA), Random Access Decodable Leading (RADL) picture, Random Access Skipped Leading (RASL) picture, Broken Link Access (BLA) picture, Instantaneous Decoding Refresh (IDR) picture, Clean Random Access (CRA) picture.
A Random Access Point (RAP) picture is a picture where each slice or slice segment has nal_unit_type in the range of 16 to 23, inclusive. A RAP picture contains only intra-coded slices, and may be a BLA picture, a CRA picture or an IDR picture. The first picture in the bitstream is a RAP picture. Provided the necessary parameter sets are available when they need to be activated, the RAP picture and all subsequent non-RASL pictures in decoding order can be correctly decoded without performing the decoding process of any pictures that precede the RAP picture in decoding order. There may be pictures in a bitstream that contain only intra-coded slices that are not RAP pictures.
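A minimal sketch of identifying RAP pictures from the nal_unit_type range given above is shown below; the function name is illustrative only.

    # Illustrative sketch: a RAP picture has nal_unit_type in the range 16..23,
    # covering the BLA, IDR and CRA types listed in the table above.
    def is_rap_nal_unit_type(nal_unit_type: int) -> bool:
        return 16 <= nal_unit_type <= 23

    print(is_rap_nal_unit_type(21))    # True, CRA_NUT
    print(is_rap_nal_unit_type(1))     # False, TRAIL_R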
In HEVC a CRA picture may be the first picture in the bitstream in decoding order, or may appear later in the bitstream. CRA pictures in HEVC allow so-called leading pictures that follow the CRA picture in decoding order but precede it in output order.
Some of the leading pictures, so-called RASL pictures, may use pictures decoded before the CRA picture as a reference. Pictures that follow a CRA picture in both decoding and output order are decodable if random access is performed at the CRA picture, and hence clean random access is achieved similarly to the clean random access functionality of an IDR picture.
A CRA picture may have associated RADL or RASL pictures. When a CRA picture is the first picture in the bitstream in decoding order, the CRA picture is the first picture of a coded video sequence in decoding order, and any associated RASL pictures are not output by the decoder and may not be decodable, as they may contain references to pictures that are not present in the bitstream.
A leading picture is a picture that precedes the associated RAP picture in output order.
The associated RAP picture is the previous RAP picture in decoding order (if present).
A leading picture may either be a RADL picture or a RASL picture.
All RASL pictures are leading pictures of an associated BLA or CRA picture. When the associated RAP picture is a BLA picture or is the first coded picture in the bitstream, the RASL picture is not output and may not be correctly decodable, as the RASL picture may contain references to pictures that are not present in the bitstream. However, a RASL picture can be correctly decoded if the decoding had started from a RAP picture before the associated RAP picture of the RASL picture. RASL pictures are not used as reference pictures for the decoding process of non-RASL pictures. When present, all RASL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture. In some earlier drafts of the HEVC standard, a RASL picture was referred to as a Tagged for Discard (TFD) picture.
All RADL pictures are leading pictures. RADL pictures are not used as reference pictures for the decoding process of trailing pictures of the same associated RAP picture. When present, all RADL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture. RADL pictures do not refer to any picture preceding the associated RAP picture in decoding order and can therefore be correctly decoded when the decoding starts from the associated RAP picture. In some earlier drafts of the HEVC standard, a RADL picture was referred to as a Decodable Leading Picture (DLP).
Decodable leading pictures may be such that can be correctly decoded when the decoding is started from the CRA picture. In other words, decodable leading pictures use only the initial CRA picture or subsequent pictures in decoding order as reference in inter prediction. Non-decodable leading pictures are such that cannot be correctly decoded when the decoding is started from the initial CRA picture. In other words, non-decodable leading pictures use pictures prior, in decoding order, to the initial CRA picture as references in inter prediction.
When a part of a bitstream starting from a CRA picture is included in another bitstream, the RASL pictures associated with the CRA picture might not be correctly decodable, because some of their reference pictures might not be present in the combined bitstream. To make such a splicing operation straightforward, the NAL unit type of the CRA picture can be changed to indicate that it is a BLA picture. The RASL pictures associated with a BLA picture may not be correctly decodable and hence are not output/displayed. Furthermore, the RASL pictures associated with a BLA picture may be omitted from decoding.
A BLA picture may be the first picture in the bitstream in decoding order, or may appear later in the bitstream. Each BLA picture begins a new coded video sequence, and has a similar effect on the decoding process as an IDR picture. However, a BLA picture contains syntax elements that specify a non-empty reference picture set. When a BLA picture has nal_unit_type equal to BLA_W_LP, it may have associated RASL pictures, which are not output by the decoder and may not be decodable, as they may contain references to pictures that are not present in the bitstream. When a BLA picture has nal_unit_type equal to BLA_W_LP, it may also have associated RADL pictures, which are specified to be decoded. When a BLA picture has nal_unit_type equal to BLA_W_DLP, it does not have associated RASL pictures but may have associated RADL pictures, which are specified to be decoded. BLA_W_DLP may also be referred to as BLA_W_RADL. When a BLA picture has nal_unit_type equal to BLA_N_LP, it does not have any associated leading pictures.
An IDR picture having nal_unit_type equal to IDR_N_LP does not have associated leading pictures present in the bitstream. An IDR picture having nal_unit_type equal to IDR_W_DLP does not have associated RASL pictures present in the bitstream, but may have associated RADL pictures in the bitstream. IDR_W_DLP may also be referred to as IDR_W_RADL.
When the value of nal_unit_type is equal to TRAIL_N, TSA_N, STSA_N, RADL_N, RASL_N, RSV_VCL_N10, RSV_VCL_N12, or RSV_VCL_N14, the decoded picture is not used as a reference for any other picture of the same temporal sub-layer. That is, in a draft HEVC standard, when the value of nal_unit_type is equal to TRAIL_N, TSA_N, STSA_N, RADL_N, RASL_N, RSV_VCL_N10, RSV_VCL_N12, or RSV_VCL_N14, the decoded picture is not included in any of RefPicSetStCurrBefore, RefPicSetStCurrAfter and RefPicSetLtCurr of any picture with the same value of TemporalId. A coded picture with nal_unit_type equal to TRAIL_N, TSA_N, STSA_N, RADL_N, RASL_N, RSV_VCL_N10, RSV_VCL_N12, or RSV_VCL_N14 may be discarded without affecting the decodability of other pictures with the same value of TemporalId.
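The discardability of sub-layer non-reference pictures described above may be illustrated as follows; the set of NAL unit type values corresponds to the "_N" types in the table above, and the function name is illustrative only.

    # Illustrative sketch: sub-layer non-reference picture types may be dropped without
    # affecting other pictures of the same TemporalId.
    SUB_LAYER_NON_REF_TYPES = {0, 2, 4, 6, 8, 10, 12, 14}   # TRAIL_N, TSA_N, STSA_N, RADL_N, RASL_N, RSV_VCL_N10/N12/N14

    def can_discard_without_affecting_sub_layer(nal_unit_type: int) -> bool:
        return nal_unit_type in SUB_LAYER_NON_REF_TYPES

    print(can_discard_without_affecting_sub_layer(8))   # True, RASL_N
    print(can_discard_without_affecting_sub_layer(1))   # False, TRAIL_R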
A trailing picture may be defined as a picture that follows the associated RAP picture in output order. Any picture that is a trailing picture does not have nal_unit_type equal to RADL_N, RADL_R, RASL_N or RASL_R. Any picture that is a leading picture may be constrained to precede, in decoding order, all trailing pictures that are associated with the same RAP picture. No RASL pictures are present in the bitstream that are associated with a BLA picture having nal_unit_type equal to BLA_W_DLP or BLA_N_LP. No RADL pictures are present in the bitstream that are associated with a BLA picture having nal_unit_type equal to BLA_N_LP or that are associated with an IDR picture having nal_unit_type equal to IDR_N_LP. Any RASL picture associated with a CRA or BLA picture may be constrained to precede any RADL picture associated with the CRA or BLA picture in output order. Any RASL picture associated with a CRA picture may be constrained to follow, in output order, any other RAP picture that precedes the CRA picture in decoding order.
Another means of describing picture types of a draft HEVC standard is provided next.
As illustrated in the table below, picture types can be classified into the following groups in HEVC: a) random access point (RAP) pictures, b) leading pictures, c) sub-layer access pictures, and d) pictures that do not fall into the three mentioned groups. The picture types and their sub-types as described in the table below are identified by the NAL unit type in HEVC. RAP picture types include IDR picture, BLA picture, and CRA picture, and can further be characterized based on the leading pictures associated with them as indicated in the table below.

a) Random access point pictures
IDR     Instantaneous decoding refresh          without associated leading pictures
                                                may have associated leading pictures
BLA     Broken link access                      without associated leading pictures
                                                may have associated DLP pictures but without associated TFD pictures
                                                may have associated DLP and TFD pictures
CRA     Clean random access                     may have associated leading pictures

b) Leading pictures
DLP     Decodable leading picture
TFD     Tagged for discard

c) Temporal sub-layer access pictures
TSA     Temporal sub-layer access               not used for reference in the same sub-layer
                                                may be used for reference in the same sub-layer
STSA    Step-wise temporal sub-layer access     not used for reference in the same sub-layer
                                                may be used for reference in the same sub-layer

d) Picture that is not a RAP, leading or temporal sub-layer access picture
                                                not used for reference in the same sub-layer
                                                may be used for reference in the same sub-layer

CRA pictures in HEVC allow pictures that follow the CRA picture in decoding order but precede it in output order to use pictures decoded before the CRA picture as a reference and still allow similar clean random access functionality as an IDR picture.
Pictures that follow a CRA picture in both decoding and output order are decodable if random access is performed at the CRA picture, and hence clean random access is achieved. Leading pictures of a CRA picture that do not refer to any picture preceding the CRA picture in decoding order can be correctly decoded when the decoding starts from the CRA picture and are therefore DLP pictures. In contrast, a TFD picture cannot be correctly decoded when decoding starts from a CRA picture associated with the TFD picture (while the TFD picture could be correctly decoded if the decoding had started from a RAP picture before the current CRA picture). Hence, TFD pictures associated with a CRA may be discarded when the decoding starts from the CRA picture.
When a part of a bitstream starting from a CRA picture is included in another bitstream, the TFD pictures associated with the CRA picture cannot be decoded, because some of their reference pictures are not present in the combined bitstream. To make such a splicing operation straightforward, the NAL unit type of the CRA picture can be changed to indicate that it is a BLA picture. The TFD pictures associated with a BLA picture may not be correctly decodable and hence should not be output/displayed. The TFD pictures associated with a BLA picture may be omitted from decoding.
In HEVC there are two picture types, the TSA and STSA picture types, that can be used to indicate temporal sub-layer switching points. If temporal sub-layers with TemporalId up to N had been decoded until the TSA or STSA picture (exclusive) and the TSA or STSA picture has TemporalId equal to N+1, the TSA or STSA picture enables decoding of all subsequent pictures (in decoding order) having TemporalId equal to N+1. The TSA picture type may impose restrictions on the TSA picture itself and all pictures in the same sub-layer that follow the TSA picture in decoding order. None of these pictures is allowed to use inter prediction from any picture in the same sub-layer that precedes the TSA picture in decoding order. The TSA definition may further impose restrictions on the pictures in higher sub-layers that follow the TSA picture in decoding order. None of these pictures is allowed to refer to a picture that precedes the TSA picture in decoding order if that picture belongs to the same or a higher sub-layer as the TSA picture. TSA pictures have TemporalId greater than 0. The STSA is similar to the TSA picture but does not impose restrictions on the pictures in higher sub-layers that follow the STSA picture in decoding order and hence enables up-switching only onto the sub-layer where the STSA picture resides.
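The temporal sub-layer up-switching behaviour of TSA pictures described above may be illustrated with the following sketch; the function tracks the highest decodable TemporalId and is an illustrative simplification that ignores STSA-specific behaviour.

    # Illustrative sketch: up-switching at a TSA picture. After decoding sub-layers up to
    # TemporalId N, a TSA picture (nal_unit_type 2 or 3) with TemporalId N+1 allows
    # decoding of subsequent pictures having TemporalId N+1.
    def update_highest_decodable_tid(current_highest_tid, nal_unit_type, temporal_id):
        is_tsa = nal_unit_type in (2, 3)                 # TSA_N, TSA_R
        if is_tsa and temporal_id == current_highest_tid + 1:
            return temporal_id                           # up-switch to the next sub-layer
        return current_highest_tid

    print(update_highest_decodable_tid(0, 2, 1))         # 1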
In scalable and/or multiview video coding, at least the following principles for encoding pictures and/or access units with random access property may be supported.
A RAP picture within a layer may be an intra-coded picture without inter-layer/inter-view prediction. Such a picture enables random access capability to the layer/view it resides in.
A RAP picture within an enhancement layer may be a picture without inter prediction (i.e. temporal prediction) but with inter-layer/inter-view prediction allowed.
Such a picture enables starting the decoding of the layer/view the picture resides in, provided that all the reference layers/views are available. In single-loop decoding, it may be sufficient if the coded reference layers/views are available (which can be the case e.g. for IDR pictures having dependency_id greater than 0 in SVC). In multi-loop decoding, it may be needed that the reference layers/views are decoded. Such a picture may, for example, be referred to as a stepwise layer access (STLA) picture or an enhancement layer RAP picture.
An anchor access unit or a complete RAP access unit may be defined to include only intra-coded picture(s) and STLA pictures in all layers. In multi-loop decoding, such an access unit enables random access to all layers/views. An example of such an access unit is the MVC anchor access unit (among which type the IDR access unit is a special case).
A stepwise RAP access unit may be defined to include a RAP picture in the base layer but need not contain a RAP picture in all enhancement layers. A stepwise RAP access unit enables starting of base-layer decoding, while enhancement layer decoding may be started when the enhancement layer contains a RAP picture, and (in the case of multi-loop decoding) all its reference layers/views are decoded at that point.
In a scalable extension of HEVC or any scalable extension for a single-layer coding scheme similar to HEVC, RAP pictures may be specified to have one or more of the following properties.
NAL unit type values of the RAP pictures with nuh_layer_id greater than 0 may be used to indicate enhancement layer random access points.
An enhancement layer RAP picture may be defined as a picture that enables starting the decoding of that enhancement layer when all its reference layers have been decoded prior to the EL RAP picture.
Inter-layer prediction may be allowed for CRA NAL units with nuh_layer_id greater than 0, while inter prediction is disallowed. CRA NAL units need not be aligned across layers. In other words, a CRA NAL unit type can be used for all VCL NAL units with a particular value of nuh_layer_id while another NAL unit type can be used for all VCL NAL units with another particular value of nuh_layer_id in the same access unit.
BLA pictures have nuh_layer_id equal to 0.
IDR pictures may have nuh_layer_id greater than 0 and they may be inter-layer predicted while inter prediction is disallowed.
IDR pictures are present in an access unit either in no layers or in all layers, i.e. an IDR nal_unit_type indicates a complete IDR access unit where decoding of all layers can be started.
An STLA picture (STLA_W_DLP and STLA_N_LP) may be indicated with NAL unit types BLA_W_DLP and BLA_N_LP, respectively, with nuh_layer_id greater than 0.
An STLA picture may be otherwise identical to an IDR picture with nuh_layer_id greater than 0 but need not be aligned across layers.
After a BLA picture at the base layer, the decoding of an enhancement layer is started when the enhancement layer contains a RAP picture and the decoding of all of its reference layers has been started.
When the decoding of an enhancement layer starts from a CRA picture, its RASL pictures are handled similarly to RASL pictures of a BLA picture.
Layer down-switching or unintentional loss of reference pictures is identified from missing reference pictures, in which case the decoding of the related enhancement layer continues only from the next RAP picture on that enhancement layer.
A non-VCL NAL unit may be for example one of the following types: a sequence parameter set, a picture parameter set, a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of stream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the reconstruction of decoded sample values.
Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set. In addition to the parameters that may be needed by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation. There are three NAL units specified in H.264/AVC to carry sequence parameter sets: the sequence parameter set NAL unit (having NAL unit type equal to 7) containing all the data for H.264/AVC VCL NAL units in the sequence, the sequence parameter set extension NAL unit containing the data for auxiliary coded pictures, and the subset sequence parameter set for MVC and SVC VCL NAL units. The syntax structure included in the sequence parameter set NAL unit of H.264/AVC (having NAL unit type equal to 7) may be referred to as sequence parameter set data, seq_parameter_set_data, or base SPS data. For example, profile, level, the picture size and the chroma sampling format may be included in the base SPS data. A picture parameter set contains such parameters that are likely to be unchanged in several coded pictures.
In a draft HEVC, there is also another type of a parameter set, here referred to as an Adaptation Parameter Set (APS), which includes parameters that are likely to be unchanged in several coded slices but may change for example for each picture or each few pictures. In a draft HEVC, the APS syntax structure includes parameters or syntax elements related to quantization matrices (QM), sample adaptive offset (SAO), adaptive loop filtering (ALF), and deblocking filtering. In a draft HEVC, an APS is a NAL unit and coded without reference or prediction from any other NAL unit. An identifier, referred to as the aps_id syntax element, is included in the APS NAL unit, and included and used in the slice header to refer to a particular APS.
A draft HEVC standard also includes yet another type of a parameter set, called a video parameter set (VPS). A video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
The relationship and hierarchy between VPS, SPS, and PPS may be described as follows. VPS resides one level above SPS in the parameter set hierarchy and in the context of scalability and/or 3DV. VPS may include parameters that are common for all slices across all (scalability or view) layers in the entire coded video sequence. SPS includes the parameters that are common for all slices in a particular (scalability or view) layer in the entire coded video sequence, and may be shared by multiple (scalability or view) layers. PPS includes the parameters that are common for all slices in a particular layer representation (the representation of one scalability or view layer in one access unit) and are likely to be shared by all slices in multiple layer representations.
VPS may provide information about the dependency relationships of the layers in a bitstream, as well as much other information that is applicable to all slices across all (scalability or view) layers in the entire coded video sequence. In a scalable extension of HEVC, VPS may for example include a mapping of the LayerId value derived from the NAL unit header to one or more scalability dimension values, for example corresponding to dependency_id, quality_id, view_id, and depth_flag for the layer defined similarly to SVC and MVC. VPS may include profile and level information for one or more layers as well as the profile and/or level for one or more temporal sub-layers (consisting of VCL NAL units at and below certain TemporalId values) of a layer representation.
An example syntax of a VPS extension intended to be a part of the VPS is provided in the following. The presented VPS extension provides the dependency relationships among other things.
vps_extension( ) {                                                                    Descriptor
    while( !byte_aligned( ) )
        vps_extension_byte_alignment_reserved_one_bit                                 u(1)
    avc_base_layer_flag                                                               u(1)
    splitting_flag                                                                    u(1)
    for( i = 0, NumScalabilityTypes = 0; i < 16; i++ ) {
        scalability_mask[ i ]                                                         u(1)
        NumScalabilityTypes += scalability_mask[ i ]
    }
    for( j = 0; j < NumScalabilityTypes; j++ )
        dimension_id_len_minus1[ j ]                                                  u(3)
    vps_nuh_layer_id_present_flag                                                     u(1)
    for( i = 1; i <= vps_max_layers_minus1; i++ ) {
        if( vps_nuh_layer_id_present_flag )
            layer_id_in_nuh[ i ]                                                      u(6)
        for( j = 0; j < NumScalabilityTypes; j++ )
            dimension_id[ i ][ j ]                                                    u(v)
    }
    for( i = 1; i <= vps_num_op_sets_minus1; i++ ) {
        vps_profile_present_flag[ i ]                                                 u(1)
        if( !vps_profile_present_flag[ i ] )
            profile_op_ref_minus1[ i ]                                                ue(v)
        profile_tier_level( vps_profile_present_flag[ i ], vps_max_sub_layers_minus1 )
    }
    num_output_operation_points                                                       ue(v)
    for( i = 0; i < num_output_operation_points; i++ ) {
        output_op_point_index[ i ]                                                    ue(v)
        for( j = 0; j <= vps_max_nuh_reserved_zero_layer_id; j++ )
            output_layer_flag[ output_op_point_index[ i ] ][ j ]                      u(1)
    }
    for( i = 1; i <= vps_max_layers_minus1; i++ )
        for( j = 0; j < i; j++ )
            direct_dependency_flag[ i ][ j ]                                          u(1)
}

The semantics of the presented VPS extension may be specified as described in the following paragraphs.
vps_extension_byte_alignment_reserved_one_bit is equal to 1 and is used to achieve alignment of the next syntax element to a byte boundary. avc_base_layer_flag equal to 1 specifies that the base layer conforms to H.264/AVC, and equal to 0 specifies that it conforms to this specification. The semantics of avc_base_layer_flag may be further specified as follows. When avc_base_layer_flag is equal to 1, in an H.264/AVC conforming base layer, after applying the H.264/AVC decoding process for reference picture lists construction the output reference picture lists RefPicList0 and RefPicList1 (when applicable) do not contain any pictures for which the TemporalId is greater than the TemporalId of the coded picture. All sub-bitstreams of the H.264/AVC conforming base layer that can be derived using the sub-bitstream extraction process as specified in H.264/AVC Annex G with any value for temporal_id as the input result in a set of coded video sequences, with each coded video sequence conforming to one or more of the profiles specified in H.264/AVC Annexes A, G and H. splitting_flag equal to 1 indicates that the bits of the nuh_layer_id syntax element in the NAL unit header are split into n segments with a length, in bits, according to the values of the dimension_id_len_minus1[ j ] syntax element and that the n segments are associated with the n scalability dimensions indicated in scalability_mask[ i ]. When splitting_flag is equal to 1, the value of the j-th segment of the nuh_layer_id of the i-th layer shall be equal to the value of dimension_id[ i ][ j ]. splitting_flag equal to 0 does not indicate the above constraint. When splitting_flag is equal to 1, i.e. the restriction reported in the semantics of the dimension_id[ i ][ j ] syntax element is obeyed, scalable identifiers can be derived from the nuh_layer_id syntax element in the NAL unit header by a bit masked copy as an alternative to the derivation as reported in the semantics of the dimension_id[ i ][ j ] syntax element. The respective bit mask for the i-th scalable dimension is defined by the value of the dimension_id_len_minus1[ i ] syntax element and dimBitOffset[ i ] as specified in the semantics of dimension_id_len_minus1[ j ].
scalability_mask[ i ] equal to 1 indicates that dimension_id syntax elements corresponding to the i-th scalability dimension are present. scalability_mask[ i ] equal to 0 indicates that dimension_id syntax elements corresponding to the i-th scalability dimension are not present. The scalability dimensions corresponding to each value of i in scalability_mask[ i ] may be specified for example to include the following or any subset thereof along with other scalability dimensions.
scalability_mask index    Scalability dimension                                    ScalabilityId mapping
0                         multiview                                                ViewId
1                         reference index based spatial or quality scalability     DependencyId
2                         depth                                                    DepthFlag
3                         TextureBL based spatial or quality scalability           TextureBLDepId

dimension_id_len_minus1[ j ] plus 1 specifies the length, in bits, of the dimension_id[ i ][ j ] syntax element. The variable dimBitOffset[ j ] is derived as follows. The variable dimBitOffset[ 0 ] is set to 0. dimBitOffset[ j ] is derived to be equal to the cumulative sum, for dimIdx in the range of 0 to j - 1, inclusive, of ( dimension_id_len_minus1[ dimIdx ] + 1 ).
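The derivation of dimBitOffset described above may be illustrated with the following sketch, which accumulates the segment lengths ( dimension_id_len_minus1[ dimIdx ] + 1 ) of the preceding scalability dimensions; the function name and the example values are illustrative only.

    # Illustrative sketch: dimBitOffset[ j ] as a cumulative sum of the bit lengths of
    # the preceding scalability dimension segments.
    def dim_bit_offsets(dimension_id_len_minus1):
        offsets = [0]                                  # dimBitOffset[ 0 ] = 0
        for length_minus1 in dimension_id_len_minus1[:-1]:
            offsets.append(offsets[-1] + length_minus1 + 1)
        return offsets

    # Example with three scalability dimensions of 3, 1 and 2 bits:
    print(dim_bit_offsets([2, 0, 1]))                  # [0, 3, 4]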
vps_nuh_layer_id_present_flag specifies whether the layer_id_in_nuh[ i ] syntax is present. layer_id_in_nuh[ i ] specifies the value of the nuh_layer_id syntax element in VCL NAL units of the i-th layer. When not present, the value of layer_id_in_nuh[ i ] is inferred to be equal to i. layer_id_in_nuh[ i ] is greater than layer_id_in_nuh[ i - 1 ]. The variable LayerIdInVps[ layer_id_in_nuh[ i ] ] is set equal to i.
dimension_id[ i ][ j ] specifies the identifier of the j-th scalability dimension type of the i-th layer. When not present, the value of dimension_id[ i ][ j ] is inferred to be equal to 0.
The number of bits used for the representation of dimension_id[ i ][ j ] is dimension_id_len_minus1[ j ] + 1 bits.
dimension_id[ i ][ j ] and scalability_mask[ i ] may be used to derive variables associating scalability dimension values to layers. For example, the variables ScalabilityId[ layerIdInVps ][ scalabilityMaskIndex ] and ViewId[ layerIdInNuh ] may be derived as follows:

for( i = 0; i <= vps_max_layers_minus1; i++ ) {
    for( smIdx = 0, j = 0; smIdx < 16; smIdx++ )
        if( ( i != 0 ) && scalability_mask[ smIdx ] )
            ScalabilityId[ i ][ smIdx ] = dimension_id[ i ][ j++ ]
        else
            ScalabilityId[ i ][ smIdx ] = 0
    ViewId[ layer_id_in_nuh[ i ] ] = ScalabilityId[ i ][ 0 ]
}

Similarly, variables DependencyId[ layerIdInNuh ], DepthFlag[ layerIdInNuh ], and TextureBLDepId[ layerIdInNuh ] may be derived e.g. as follows:

for( i = 0; i <= vps_max_layers_minus1; i++ ) {
    for( smIdx = 0, j = 0; smIdx < 16; smIdx++ )
        if( ( i != 0 ) && scalability_mask[ smIdx ] )
            ScalabilityId[ i ][ smIdx ] = dimension_id[ i ][ j++ ]
        else
            ScalabilityId[ i ][ smIdx ] = 0
    DependencyId[ layer_id_in_nuh[ i ] ] = ScalabilityId[ i ][ 1 ]
    DepthFlag[ layer_id_in_nuh[ i ] ] = ScalabilityId[ i ][ 2 ]
    TextureBLDepId[ layer_id_in_nuh[ i ] ] = ScalabilityId[ i ][ 3 ]
}

vps_profile_present_flag[ i ] equal to 1 specifies that the profile and tier information for operation point i is present in the profile_tier_level( ) syntax structure.
vps_profile_present_flag[ i ] equal to 0 specifies that profile and tier information for operation point i is not present in the profile_tier_level( ) syntax structure and is inferred.
profile_op_ref_minus1[ i ] indicates that the profile and tier information for the i-th operation point is inferred to be equal to the profile and tier information from the ( profile_op_ref_minus1[ i ] + 1 )-th operation point.
num_output_operation_points specifies the number of operation points for which output layers are specified with output_op_point_index[ i ] and output_layer_flag. When not present, the value of num_output_operation_points is inferred to be equal to 0.
output_op_point_index[ i ] identifies the operation point to which output_layer_flag[ output_op_point_index[ i ] ][ j ] applies.
output_layer_flag[ output_op_point_index[ i ] ][ j ] equal to 1 specifies that the layer with nuh_layer_id equal to j is a target output layer of the operation point identified by output_op_point_index[ i ]. output_layer_flag[ output_op_point_index[ i ] ][ j ] equal to 0 specifies that the layer with nuh_layer_id equal to j is not a target output layer of the operation point identified by output_op_point_index[ i ].
For each operation point index j not equal to output_op_point_index[ i ] for any value of i in the range 0 to num_output_operation_points - 1, inclusive, let highestLayerId be the greatest value of nuh_layer_id within the operation point of index j. output_layer_flag[ j ][ k ] is inferred to be equal to 0 for all values of k in the range of 0 to 63, inclusive, unequal to highestLayerId. output_layer_flag[ j ][ highestLayerId ] is inferred to be equal to 1.
In other words, when an operation point is not included among those indicated by output_op_point_index[ i ], the layer with the greatest value of nuh_layer_id within the operation point is the only target output layer of the operation point.
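The inference described above may be illustrated with the following sketch, in which only the layer with the greatest nuh_layer_id of an unlisted operation point is inferred to be a target output layer; the function name and inputs are illustrative only.

    # Illustrative sketch: infer output_layer_flag values for an operation point that is
    # not listed by output_op_point_index.
    def infer_output_layer_flags(layer_ids_in_op_point):
        highest_layer_id = max(layer_ids_in_op_point)
        # One flag per possible nuh_layer_id value (0..63); only the highest layer is output.
        return {layer_id: (layer_id == highest_layer_id) for layer_id in range(64)}

    flags = infer_output_layer_flags([0, 3, 5])
    print(flags[5], flags[3], flags[0])    # True False False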
direct_dependency_flag[ i ][ j ] equal to 0 specifies that the layer with index j is not a direct reference layer for the layer with index i. direct_dependency_flag[ i ][ j ] equal to 1 specifies that the layer with index j may be a direct reference layer for the layer with index i. When direct_dependency_flag[ i ][ j ] is not present for i and j in the range of 0 to vps_max_layers_minus1, it is inferred to be equal to 0.
The variables NumDirectRefLayers[ i ] and RefLayerId[ i ][ j ] are derived as follows:

for( i = 1; i <= vps_max_layers_minus1; i++ )
    for( j = 0, NumDirectRefLayers[ i ] = 0; j < i; j++ )
        if( direct_dependency_flag[ i ][ j ] = = 1 )
            RefLayerId[ i ][ NumDirectRefLayers[ i ]++ ] = layer_id_in_nuh[ j ]

H.264/AVC and HEVC syntax allows many instances of parameter sets, and each instance is identified with a unique identifier. In order to limit the memory usage needed for parameter sets, the value range for parameter set identifiers has been limited. In H.264/AVC and a draft HEVC standard, each slice header includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice, and each picture parameter set contains the identifier of the active sequence parameter set. In a HEVC standard, a slice header additionally contains an APS identifier.
Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices. Instead, it is sufficient that the active sequence and picture parameter sets are received at any moment before they are referenced, which allows transmission of parameter sets "out-of-band" using a more reliable transmission mechanism compared to the protocols used for the slice data. For example, parameter sets can be included as a parameter in the session description for Real-time Transport Protocol (RTP) sessions. If parameter sets are transmitted In-band, they can be repeated to improve error robustness.
A parameter set may be activated by a reference from a slice or from another active parameter set or in some cases from another syntax structure such as a buffering period SEt message. In the following, non-limiting examples of activation of parameter sets in a draft HEVC standard are given.
Each adaptation parameter set RBSP is Initially considered not active at the start of the operation of the decoding process. At most one adaptation parameter set RBSP is considered active at any given moment during the operation of the decoding process, and the activation of any particular adaptation parameter set RBSP results in the deactivation of the previously-active adaptation parameter set RBSP (ii any).
When an adaptation parameter set RBSP (with a particular value of aps_id) is not active and it is referred to by a coded slice NAL unit (using that value of aps_id), it is activated.
This adaptation parameter set RBSP is called the active adaptation parameter set RBSP until it is deactivated by the activation of another adaptation parameter set RBSP.
An adaptation parameter set RBSP, with that particular value of aps_id, is available to the decoding process prior to its activation, included in at least one access unit with temporal_id equal to or less than the temporal_id of the adaptation parameter set NAL unit, unless the adaptation parameter set is provided through external means.
Each picture parameter set RBSP is initially considered not active at the start of the operation of the decoding process. At most one picture parameter set RBSP is considered active at any given moment during the operation of the decoding process, and the activation of any particular picture parameter set RBSP results in the deactivation of the previously-active picture parameter set RBSP (if any).
When a picture parameter set RBSP (with a particular value of pic_parameter_set_id) is not active and it is referred to by a coded slice NAL unit or coded slice data partition A NAL unit (using that value of pic_parameter_set_id), it is activated. This picture parameter set RBSP is called the active picture parameter set RBSP until it is deactivated by the activation of another picture parameter set RBSP. A picture parameter set RBSP, with that particular value of pic_parameter_set_id, is available to the decoding process prior to its activation, included in at least one access unit with temporal_id equal to or less than the temporal_id of the picture parameter set NAL unit, unless the picture parameter set is provided through external means.
Each sequence parameter set RBSP is initially considered not active at the start of the operation of the decoding process. At most one sequence parameter set RBSP is considered active at any given moment during the operation of the decoding process, and the activation of any particular sequence parameter set RBSP results in the deactivation of the previously-active sequence parameter set RBSP (if any).
When a sequence parameter set RBSP (with a particular value of seq_parameter_set_id) is not already active and it is referred to by activation of a picture parameter set RBSP (using that value of seq_parameter_set_id) or is referred to by an SEI NAL unit containing a buffering period SEI message (using that value of seq_parameter_set_id), it is activated. This sequence parameter set RBSP is called the active sequence parameter set RBSP until it is deactivated by the activation of another sequence parameter set RBSP. A sequence parameter set RBSP, with that particular value of seq_parameter_set_id, is available to the decoding process prior to its activation, included in at least one access unit with temporal_id equal to 0, unless the sequence parameter set is provided through external means. An activated sequence parameter set RBSP remains active for the entire coded video sequence.
Each video parameter set RBSP is initially considered not active at the start of the operation of the decoding process. At most one video parameter set RBSP is considered active at any given moment during the operation of the decoding process, and the activation of any particular video parameter set RBSP results in the deactivation of the previously-active video parameter set RBSP (if any).
When a video parameter set RBSP (with a particular value of video_parameter_set_id) is not already active and it is referred to by activation of a sequence parameter set RBSP (using that value of video_parameter_set_id), it is activated. This video parameter set RBSP is called the active video parameter set RBSP until it is deactivated by the activation of another video parameter set RBSP. A video parameter set RBSP, with that particular value of video_parameter_set_id, is available to the decoding process prior to its activation, included in at least one access unit with temporal_id equal to 0, unless the video parameter set is provided through external means. An activated video parameter set RBSP remains active for the entire coded video sequence.
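As an illustration of the activation chain described above (a slice activates a picture parameter set, which activates a sequence parameter set, which in turn activates a video parameter set), the following Python sketch models parameter sets as plain dictionaries. The class and identifiers are illustrative assumptions, not part of any standard decoder API.

# Minimal sketch of the parameter-set activation chain described above.
# The data structures and names are illustrative, not defined by HEVC.
class ParameterSets:
    def __init__(self):
        self.pps = {}   # pic_parameter_set_id -> dict holding 'sps_id'
        self.sps = {}   # seq_parameter_set_id -> dict holding 'vps_id'
        self.vps = {}   # video_parameter_set_id -> dict
        self.active = {'pps': None, 'sps': None, 'vps': None}

    def activate_from_slice(self, pps_id):
        """Activating a PPS from a slice transitively activates its SPS and VPS."""
        pps = self.pps[pps_id]            # must have been received before it is referenced
        sps = self.sps[pps['sps_id']]
        vps = self.vps[sps['vps_id']]
        # Activation of a new set deactivates the previously active one of the same type.
        self.active = {'pps': pps_id, 'sps': pps['sps_id'], 'vps': sps['vps_id']}
        return self.active

ps = ParameterSets()
ps.vps[0] = {}
ps.sps[0] = {'vps_id': 0}
ps.pps[3] = {'sps_id': 0}
print(ps.activate_from_slice(3))          # {'pps': 3, 'sps': 0, 'vps': 0}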
During operation of the decoding process in a draft HEVC standard, the values of parameters of the active video parameter set, the active sequence parameter set, the active picture parameter set RBSP and the active adaptation parameter set RBSP are considered in effect. For interpretation of SEI messages, the values of the active video parameter set, the active sequence parameter set, the active picture parameter set RBSP and the active adaptation parameter set RBSP for the operation of the decoding process for the VCL NAL units of the coded picture in the same access unit are considered in effect unless otherwise specified in the SEI message semantics.
An SEI NAL unit may contain one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC and HEVC, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use.
H.264/AVC and HEVC contain the syntax and semantics for the specified SEI messages, but no process for handling the messages in the recipient is defined.
Consequently, encoders are required to follow the H.264/AVC standard or the HEVC standard when they create SEI messages, and decoders conforming to the H.264/AVC standard or the HEVC standard, respectively, are not required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in H.264/AVC and HEVC is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
A coded picture is a coded representation of a picture. A coded picture in H.264/AVC comprises the VCL NAL units that are required for the decoding of the picture. In H.264/AVC, a coded picture can be a primary coded picture or a redundant coded picture. A primary coded picture is used in the decoding process of valid bitstreams, whereas a redundant coded picture is a redundant representation that should only be decoded when the primary coded picture cannot be successfully decoded. In a draft HEVC, no redundant coded picture has been specified.
In H.264/AVC, an access unit comprises a primary coded picture and those NAL units that are associated with it. In HEVC, an access unit is defined as a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture. In H.264/AVC, the appearance order of NAL units within an access unit is constrained as follows. An optional access unit delimiter NAL unit may indicate the start of an access unit. It is followed by zero or more SEI NAL units. The coded slices of the primary coded picture appear next. In H.264/AVC, the coded slice of the primary coded picture may be followed by coded slices for zero or more redundant coded pictures. A redundant coded picture is a coded representation of a picture or a part of a picture. A redundant coded picture may be decoded if the primary coded picture is not received by the decoder for example due to a loss in transmission or a corruption in the physical storage medium.
In H.264/AVC, an access unit may also include an auxiliary coded picture, which is a picture that supplements the primary coded picture and may be used for example in the display process. An auxiliary coded picture may for example be used as an alpha channel or alpha plane specifying the transparency level of the samples in the decoded pictures. An alpha channel or plane may be used in a layered composition or rendering system, where the output picture is formed by overlaying pictures being at least partly transparent on top of each other. An auxiliary coded picture has the same syntactic and semantic restrictions as a monochrome redundant coded picture. In H.264/AVC, an auxiliary coded picture contains the same number of macroblocks as the primary coded picture.
In HEVC, an access unit may be defined as a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture. In addition to containing the VCL NAL units of the coded picture, an access unit may also contain non-VCL NAL units. In HEVC, the decoding of an access unit results in a decoded picture.
In H.264/AVC, a coded video sequence is defined to be a sequence of consecutive access units in decoding order from an IDR access unit, inclusive, to the next IDR access unit, exclusive, or to the end of the bitstream, whichever appears earlier. In a draft HEVC standard, a coded video sequence is defined to be a sequence of access units that consists, in decoding order, of a CRA access unit that is the first access unit in the bitstream, an IDR access unit or a BLA access unit, followed by zero or more non-IDR and non-BLA access units including all subsequent access units up to but not including any subsequent IDR or BLA access unit.
A group of pictures (GOP) and its characteristics may be defined as follows. A GOP can be decoded regardless of whether any previous pictures were decoded. An open GOP is such a group of pictures in which pictures preceding the initial intra picture in output order might not be correctly decodable when the decoding starts from the initial intra picture of the open GOP. In other words, pictures of an open GOP may refer (in inter prediction) to pictures belonging to a previous GOP. An H.264/AVC decoder can recognize an intra picture starting an open GOP from the recovery point SEI message in an H.264/AVC bitstream. An HEVC decoder can recognize an intra picture starting an open GOP, because a specific NAL unit type, the CRA NAL unit type, is used for its coded slices. A closed GOP is such a group of pictures in which all pictures can be correctly decoded when the decoding starts from the initial intra picture of the closed GOP. In other words, no picture in a closed GOP refers to any pictures in previous GOPs. In H.264/AVC and HEVC, a closed GOP starts from an IDR access unit. In HEVC, a closed GOP may also start from a BLA_W_DLP or a BLA_N_LP picture. As a result, the closed GOP structure has more error resilience potential in comparison to the open GOP structure, however at the cost of a possible reduction in compression efficiency. The open GOP coding structure is potentially more efficient in compression, due to a larger flexibility in the selection of reference pictures.
A Structure of Pictures (SOP) may be defined as one or more coded pictures consecutive in decoding order, in which the first coded picture in decoding order is a reference picture at the lowest temporal sub-layer and no coded picture except potentially the first coded picture in decoding order is a RAP picture. The relative decoding order of the pictures is illustrated by the numerals inside the pictures. Any picture in the previous SOP has a smaller decoding order than any picture in the current SOP and any picture in the next SOP has a larger decoding order than any picture in the current SOP. The term group of pictures (GOP) may sometimes be used interchangeably with the term SOP and having the same semantics as the semantics of SOP rather than the semantics of a closed or open GOP as described above.
The bitstream syntax of H.264/AVC and HEVC indicates whether a particular picture is a reference picture, which may be used as a reference for inter prediction of any other picture. Pictures of any coding type (I, P, B) can be reference pictures or non-reference pictures in H.264/AVC and HEVC. In H.264/AVC, the NAL unit header indicates the type of the NAL unit and whether a coded slice contained in the NAL unit is a part of a reference picture or a non-reference picture.
Many hybrid video codecs, including H.264/AVC and HEVC, encode video information in two phases. In the first phase, predictive coding is applied for example as so-called sample prediction and/or as so-called syntax prediction. In the sample prediction, pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways: Motion compensation mechanisms (which may also be referred to as temporal prediction or motion-compensated temporal prediction or motion-compensated prediction or MCP), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded.
Inter-view prediction, which involves finding and indicating an area in one of the previously encoded view components that corresponds closely to the block being coded.
View synthesis prediction, which involves synthesizing a prediction block or image area where a prediction block is derived on the basis of reconstructed/decoded ranging information.
Inter-layer prediction using reconstructed/decoded samples, such as the so-called IntraBL (base layer) mode of SVC.
Inter-layer residual prediction, in which for example the coded residual of a reference layer or a derived residual from a difference of a reconstructed/decoded reference layer picture and a corresponding reconstructed/decoded enhancement layer picture may be used for predicting a residual block of the current enhancement layer block. A residual block may be added for example to a motion-compensated prediction block to obtain a final prediction block for the current enhancement layer block.
Intra prediction, where pixel or sample values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship.
In the syntax prediction, which may also be referred to as parameter prediction, syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier.
Non-limiting examples of syntax prediction are provided below: In motion vector prediction, motion vectors e.g. for inter and/or inter-view prediction may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signalling the chosen candidate as the motion vector predictor.
In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in a temporal reference picture. Differential coding of motion vectors is typically disabled across slice boundaries. The block partitioning, e.g. from CTU to CUs and down to PUs, may be predicted.
In filter parameter prediction, the filtering parameters e.g. for sample adaptive offset may be predicted.
Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation. Prediction approaches using image information within the same image can also be called intra prediction methods.
The second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples. This may be accomplished by transforming the difference in pixel or sample values using a specified transform. This transform may be a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized and entropy encoded.
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
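The fidelity trade-off described above can be illustrated with a minimal sketch of uniform scalar quantization of transform coefficients. This is purely illustrative; the coefficient values and quantization steps are made up, and the actual H.264/AVC and HEVC quantizers use integer arithmetic and scaling matrices not shown here.

# Illustrative uniform quantization of transform coefficients: a larger step
# reduces the coded magnitude range (fewer bits) at the cost of a larger
# reconstruction error. This is a simplification, not the codec quantizer.
coeffs = [312.0, -47.5, 12.2, -3.1, 0.8]

def quantize(c, step):
    return [round(x / step) for x in c]

def dequantize(q, step):
    return [x * step for x in q]

for step in (4, 16, 64):
    q = quantize(coeffs, step)
    rec = dequantize(q, step)
    err = sum(abs(a - b) for a, b in zip(coeffs, rec))
    print(f"step={step:3d} levels={q} total abs error={err:.1f}")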
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and stored in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying pixel or sample prediction and error decoding processes the decoder may combine the prediction and the prediction error signals (the pixel or sample values) to form the output video frame.
The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming pictures in the video sequence.
Filtering may be used to reduce various artifacts such as blocking, ringing etc. from the reference images. After motion compensation followed by adding the inverse transformed residual, a reconstructed picture is obtained. This picture may have various artifacts such as blocking, ringing etc. In order to eliminate the artifacts, various post-processing operations may be applied. If the post-processed pictures are used as a reference in the motion compensation loop, then the post-processing operations/filters are usually called loop filters. By employing loop filters, the quality of the reference pictures increases. As a result, better coding efficiency can be achieved.
Filtering may comprise e.g. a deblocking filter, a Sample Adaptive Offset (SAO) filter and/or an Adaptive Loop Filter (ALF).
A deblocking filter may be used as one of the loop filters. A deblocking filter is available in both the H.264/AVC and HEVC standards. An aim of the deblocking filter is to remove the blocking artifacts occurring at the boundaries of the blocks. This may be achieved by filtering along the block boundaries.
In SAO, a picture is divided into regions where a separate SAO decision is made for each region. The SAO information in a region is encapsulated in an SAO parameters adaptation unit (SAO unit), and in HEVC the basic unit for adapting SAO parameters is the CTU (therefore an SAO region is the block covered by the corresponding CTU).
In the SAO algorithm, samples in a CTU are classified according to a set of rules and each classified set of samples is enhanced by adding offset values. The offset values are signalled in the bitstream. There are two types of offsets: 1) band offset 2) edge offset. For a CTU, either no SAO, band offset, or edge offset is employed. The choice of whether no SAO, band offset or edge offset is to be used may be decided by the encoder with e.g. rate distortion optimization (RDO) and signalled to the decoder.
In the band offset, the whole range of sample values is in some embodiments divided into 32 equal-width bands. For example, for 8-bit samples, the width of a band is 8 (=256/32). Out of the 32 bands, 4 of them are selected and different offsets are signalled for each of the selected bands. The selection decision is made by the encoder and may be signalled as follows: the index of the first band is signalled and then it is inferred that the following four bands are the chosen ones. The band offset may be useful in correcting errors in smooth regions.
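The band offset described above can be sketched as follows for 8-bit samples. The start band, offsets and sample values are made-up example inputs, not signalled values from any real bitstream.

# Sketch of SAO band offset for 8-bit samples: 32 bands of width 8; offsets
# are applied to four consecutive bands starting at a signalled band index.
def sao_band_offset(samples, start_band, offsets, bit_depth=8):
    band_width = (1 << bit_depth) // 32          # 256 / 32 = 8 for 8-bit samples
    out = []
    for s in samples:
        band = s // band_width
        if start_band <= band < start_band + 4:
            s = min(max(s + offsets[band - start_band], 0), (1 << bit_depth) - 1)
        out.append(s)
    return out

print(sao_band_offset([10, 70, 75, 80, 200], start_band=8, offsets=[2, -1, 3, 0]))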
In the edge offset type, the edge offset (EO) type may be chosen out of four possible types (or edge classifications) where each type is associated with a direction: 1) vertical, 2) horizontal, 3) 135 degrees diagonal, and 4) 45 degrees diagonal. The choice of the direction is given by the encoder and signalled to the decoder. Each type defines the location of two neighbour samples for a given sample based on the angle. Then each sample in the CTU is classified into one of five categories based on a comparison of the sample value against the values of the two neighbour samples. The five categories are described as follows: 1. Current sample value is smaller than the two neighbour samples 2. Current sample value is smaller than one of the neighbours and equal to the other neighbour 3. Current sample value is greater than one of the neighbours and equal to the other neighbour 4. Current sample value is greater than the two neighbour samples 5. None of the above These five categories are not required to be signalled to the decoder, because the classification is based on only reconstructed samples, which may be available and identical in both the encoder and decoder. After each sample in an edge offset type CTU is classified as one of the five categories, an offset value for each of the first four categories is determined and signalled to the decoder. The offset for each category is added to the sample values associated with the corresponding category. Edge offsets may be effective in correcting ringing artifacts.
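The five-category classification above can be sketched for one direction as follows. The sample row is a made-up example; only the classification step is shown, not the offset signalling.

# Sketch of the SAO edge offset classification: a sample is compared with its
# two neighbours along the signalled direction and put into one of five
# categories; offsets for the first four categories would be signalled.
def eo_category(left, cur, right):
    if cur < left and cur < right:
        return 1                      # local minimum
    if (cur < left and cur == right) or (cur == left and cur < right):
        return 2
    if (cur > left and cur == right) or (cur == left and cur > right):
        return 3
    if cur > left and cur > right:
        return 4                      # local maximum
    return 5                          # none of the above (no offset applied)

row = [10, 12, 12, 9, 15, 15, 15]
print([eo_category(row[i - 1], row[i], row[i + 1]) for i in range(1, len(row) - 1)])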
The SAO parameters may be signalled as interleaved in CTU data. Above the CTU level, the slice header contains a syntax element specifying whether SAO is used in the slice. If SAO is used, then two additional syntax elements specify whether SAO is applied to the Cb and Cr components. For each CTU, there are three options: 1) copying SAO parameters from the left CTU, 2) copying SAO parameters from the above CTU, or 3) signalling new SAO parameters.
While a specific implementation of SAO is described above, it should be understood that other implementations of SAO, which are similar to the above-described implementation, may also be possible. For example, rather than signalling SAO parameters as interleaved in CTU data, a picture-based signalling using a quadtree segmentation may be used. The merging of SAO parameters (i.e. using the same parameters as in the CTU to the left or above) or the quadtree structure may be determined by the encoder for example through a rate-distortion optimization process.
The adaptive loop filter (ALF) is another method to enhance the quality of the reconstructed samples. This may be achieved by filtering the sample values in the loop. ALF is a finite impulse response (FIR) filter for which the filter coefficients are determined by the encoder and encoded into the bitstream. The encoder may choose filter coefficients that attempt to minimize distortion relative to the original uncompressed picture e.g. with a least-squares method or Wiener filter optimization. The filter coefficients may for example reside in an Adaptation Parameter Set or slice header, or they may appear in the slice data for CUs in an interleaved manner with other CU-specific data.
In many video codecs, including H.264/AVC and HEVC, motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures). H.264/AVC and HEVC, as many other video compression standards, divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.
Inter prediction process may be characterized for example using one or more of the following factors.
The accuracy of motion vector representation.
For example, motion vectors may be of quarter-pixel accuracy, half-pixel accuracy or full-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter.
Many coding standards, including H.264/AVC and HEVC, allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.
This block may also be referred to as a motion partition.
Number of reference pictures for inter prediction.
The sources of inter prediction are previously decoded pictures. Many coding standards, including H.264/AVC and HEVC, enable storage of multiple reference pictures for inter prediction and selection of the used reference picture on a block basis.
For example, reference pictures may be selected on a macroblock or macroblock partition basis in H.264/AVC and on a PU or CU basis in HEVC. Many coding standards, such as H.264/AVC and HEVC, include syntax structures in the bitstream that enable decoders to create one or more reference picture lists. A reference picture index to a reference picture list may be used to indicate which one of the multiple reference pictures is used for inter prediction for a particular block. A reference picture index may be coded by an encoder into the bitstream in some inter coding modes or it may be derived (by an encoder and a decoder) for example using neighboring blocks in some other inter coding modes.
Motion vector prediction.
In order to represent motion vectors efficiently in bitstreams, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signalling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index may be predicted from adjacent blocks and/or co-located blocks in a temporal reference picture. Differential coding of motion vectors may be disabled across slice boundaries.
Multi-hypothesis motion-compensated prediction. H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices.
Individual blocks in B slices may be bi-predicted, uni-predicted, or intra-predicted, and individual blocks in P slices may be uni-predicted or intra-predicted. The reference pictures for a bi-predictive picture may not be limited to be the subsequent picture and the previous picture in output order, but rather any reference pictures may be used. In many coding standards, such as H.264/AVC and HEVC, one reference picture list, referred to as reference picture list 0, is constructed for P slices, and two reference picture lists, list 0 and list 1, are constructed for B slices. For B slices, prediction in the forward direction may refer to prediction from a reference picture in reference picture list 0, and prediction in the backward direction may refer to prediction from a reference picture in reference picture list 1, even though the reference pictures for prediction may have any decoding or output order relation to each other or to the current picture.
Weighted prediction.
Many coding standards use a prediction weight of 1 for prediction blocks of inter (P) pictures and 0.5 for each prediction block of a B picture (resulting in averaging). H.264/AVC allows weighted prediction for both P and B slices. In implicit weighted prediction, the weights are proportional to picture order counts, while in explicit weighted prediction, prediction weights are explicitly indicated. The weights for explicit weighted prediction may be indicated for example in one or more of the following syntax structures: a slice header, a picture header, a picture parameter set, an adaptation parameter set or any similar syntax structure.
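A simplified sketch of implicit weighted prediction follows: the weights are derived from picture order count distances so that the temporally closer reference contributes more. This is only an illustration of the proportionality idea; the actual H.264/AVC derivation uses fixed-point arithmetic and clipping, and the POC and sample values below are made up.

# Simplified illustration of implicit weighted prediction for a bi-predicted block.
def implicit_weights(poc_cur, poc_ref0, poc_ref1):
    d0, d1 = abs(poc_cur - poc_ref0), abs(poc_cur - poc_ref1)
    if d0 + d1 == 0:
        return 0.5, 0.5
    # the closer reference (smaller POC distance) gets the larger weight
    return d1 / (d0 + d1), d0 / (d0 + d1)

w0, w1 = implicit_weights(poc_cur=8, poc_ref0=6, poc_ref1=16)
pred0, pred1 = 100, 140                      # co-located prediction samples (made up)
print(w0, w1, w0 * pred0 + w1 * pred1)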
In many video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that often there still exists some correlation among the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
In a draft HEVC, each PU has prediction information associated with it defining what kind of a prediction is to be applied for the pixels within that PU (e.g. motion vector information for inter predicted PUs and intra prediction directionality information for intra predicted PUs). Similarly, each TU is associated with information describing the prediction error decoding process for the samples within the TU (including e.g. DCT coefficient information). It may be signalled at CU level whether prediction error coding is applied or not for each CU. In the case there is no prediction error residual associated with the CU, it can be considered there are no TUs for the CU.
In some coding formats and codecs, a distinction is made between so-called short-term and long-term reference pictures. This distinction may affect some decoding processes such as motion vector scaling in the temporal direct mode or implicit weighted prediction. If both of the reference pictures used for the temporal direct mode are short-term reference pictures, the motion vector used in the prediction may be scaled according to the picture order count (POC) difference between the current picture and each of the reference pictures. However, if at least one reference picture for the temporal direct mode is a long-term reference picture, default scaling of the motion vector may be used, for example scaling the motion to half may be used. Similarly, if a short-term reference picture is used for implicit weighted prediction, the prediction weight may be scaled according to the POC difference between the POC of the current picture and the POC of the reference picture. However, if a long-term reference picture is used for implicit weighted prediction, a default prediction weight may be used, such as 0.5 in implicit weighted prediction for bi-predicted blocks.
Some video coding formats, such as H.264/AVC, include the frame_num syntax element, which is used for various decoding processes related to multiple reference pictures. In H.264/AVC, the value of frame_num for IDR pictures is 0. The value of frame_num for non-IDR pictures is equal to the frame_num of the previous reference picture in decoding order incremented by 1 (in modulo arithmetic, i.e., the value of frame_num wraps over to 0 after a maximum value of frame_num).
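The wrap-around behaviour can be sketched as below. The maximum range is an assumed example value, and every picture is assumed to be a reference picture for brevity; gaps in frame_num are not modelled.

# Sketch of frame_num incrementing: non-IDR reference pictures increment
# frame_num by 1 modulo its maximum range, and an IDR picture resets it to 0.
MAX_FRAME_NUM = 16          # example value; derived from the SPS in practice

def next_frame_num(prev_frame_num, is_idr):
    return 0 if is_idr else (prev_frame_num + 1) % MAX_FRAME_NUM

frame_num = 0
for is_idr in [True, False, False, False, False]:
    frame_num = next_frame_num(frame_num, is_idr)
    print(frame_num)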
H.264/AVC and HEVC include a concept of picture order count (POC). A value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures. POC may be used in the decoding process for example for implicit scaling of motion vectors in the temporal direct mode of bi-predictive slices, for implicitly derived weights in weighted prediction, and for reference picture list initialization. Furthermore, POC may be used in the verification of output order conformance. In H.264/AVC, POC is specified relative to the previous IDR picture or a picture containing a memory management control operation marking all pictures as "unused for reference".
H.264/AVC specifies the process for decoded reference picture marking in order to control the memory consumption in the decoder. The maximum number of reference pictures used for inter prediction, referred to as M, is determined in the sequence parameter set. When a reference picture is decoded, it is marked as "used for reference". If the decoding of the reference picture caused more than M pictures to be marked as "used for reference", at least one picture is marked as "unused for reference". There are two types of operation for decoded reference picture marking: adaptive memory control and sliding window. The operation mode for decoded reference picture marking is selected on picture basis. The adaptive memory control enables explicit signalling of which pictures are marked as "unused for reference" and may also assign long-term indices to short-term reference pictures. The adaptive memory control may require the presence of memory management control operation (MMCO) parameters in the bitstream. MMCO parameters may be included in a decoded reference picture marking syntax structure. If the sliding window operation mode is in use and there are M pictures marked as "used for reference", the short-term reference picture that was the first decoded picture among those short-term reference pictures that are marked as "used for reference" is marked as "unused for reference". In other words, the sliding window operation mode results in a first-in-first-out buffering operation among short-term reference pictures.
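The sliding window mode can be sketched as a first-in-first-out buffer as below. The value of M and the picture identifiers are made-up examples, and long-term pictures (which are exempt from the sliding window) are not modelled.

# Sketch of sliding window decoded reference picture marking: when more than
# M pictures are marked "used for reference", the oldest short-term reference
# picture (in decoding order) is marked "unused for reference".
from collections import deque

M = 3
short_term = deque()                 # oldest decoded picture first

def mark_decoded_reference(pic_id):
    short_term.append(pic_id)
    if len(short_term) > M:
        removed = short_term.popleft()
        print(f"picture {removed} marked as unused for reference")

for pic in range(6):
    mark_decoded_reference(pic)
print("still used for reference:", list(short_term))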
One of the memory management control operations in H.264/AVC causes all reference pictures except for the current picture to be marked as "unused for reference". An instantaneous decoding refresh (IDR) picture contains only intra-coded slices and causes a similar "reset" of reference pictures.
In a draft HEVC standard, reference picture marking syntax structures and related decoding processes are not used, but instead a reference picture set (RPS) syntax structure and decoding process are used for a similar purpose. A reference picture set valid or active for a picture includes all the reference pictures used as a reference for the picture and all the reference pictures that are kept marked as "used for reference" for any subsequent pictures in decoding order. There are six subsets of the reference picture set, which are referred to as RefPicSetStCurr0 (which may also or alternatively be referred to as RefPicSetStCurrBefore), RefPicSetStCurr1 (which may also or alternatively be referred to as RefPicSetStCurrAfter), RefPicSetStFoll0, RefPicSetStFoll1, RefPicSetLtCurr, and RefPicSetLtFoll. In some HEVC draft specifications, RefPicSetStFoll0 and RefPicSetStFoll1 are regarded as one subset, which may be referred to as RefPicSetStFoll. The notation of the six subsets is as follows. "Curr" refers to reference pictures that are included in the reference picture lists of the current picture and hence may be used as inter prediction reference for the current picture. "Foll" refers to reference pictures that are not included in the reference picture lists of the current picture but may be used in subsequent pictures in decoding order as reference pictures. "St" refers to short-term reference pictures, which may generally be identified through a certain number of least significant bits of their POC value. "Lt" refers to long-term reference pictures, which are specifically identified and generally have a greater difference of POC values relative to the current picture than what can be represented by the mentioned certain number of least significant bits. "0" refers to those reference pictures that have a smaller POC value than that of the current picture. "1" refers to those reference pictures that have a greater POC value than that of the current picture. RefPicSetStCurr0, RefPicSetStCurr1, RefPicSetStFoll0 and RefPicSetStFoll1 are collectively referred to as the short-term subset of the reference picture set. RefPicSetLtCurr and RefPicSetLtFoll are collectively referred to as the long-term subset of the reference picture set.
In a draft HEVC standard, a reference picture set may be specified in a sequence parameter set and taken into use in the slice header through an index to the reference picture set. A reference picture set may also be specified in a slice header. A long-term subset of a reference picture set is generally specified only in a slice header, while the short-term subsets of the same reference picture set may be specified in the picture parameter set or slice header. A reference picture set may be coded independently or may be predicted from another reference picture set (known as inter-RPS prediction).
When a reference picture set is independently coded, the syntax structure includes up to three loops iterating over different types of reference pictures: short-term reference pictures with lower POC value than the current picture, short-term reference pictures with higher POC value than the current picture, and long-term reference pictures. Each loop entry specifies a picture to be marked as "used for reference". In general, the picture is specified with a differential POC value. The inter-RPS prediction exploits the fact that the reference picture set of the current picture can be predicted from the reference picture set of a previously decoded picture. This is because all the reference pictures of the current picture are either reference pictures of the previous picture or the previously decoded picture itself. It is only necessary to indicate which of these pictures should be reference pictures and be used for the prediction of the current picture. In both types of reference picture set coding, a flag (used_by_curr_pic_X_flag) is additionally sent for each reference picture indicating whether the reference picture is used for reference by the current picture (included in a *Curr list) or not (included in a *Foll list). Pictures that are included in the reference picture set used by the current slice are marked as "used for reference", and pictures that are not in the reference picture set used by the current slice are marked as "unused for reference". If the current picture is an IDR picture, RefPicSetStCurr0, RefPicSetStCurr1, RefPicSetStFoll0, RefPicSetStFoll1, RefPicSetLtCurr, and RefPicSetLtFoll are all set to empty.
A Decoded Picture Buffer (DPB) may be used in the encoder and/or in the decoder.
There are two reasons to buffer decoded pictures: for references in inter prediction and for reordering decoded pictures into output order. As H.264/AVC and HEVC provide a great deal of flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Hence, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is not needed for output.
In many coding modes of H.264/AVC and HEVC, the reference picture for inter prediction is indicated with an index to a reference picture list. The index may be coded with variable length coding, which usually causes a smaller index to have a shorter value for the corresponding syntax element. In H.264/AVC and HEVC, two reference picture lists (reference picture list 0 and reference picture list 1) are generated for each bi-predictive (B) slice, and one reference picture list (reference picture list 0) is formed for each inter-coded (P) slice. In addition, for a B slice in a draft HEVC standard, a combined list (List C) is constructed after the final reference picture lists (List 0 and List 1) have been constructed. The combined list may be used for uni-prediction (also known as uni-directional prediction) within B slices. A reference picture list, such as reference picture list 0 and reference picture list 1, may be constructed in two steps: First, an initial reference picture list is generated. The initial reference picture list may be generated for example on the basis of frame_num, POC, temporal_id, or information on the prediction hierarchy such as GOP structure, or any combination thereof. Second, the initial reference picture list may be reordered by reference picture list reordering (RPLR) commands, also known as the reference picture list modification syntax structure, which may be contained in slice headers. The RPLR commands indicate the pictures that are ordered to the beginning of the respective reference picture list. This second step may also be referred to as the reference picture list modification process, and the RPLR commands may be included in a reference picture list modification syntax structure. If reference picture sets are used, the reference picture list 0 may be initialized to contain RefPicSetStCurr0 first, followed by RefPicSetStCurr1, followed by RefPicSetLtCurr. Reference picture list 1 may be initialized to contain RefPicSetStCurr1 first, followed by RefPicSetStCurr0. The initial reference picture lists may be modified through the reference picture list modification syntax structure, where pictures in the initial reference picture lists may be identified through an entry index to the list.
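The initial list construction from the RPS subsets described above can be sketched as follows. The POC values in each subset are made-up examples, and the long-term subset is shown only for list 0 as in the text above.

# Sketch of initial reference picture list construction from the RPS subsets.
RefPicSetStCurr0 = [8, 6]        # short-term, POC below the current picture
RefPicSetStCurr1 = [12, 14]      # short-term, POC above the current picture
RefPicSetLtCurr  = [0]           # long-term

list0 = RefPicSetStCurr0 + RefPicSetStCurr1 + RefPicSetLtCurr
list1 = RefPicSetStCurr1 + RefPicSetStCurr0
print("initial list 0:", list0)
print("initial list 1:", list1)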
The combined list in a draft HEVC standard may be constructed as follows. If the modification flag for the combined list is zero, the combined list is constructed by an implicit mechanism; otherwise it is constructed by reference picture combination commands included in the bitstream. In the implicit mechanism, reference pictures in List C are mapped to reference pictures from List 0 and List 1 in an interleaved fashion starting from the first entry of List 0, followed by the first entry of List 1 and so forth. Any reference picture that has already been mapped in List C is not mapped again. In the explicit mechanism, the number of entries in List C is signalled, followed by the mapping from an entry in List 0 or List 1 to each entry of List C. In addition, when List 0 and List 1 are identical the encoder has the option of setting the ref_pic_list_combination_flag to 0 to indicate that no reference pictures from List 1 are mapped, and that List C is equivalent to List 0.
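The implicit interleaving can be sketched as follows. The picture labels are made up, and the two lists are assumed to be of equal length for brevity.

# Sketch of the implicit construction of the combined list (List C): entries of
# List 0 and List 1 are interleaved, and a picture already in List C is skipped.
list0 = ['P8', 'P4', 'P0']
list1 = ['P12', 'P8', 'P16']

list_c = []
for l0, l1 in zip(list0, list1):
    for pic in (l0, l1):
        if pic not in list_c:
            list_c.append(pic)
print(list_c)      # interleaved, duplicates not mapped again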
The advanced motion vector prediction (AMVP) may operate for example as follows, while other similar realizations of advanced motion vector prediction are also possible, for example with different candidate position sets and candidate locations within candidate position sets. Two spatial motion vector predictors (MVPs) may be derived and a temporal motion vector predictor (TMVP) may be derived. They may be selected among the positions shown in Figure 10: three spatial motion vector predictor candidate positions 103, 104, 105 located above the current prediction block 100 (B0, B1, B2) and two 101, 102 on the left (A0, A1). The first motion vector predictor that is available (e.g. resides in the same slice, is inter-coded, etc.) in a predefined order of each candidate position set, (B0, B1, B2) or (A0, A1), may be selected to represent that prediction direction (up or left) in the motion vector competition. A reference index for the temporal motion vector predictor may be indicated by the encoder in the slice header (e.g. as a collocated_ref_idx syntax element). The motion vector obtained from the co-located picture may be scaled according to the proportions of the picture order count differences of the reference picture of the temporal motion vector predictor, the co-located picture, and the current picture. Moreover, a redundancy check may be performed among the candidates to remove identical candidates, which can lead to the inclusion of a zero motion vector in the candidate list. The motion vector predictor may be indicated in the bitstream for example by indicating the direction of the spatial motion vector predictor (up or left) or the selection of the temporal motion vector predictor candidate.
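The candidate selection outlined above can be sketched as follows. The availability flags, motion vectors and the already-scaled temporal candidate are made-up illustration inputs; the final list size of two is an assumption for this sketch.

# Sketch of AMVP candidate selection: the first available predictor in each
# spatial candidate set is taken, a temporal predictor may be added, duplicates
# are removed, and a zero vector may pad the list.
def first_available(candidates):
    for mv, available in candidates:
        if available:
            return mv
    return None

left_set  = [((0, 0), False), ((3, -1), True)]                   # A0, A1
above_set = [((3, -1), True), ((5, 2), True), ((1, 1), True)]    # B0, B1, B2
temporal  = (4, -2)                                              # scaled TMVP

candidates = [first_available(left_set), first_available(above_set), temporal]
unique = []
for mv in candidates:
    if mv is not None and mv not in unique:
        unique.append(mv)
while len(unique) < 2:
    unique.append((0, 0))
print(unique)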
In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index may be predicted from adjacent blocks and/or from co-located blocks in a temporal reference picture.
Many high efficiency video codecs such as a draft HEVC codec employ an additional motion information coding/decoding mechanism, often called merging/merge mode/process/mechanism, where all the motion information of a block/PU is predicted and used without any modification/correction. The aforementioned motion information for a PU may comprise one or more of the following: 1) the information whether the PU is uni-predicted using only reference picture list0, or the PU is uni-predicted using only reference picture list1, or the PU is bi-predicted using both reference picture list0 and list1; 2) the motion vector value corresponding to reference picture list0, which may comprise a horizontal and vertical motion vector component; 3) the reference picture index in reference picture list0 and/or an identifier of a reference picture pointed to by the motion vector corresponding to reference picture list 0, where the identifier of a reference picture may be for example a picture order count value, a layer identifier value (for inter-layer prediction), or a pair of a picture order count value and a layer identifier value; 4) information of the reference picture marking of the reference picture, e.g. information whether the reference picture was marked as "used for short-term reference" or "used for long-term reference"; 5)-7) the same as 2)-4), respectively, but for reference picture list1.
Similarly, predicting the motion information is carried out using the motion information of adjacent blocks and/or co-located blocks in temporal reference pictures. A list, often called a merge list, may be constructed by including motion prediction candidates associated with available adjacent/co-located blocks, and the index of the selected motion prediction candidate in the list is signalled and the motion information of the selected candidate is copied to the motion information of the current PU. When the merge mechanism is employed for a whole CU and the prediction signal for the CU is used as the reconstruction signal, i.e. the prediction residual is not processed, this type of coding/decoding the CU is typically named skip mode or merge based skip mode. In addition to the skip mode, the merge mechanism may also be employed for individual PUs (not necessarily the whole CU as in skip mode) and in this case, prediction residual may be utilized to improve prediction quality. This type of prediction mode is typically named an inter-merge mode.
One of the candidates in the merge list may be a TMVP candidate, which may be derived from the collocated block within an indicated or inferred reference picture, such as the reference picture indicated for example in the slice header for example using the collocated_ref_idx syntax element or alike. In HEVC, the so-called target reference index for temporal motion vector prediction in the merge list is set as 0 when the motion coding mode is the merge mode. When the motion coding mode in HEVC utilizing the temporal motion vector prediction is the advanced motion vector prediction mode, the target reference index values are explicitly indicated (e.g. per each PU).
When the target reference index value has been determined, the motion vector value of the temporal motion vector prediction may be derived as follows: The motion vector at the block that is co-located with the bottom-right neighbour of the current prediction unit is calculated. The picture where the co-located block resides may be e.g. determined according to the signalled reference index in the slice header as described above. The determined motion vector at the co-located block is scaled with respect to the ratio of a first picture order count difference and a second picture order count difference. The first picture order count difference is derived between the picture containing the co-located block and the reference picture of the motion vector of the co-located block. The second picture order count difference is derived between the current picture and the target reference picture. If one but not both of the target reference picture and the reference picture of the motion vector of the co-located block is a long-term reference picture (while the other is a short-term reference picture), the TMVP candidate may be considered unavailable. If both of the target reference picture and the reference picture of the motion vector of the co-located block are long-term reference pictures, no POC-based motion vector scaling may be applied.
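The POC-based scaling described above can be sketched as follows. The motion vector and POC values are made-up examples, rounding is simplified, and the long-term special cases from the paragraph above are omitted.

# Sketch of POC-based scaling of the co-located motion vector: the vector is
# scaled by the ratio of the current-to-target POC difference and the
# co-located-picture-to-its-reference POC difference.
def scale_tmvp(mv, poc_cur, poc_target_ref, poc_col, poc_col_ref):
    tb = poc_cur - poc_target_ref        # second POC difference
    td = poc_col - poc_col_ref           # first POC difference
    if td == 0:
        return mv
    scale = tb / td
    return (round(mv[0] * scale), round(mv[1] * scale))

# co-located MV (8, -4) spans 4 POC units; the current prediction spans 2 POC units
print(scale_tmvp((8, -4), poc_cur=10, poc_target_ref=8, poc_col=12, poc_col_ref=8))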
Motion parameter types or motion information may include but are not limited to one or more of the following types: an indication of a prediction type (e.g. intra prediction, uni-prediction, bi-prediction) and/or a number of reference pictures; an indication of a prediction direction, such as inter (a.k.a. temporal) prediction, inter-layer prediction, inter-view prediction, view synthesis prediction (VSP), and inter-component prediction (which may be indicated per reference picture and/or per prediction type and where in some embodiments inter-view and view-synthesis prediction may be jointly considered as one prediction direction) and/or an indication of a reference picture type, such as a short-term reference picture and/or a long-term reference picture and/or an inter-layer reference picture (which may be indicated e.g. per reference picture); a reference index to a reference picture list and/or any other identifier of a reference picture (which may be indicated e.g. per reference picture and the type of which may depend on the prediction direction and/or the reference picture type and which may be accompanied by other relevant pieces of information, such as the reference picture list or alike to which the reference index applies); a horizontal motion vector component (which may be indicated e.g. per prediction block or per reference index or alike); a vertical motion vector component (which may be indicated e.g. per prediction block or per reference index or alike); one or more parameters, such as picture order count difference and/or a relative camera separation between the picture containing or associated with the motion parameters and its reference picture, which may be used for scaling of the horizontal motion vector component and/or the vertical motion vector component in one or more motion vector prediction processes (where said one or more parameters may be indicated e.g. per each reference picture or each reference index or alike); coordinates of a block to which the motion parameters and/or motion information applies, e.g. coordinates of the top-left sample of the block in luma sample units; extents (e.g. a width and a height) of a block to which the motion parameters and/or motion information applies.
A motion field associated with a picture may be considered to comprise a set of motion information produced for every coded block of the picture. A motion field may be accessible by coordinates of a block, for example. A motion field may be used for example in TMVP or any other motion prediction mechanism where a source or a reference for prediction other than the current (de)coded picture is used.
Different spatial granularity or units may be applied to represent and/or store a motion field. For example, a regular grid of spatial units may be used. For example, a picture may be divided into rectangular blocks of certain size (with the possible exception of blocks at the edges of the picture, such as on the right edge and the bottom edge). For example, the size of the spatial unit may be equal to the smallest size for which a distinct motion can be indicated by the encoder in the bitstream, such as a 4x4 block in luma sample units. For example, a so-called compressed motion field may be used, where the spatial unit may be equal to a pre-defined or indicated size, such as a 16x16 block in luma sample units, which size may be greater than the smallest size for indicating distinct motion. For example, an HEVC encoder and/or decoder may be implemented in a manner that a motion data storage reduction (MDSR) is performed for each decoded motion field (prior to using the motion field for any prediction between pictures). In an HEVC implementation, MDSR may reduce the granularity of motion data to 16x16 blocks in luma sample units by keeping the motion applicable to the top-left sample of the 16x16 block in the compressed motion field. The encoder may encode indication(s) related to the spatial unit of the compressed motion field as one or more syntax elements and/or syntax element values for example in a sequence-level syntax structure, such as a video parameter set or a sequence parameter set. In some (de)coding methods and/or devices, a motion field may be represented and/or stored according to the block partitioning of the motion prediction (e.g. according to prediction units of the HEVC standard). In some (de)coding methods and/or devices, a combination of a regular grid and block partitioning may be applied so that motion associated with partitions greater than a pre-defined or indicated spatial unit size is represented and/or stored associated with those partitions, whereas motion associated with partitions smaller than or unaligned with a pre-defined or indicated spatial unit size or grid is represented and/or stored for the pre-defined or indicated units.
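The motion data storage reduction mentioned above can be sketched as follows. The grid sizes and motion vectors are made-up example inputs; real implementations work on the decoder's internal motion buffers rather than nested lists.

# Sketch of MDSR: a motion field stored on a 4x4 grid is compressed to a 16x16
# grid by keeping, for each 16x16 block, the motion of its top-left 4x4 unit.
def compress_motion_field(field_4x4):
    """field_4x4[y][x] holds the motion of the 4x4 unit at position (x, y)."""
    step = 16 // 4                       # 4x4 units per 16x16 block side
    return [[field_4x4[y][x] for x in range(0, len(field_4x4[0]), step)]
            for y in range(0, len(field_4x4), step)]

# an 8x8 grid of 4x4 units (i.e. a 32x32 luma area) with made-up vectors
field = [[(x, y) for x in range(8)] for y in range(8)]
print(compress_motion_field(field))      # 2x2 grid, one vector per 16x16 block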
There may be a reference picture lists combination syntax structure, created into the bitstream by an encoder and decoded from the bitstream by a decoder, which indicates the contents of a combined reference picture list. The syntax structure may indicate that the reference picture list 0 and the reference picture list 1 are combined to be an additional reference picture lists combination (e.g. a merge list) used for the prediction units being uni-directionally predicted. The syntax structure may include a flag which, when equal to a certain value, indicates that the reference picture list 0 and the reference picture list 1 are identical and thus the reference picture list 0 is used as the reference picture lists combination. The syntax structure may include a list of entries, each specifying a reference picture list (list 0 or list 1) and a reference index to the specified list, where an entry specifies a reference picture to be included in the combined reference picture list.
A syntax structure for decoded reference picture marking may exist in a video coding system. For example, when the decoding of the picture has been completed, the decoded reference picture marking syntax structure, if present, may be used to adaptively mark pictures as "unused for reference" or "used for long-term reference". If the decoded reference picture marking syntax structure is not present and the number of pictures marked as "used for reference" can no longer increase, a sliding window reference picture marking may be used, which basically marks the earliest (in decoding order) decoded reference picture as unused for reference.
Inter-picture motion vector prediction and its relation to scalable video coding
Multi-view coding has been realized as a multi-loop scalable video coding scheme, where the inter-view reference pictures are added into the reference picture lists. In MVC the inter-view reference components and inter-view only reference components that are included in the reference picture lists may be considered as not being marked as "used for short-term reference" or "used for long-term reference". In the derivation of the temporal direct luma motion vector, the co-located motion vector may not be scaled if the picture order count difference of the List 1 reference (from which the co-located motion vector is obtained) and the List 0 reference is 0, i.e. if td is equal to 0 in Figure 6c.
Figure 6a illustrates an example of spatial and temporal prediction of a prediction unit.
There is depicted the current block 601 in the frame 600 and a neighbor block 602 which already has been encoded. A motion vector definer 362 (Figure 4a) has defined a motion vector 603 for the neighbor block 602 which points to a block 604 in the previous frame 605. This motion vector can be used as a potential spatial motion vector prediction 610 for the current block. Figure 6a depicts that a co-located block 606 in the previous frame 605, i.e. the block at the same location as the current block but in the previous frame, has a motion vector 607 pointing to a block 609 in another frame 608.
This motion vector 607 can be used as a potential temporal motion vector prediction 611 for the current block 601.
Figure 6b illustrates another example of spatial and temporal prediction of a prediction unit. In this example the block 606 of the previous frame 605 uses bi-directional prediction based on the block 609 of the frame 608 preceding the frame 605 and on the block 612 in the frame 613 succeeding the current frame 600. The temporal motion vector prediction for the current block 601 may be formed by using both motion vectors 607, 614 or either of them.
In HEVC temporal motion vector prediction (TMVP), the reference picture list to be used for obtaining a collocated partition is chosen according to the collocated_from_l0_flag syntax element in the slice header. When the flag is equal to 1, it specifies that the picture that contains the collocated partition is derived from list 0, otherwise the picture is derived from list 1. When collocated_from_l0_flag is not present, it is inferred to be equal to 1. The collocated_ref_idx in the slice header specifies the reference index of the picture that contains the collocated partition. When the current slice is a P slice, collocated_ref_idx refers to a picture in list 0. When the current slice is a B slice, collocated_ref_idx refers to a picture in list 0 if collocated_from_l0_flag is 1, otherwise it refers to a picture in list 1. collocated_ref_idx always refers to a valid list entry, and the resulting picture is the same for all slices of a coded picture. When collocated_ref_idx is not present, it is inferred to be equal to 0.
In HEVC, when the current PU uses the merge mode, the target reference index for TMVP is set to 0 (for both reference picture list 0 and 1). In AMVP, the target reference index is indicated in the bitstream.
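A minimal sketch of the list and index selection described above is given below. It mirrors the inference rules for collocated_from_l0_flag and collocated_ref_idx and the fixed target reference index of the merge mode; the slice-header representation and helper names are assumptions made for illustration.

```python
# Sketch of how a decoder might pick the picture containing the collocated
# partition for TMVP. Slice-header fields are modelled as a dict; default
# values follow the inference rules described above.

def collocated_picture(slice_header, ref_pic_list0, ref_pic_list1):
    slice_type = slice_header["slice_type"]                    # "P" or "B"
    from_l0 = slice_header.get("collocated_from_l0_flag", 1)   # inferred 1 if absent
    ref_idx = slice_header.get("collocated_ref_idx", 0)        # inferred 0 if absent
    if slice_type == "P" or from_l0 == 1:
        return ref_pic_list0[ref_idx]
    return ref_pic_list1[ref_idx]

def merge_target_ref_idx():
    # In the merge mode the TMVP target reference index is 0 for both lists.
    return 0
```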
In HEVC, the availability of a candidate predicted motion vector (PMV) for the merge mode may be determined as follows (both for spatial and temporal candidates) (STRP = short-term reference picture, LTRP = long-term reference picture):

    reference picture for       reference picture for       candidate PMV
    target reference index      candidate PMV               availability
    STRP                        STRP                        "available" (and scaled)
    STRP                        LTRP                        "unavailable"
    LTRP                        STRP                        "unavailable"
    LTRP                        LTRP                        "available" (not scaled)

Motion vector scaling may be performed in the case that both the target reference picture and the reference picture for the candidate PMV are short-term reference pictures. The scaling may be performed by scaling the motion vector with appropriate POC differences related to the candidate motion vector and the target reference picture relative to the current picture, e.g. with the POC difference of the current picture and the target reference picture divided by the POC difference of the picture containing the candidate PMV and its reference picture.
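The POC-based scaling just described can be sketched as follows. This is a simplified illustration only; the clipping and fixed-point arithmetic used by the actual standards are omitted, and floating point is used for readability.

```python
# Simplified illustration of motion vector scaling by POC distances.
# tb = POC distance between the current picture and the target reference
# td = POC distance between the picture containing the candidate PMV and
#      its reference picture

def scale_motion_vector(mv, poc_current, poc_target_ref,
                        poc_candidate_pic, poc_candidate_ref):
    tb = poc_current - poc_target_ref
    td = poc_candidate_pic - poc_candidate_ref
    if td == 0:
        return mv  # no scaling when the POC difference is zero
    factor = tb / td
    return (round(mv[0] * factor), round(mv[1] * factor))
```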
In Figure 11a illustrating the operation of the HEVC merge mode for multiview video (e.g. MV-HEVC), the motion vector in the co-located PU, if referring to a short-term (ST) reference picture, is scaled to form a merge candidate of the current PU (PU0), wherein MV0 is scaled to MV0' during the merge mode. However, if the co-located PU has a motion vector (MV1) referring to an inter-view reference picture, marked as long-term, the motion vector is not used to predict the current PU (PU1), as the reference picture corresponding to reference index 0 is a short-term reference picture and the reference picture of the candidate PMV is a long-term reference picture.
In some embodiments a new additional reference index (ref_idx_Add, also referred to as refIdxAdditional) may be derived so that the motion vectors referring to a long-term reference picture can be used to form a merge candidate and are not considered as unavailable (when ref_idx 0 points to a short-term picture). If ref_idx 0 points to a short-term reference picture, refIdxAdditional is set to point to the first long-term picture in the reference picture list. Vice versa, if ref_idx 0 points to a long-term picture, refIdxAdditional is set to point to the first short-term reference picture in the reference picture list. refIdxAdditional is used in the merge mode instead of ref_idx 0 if its "type" (long-term or short-term) matches that of the co-located reference index. An example of this is illustrated in Figure 11b.
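A sketch of the refIdxAdditional derivation described in this embodiment might look as follows. The picture objects and the is_long_term attribute are illustrative assumptions; the logic simply follows the rule stated above.

```python
# Illustrative derivation of refIdxAdditional: when ref_idx 0 points to a
# short-term picture, refIdxAdditional points to the first long-term picture
# in the list, and vice versa. Returns None if no picture of the other type
# is present in the list.

def derive_ref_idx_additional(ref_pic_list):
    first_is_long_term = ref_pic_list[0].is_long_term
    for idx, pic in enumerate(ref_pic_list):
        if pic.is_long_term != first_is_long_term:
            return idx
    return None

def merge_ref_idx_for_candidate(ref_pic_list, colocated_is_long_term):
    """Use refIdxAdditional instead of ref_idx 0 when its type matches the
    type of the co-located reference index."""
    ref_idx_additional = derive_ref_idx_additional(ref_pic_list)
    if (ref_idx_additional is not None and
            ref_pic_list[ref_idx_additional].is_long_term == colocated_is_long_term):
        return ref_idx_additional
    return 0
```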
A coding technique known as isolated regions is based on constraining in-picture prediction and inter prediction jointly. An isolated region in a picture can contain any macroblock (or alike) locations, and a picture can contain zero or more isolated regions that do not overlap. A leftover region, if any, is the area of the picture that is not covered by any isolated region of a picture. When coding an isolated region, at least some types of in-picture prediction are disabled across its boundaries. A leftover region may be predicted from isolated regions of the same picture.
A coded isolated region can be decoded without the presence of any other isolated or leftover region of the same coded picture. It may be necessary to decode all isolated regions of a picture before the leftover region. In some implementations, an isolated region or a leftover region contains at least one slice.
Pictures whose isolated regions are predicted from each other may be grouped into an isolated-region picture group. An isolated region can be inter-predicted from the corresponding isolated region in other pictures within the same isolated-region picture group, whereas inter prediction from other isolated regions or outside the isolated-region picture group may be disallowed. A leftover region may be inter-predicted from any isolated region. The shape, location, and size of coupled isolated regions may evolve from picture to picture in an isolated-region picture group.
Coding of isolated regions in the H.264/AVC codec may be based on slice groups. The mapping of macroblock locations to slice groups may be specified in the picture parameter set. The H.264/AVC syntax includes syntax to code certain slice group patterns, which can be categorized into two types, static and evolving. The static slice groups stay unchanged as long as the picture parameter set is valid, whereas the evolving slice groups can change picture by picture according to the corresponding parameters in the picture parameter set and a slice group change cycle parameter in the slice header. The static slice group patterns include interleaved, checkerboard, rectangular oriented, and freeform. The evolving slice group patterns include horizontal wipe, vertical wipe, box-in, and box-out. The rectangular oriented pattern and the evolving patterns are especially suited for coding of isolated regions and are described more carefully in the following.
For a rectangular oriented slice group pattern, a desired number of rectangles are specified within the picture area. A foreground slice group includes the macroblock locations that are within the corresponding rectangle but excludes the macroblock locations that are already allocated by slice groups specified earlier. A leftover slice group contains the macroblocks that are not covered by the foreground slice groups.
An evolving slice group is specified by indicating the scan order of macroblock locations and the change rate of the size of the slice group in number of macroblocks per picture.
Each coded picture is associated with a slice group change cycle parameter (conveyed in the slice header). The change cycle multiplied by the change rate indicates the number of macroblocks in the first slice group. The second slice group contains the rest of the macroblock locations.
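For illustration, with an assumed change rate of 20 macroblocks per picture and a slice group change cycle of 3 conveyed in the slice header, the first slice group covers 3 x 20 = 60 macroblock locations and the remaining locations form the second slice group. A small sketch of this arithmetic (the function name and the example numbers are illustrative):

```python
# Illustrative computation of the evolving slice group split.
# change_rate comes from the picture parameter set, change_cycle from the
# slice header; total_macroblocks is the number of macroblocks in a picture.

def evolving_slice_group_sizes(change_rate, change_cycle, total_macroblocks):
    first_group = min(change_rate * change_cycle, total_macroblocks)
    second_group = total_macroblocks - first_group
    return first_group, second_group

# Example: change rate 20 MBs/picture, change cycle 3, 396 MBs (a CIF picture)
# -> (60, 336)
print(evolving_slice_group_sizes(20, 3, 396))
```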
In H.264/AVC, in-picture prediction is disabled across slice group boundaries, because slice group boundaries lie at slice boundaries. Therefore each slice group is an isolated region or leftover region.
Each slice group has an identification number within a picture. Encoders can restrict the motion vectors in a way that they only refer to the decoded macroblocks belonging to slice groups having the same identification number as the slice group to be encoded. Encoders should take into account the fact that a range of source samples is needed in fractional pixel interpolation and all the source samples should be within a particular slice group.
The H.264/AVC codec includes a deblocking loop filter. Loop filtering is applied to each 4x4 block boundary, but loop filtering can be turned off by the encoder at slice boundaries. If loop filtering is turned off at slice boundaries, perfect reconstructed pictures at the decoder can be achieved when performing gradual random access.
Otherwise, reconstructed pictures may be imperfect in content even after the recovery point.
The recovery point SEI message and the motion constrained slice group set SEI message of the H.264/AVC standard can be used to indicate that some slice groups are coded as isolated regions with restricted motion vectors. Decoders may utilize the information for example to achieve faster random access or to save processing time by ignoring the leftover region.
A sub-picture concept has been proposed for HEVC, which is similar to rectangular isolated regions or rectangular motion-constrained slice group sets of H.264/AVC. An implementation of the sub-picture concept may be described as follows, while it should be understood that sub-pictures may be defined otherwise, similarly but not identically to what is described below. In this sub-picture concept, the picture is partitioned into predefined rectangular regions. Each sub-picture would be processed as an independent picture except that all sub-pictures constituting a picture share the same global information such as SPS, PPS and reference picture sets. Sub-pictures are similar to tiles geometrically. Their properties are as follows: They are LCU-aligned rectangular regions specified at sequence level. Sub-pictures in a picture may be scanned in sub-picture raster scan of the picture. Each sub-picture starts a new slice. If multiple tiles are present in a picture, sub-picture boundaries and tile boundaries may be aligned. There may be no loop filtering across sub-pictures. There may be no prediction of sample value and motion information outside the sub-picture, and no sample value at a fractional sample position that is derived using one or more sample values outside the sub-picture may be used to inter predict any sample within the sub-picture. If motion vectors point to regions outside of a sub-picture, a padding process defined for picture boundaries may be applied. LCUs are scanned in raster order within sub-pictures unless a sub-picture contains more than one tile. Tiles within a sub-picture are scanned in tile raster scan of the sub-picture. Tiles cannot cross sub-picture boundaries except for the default one-tile-per-picture case. All coding mechanisms that are available at picture level are supported at sub-picture level.
Many video coding standards specify buffering models and buffering parameters for bitstreams. Such buffering models may be called Hypothetical Reference Decoder (HRD) or Video Buffer Verifier (VBV). A standard compliant bitstream complies with the buffering model with a set of buffering parameters specified in the corresponding standard. Such buffering parameters for a bitstream may be explicitly or implicitly signaled. Implicitly signaled means for example that the default buffering parameter values according to the profile and level apply. The HRD/VBV parameters are used, among other things, to impose constraints on the bit rate variations of compliant bitstreams.
HRD conformance checking may concern for example the following two types of bitstreams: The first such type of bitstream, called a Type I bitstream, is a NAL unit stream containing only the VCL NAL units and filler data NAL units for all access units in the bitstream. The second type of bitstream, called a Type II bitstream, may contain, in addition to the VCL NAL units and filler data NAL units for all access units in the bitstream, additional non-VCL NAL units other than filler data NAL units and/or syntax elements such as leading_zero_8bits, zero_byte, start_code_prefix_one_3bytes, and trailing_zero_8bits that form a byte stream from the NAL unit stream. Two types of HRD parameters (NAL HRD parameters and VCL HRD parameters) may be used. The HRD parameters may be indicated through video usability information included in the sequence parameter set syntax structure.
Buffering and picture timing parameters (e.g. included in sequence parameter sets and picture parameter sets referred to in the VCL NAL units and in buffering period and picture timing SEI messages) may be conveyed to the HRD, in a timely manner, either in the bitstream (by non-VCL NAL units), or by out-of-band means externally from the bitstream e.g. using a signalling mechanism, such as media parameters included in the media line of a session description formatted e.g. according to the Session Description Protocol (SDP). For the purpose of counting bits in the HRD, only the appropriate bits that are actually present in the bitstream may be counted. When the content of a non-VCL NAL unit is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the non-VCL NAL unit may or may not use the same syntax as would be used if the non-VCL NAL unit were in the bitstream.
The HRD may contain a coded picture buffer (CPB), an instantaneous decoding process, a decoded picture buffer (DPB), and output cropping.
The CPB may operate on a decoding unit basis. A decoding unit may be an access unit or it may be a subset of an access unit, such as an integer number of NAL units. The selection of the decoding unit may be indicated by an encoder in the bitstream.
The HRD may operate as follows. Data associated with decoding units that flow into the CPB according to a specified arrival schedule may be delivered by the Hypothetical Stream Scheduler (HSS). The arrival schedule may be determined by the encoder and indicated for example through picture timing SEI messages, and/or the arrival schedule may be derived for example based on a bitrate which may be indicated for example as part of HRD parameters in video usability information (which may be included in the sequence parameter set). The HRD parameters in video usability information may contain many sets of parameters, each for a different bitrate or delivery schedule. The data associated with each decoding unit may be removed and decoded instantaneously by the instantaneous decoding process at CPB removal times. A CPB removal time may be determined for example using an initial CPB buffering delay, which may be determined by the encoder and indicated for example through a buffering period SEI message, and differential removal delays indicated for each picture for example through picture timing SEI messages. Each decoded picture is placed in the DPB. A decoded picture may be removed from the DPB at the later of the DPB output time or the time that it becomes no longer needed for inter-prediction reference. Thus, the operation of the CPB of the HRD may comprise timing of bitstream arrival, timing of decoding unit removal and decoding of decoding units, whereas the operation of the DPB of the HRD may comprise removal of pictures from the DPB, picture output, and current decoded picture marking and storage.
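A highly simplified sketch of how CPB removal times could be derived from an initial buffering delay and per-unit differential delays is shown below. The real HRD expresses these delays in clock ticks signalled in buffering period and picture timing SEI messages; plain seconds and the function name used here are simplifying assumptions.

```python
# Simplified illustration of CPB removal timing.

def cpb_removal_times(initial_cpb_removal_delay, differential_delays):
    """Return the removal time of each decoding unit.

    initial_cpb_removal_delay -- delay before the first unit is removed
    differential_delays       -- per-unit removal delays relative to the
                                 removal time of the previous unit
    """
    times = []
    t = initial_cpb_removal_delay
    for delay in differential_delays:
        t += delay
        times.append(t)
    return times

# Example: first removal 0.5 s after arrival of the first bit, then one unit
# removed every 1/25 s.
print(cpb_removal_times(0.5, [0.0, 0.04, 0.04, 0.04]))
```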
The HRD may be used to check conformance of bitstreams and decoders.
Bitstream conformance requirements of the HRD may comprise for example the following and/or alike. The CPB is required not to overflow (relative to the size which may be indicated for example within HRD parameters of video usability information) or underflow (i.e. the removal time of a decoding unit cannot be smaller than the arrival time of the last bit of that decoding unit). The number of pictures in the DPB may be required to be smaller than or equal to a certain maximum number, which may be indicated for example in the sequence parameter set. All pictures used as prediction references may be required to be present in the DPB. It may be required that the interval for outputting consecutive pictures from the DPB is not smaller than a certain minimum.
Decoder conformance requirements of the HRD may comprise for example the following and/or alike. A decoder claiming conformance to a specific profile and level may be required to decode successfully all conforming bitstreams specified for decoder conformance provided that all sequence parameter sets and picture parameter sets referred to in the VCL NAL units, and appropriate buffering period and picture timing SEI messages are conveyed to the decoder, in a timely manner, either in the bitstream (by non-VCL NAL units), or by external means. There may be two types of conformance that can be claimed by a decoder: output timing conformance and output order conformance.
To check conformance of a decoder, test bitstreams conforming to the claimed profile and level may be delivered by a hypothetical stream scheduler (HSS) both to the HRD and to the decoder under test (DUT). All pictures output by the HRD may also be required to be output by the DUT and, for each picture output by the HRD, the values of all samples that are output by the DUT for the corresponding picture may also be required to be equal to the values of the samples output by the HRD.
For output timing decoder conformance, the HSS may operate e.g. with delivery schedules selected from those indicated in the HRD parameters of video usability information, or with "interpolated" delivery schedules. The same delivery schedule may be used for both the HRD and the DUT. For output timing decoder conformance, the timing (relative to the delivery time of the first bit) of picture output may be required to be the same for both the HRD and the DUT up to a fixed delay.
For output order decoder conformance, the HSS may deliver the bitstream to the DUT "by demand" from the DUT, meaning that the HSS delivers bits (in decoding order) only when the DUT requires more bits to proceed with its processing. The HSS may deliver the bitstream to the HRD by one of the schedules specified in the bitstream such that the bit rate and CPB size are restricted. The order of pictures output may be required to be the same for both the HRD and the DUT.
Scalable video coding refers to a coding structure where one bitstream can contain multiple representations of the content at different bitrates, resolutions and/or frame rates. In these cases the receiver can extract the desired representation depending on its characteristics (e.g. resolution that matches best with the resolution of the display of the device). Alternatively, a server or a network element can extract the portions of the bitstream to be transmitted to the receiver depending on e.g. the network characteristics or processing capabilities of the receiver.
A scalable bitstream may consist of a base layer providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. In order to improve coding efficiency for the enhancement layers, the coded representation of that layer may depend on the lower layers. For example, the motion and mode information of the enhancement layer can be predicted from lower layers. Similarly the pixel data of the lower layers can be used to create prediction for the enhancement layer(s).
Each scalable layer together with all its dependent layers is one representation of the video signal at a certain spatial resolution, temporal resolution and quality level. In this document, we refer to a scalable layer together with all of its dependent layers as a "scalable layer representation". The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at certain fidelity.
In some cases, data in an enhancement layer can be truncated after a certain location, or even at arbitrary positions, where each truncation position may include additional data representing increasingly enhanced visual quality. Such scalability is referred to as fine-grained (granularity) scalability (FGS). FGS was included in some draft versions of the SVC standard, but it was eventually excluded from the final SVC standard. FGS is subsequently discussed in the context of some draft versions of the SVC standard. The scalability provided by those enhancement layers that cannot be truncated is referred to as coarse-grained (granularity) scalability (CGS). It collectively includes the traditional quality (SNR) scalability and spatial scalability. The SVC standard supports the so-called medium-grained scalability (MGS), where quality enhancement pictures are coded similarly to SNR scalable layer pictures but indicated by high-level syntax elements similarly to FGS layer pictures, by having the quality_id syntax element greater than 0.
SVC uses an inter-layer prediction mechanism, wherein certain information can be predicted from layers other than the currently reconstructed layer or the next lower layer. Information that could be inter-layer predicted includes intra texture, motion and residual data. Inter-layer motion prediction includes the prediction of block coding mode, header information, etc., wherein motion from the lower layer may be used for prediction of the higher layer. In case of intra coding, a prediction from surrounding macroblocks or from co-located macroblocks of lower layers is possible. These prediction techniques do not employ information from earlier coded access units and hence, are referred to as intra prediction techniques. Furthermore, residual data from lower layers can also be employed for prediction of the current layer, which may be referred to as inter-layer residual prediction.
SVC specifies a concept known as single-loop decoding. It is enabled by using a constrained intra texture prediction mode, whereby the inter-layer intra texture prediction can be applied to macroblocks (MBs) for which the corresponding block of the base layer is located inside intra-MBs. At the same time, those intra-MBs in the base layer use constrained intra-prediction (e.g., having the syntax element constrained_intra_pred_flag equal to 1). In single-loop decoding, the decoder performs motion compensation and full picture reconstruction only for the scalable layer desired for playback (called the "desired layer" or the "target layer"), thereby greatly reducing decoding complexity. All of the layers other than the desired layer do not need to be fully decoded because all or part of the data of the MBs not used for inter-layer prediction (be it inter-layer intra texture prediction, inter-layer motion prediction or inter-layer residual prediction) is not needed for reconstruction of the desired layer. A single decoding loop is needed for decoding of most pictures, while a second decoding loop is selectively applied to reconstruct the base representations, which are needed as prediction references but not for output or display, and are reconstructed only for the so-called key pictures (for which "store_ref_base_pic_flag" is equal to 1).
The scalability structure in the SVC draft is characterized by three syntax elements: "temporal_id", "dependency_id" and "quality_id". The syntax element "temporal_id" is used to indicate the temporal scalability hierarchy or, indirectly, the frame rate. A scalable layer representation comprising pictures of a smaller maximum "temporal_id" value has a smaller frame rate than a scalable layer representation comprising pictures of a greater maximum "temporal_id". A given temporal layer typically depends on the lower temporal layers (i.e., the temporal layers with smaller "temporal_id" values) but does not depend on any higher temporal layer. The syntax element "dependency_id" is used to indicate the CGS inter-layer coding dependency hierarchy (which, as mentioned earlier, includes both SNR and spatial scalability). At any temporal level location, a picture of a smaller "dependency_id" value may be used for inter-layer prediction for coding of a picture with a greater "dependency_id" value. The syntax element "quality_id" is used to indicate the quality level hierarchy of an FGS or MGS layer. At any temporal location, and with an identical "dependency_id" value, a picture with "quality_id" equal to QL uses the picture with "quality_id" equal to QL-1 for inter-layer prediction. A coded slice with "quality_id" larger than 0 may be coded as either a truncatable FGS slice or a non-truncatable MGS slice.
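The three identifiers described above lend themselves to simple sub-bitstream extraction. The following sketch keeps only the data units needed for a target operating point; the dictionary-based NAL unit representation is an assumption for illustration, and real SVC extraction additionally considers inter-layer dependencies and discardability information.

```python
# Illustrative extraction of a scalable layer representation from an SVC-like
# bitstream. Each NAL unit is modelled as a dict carrying the three
# scalability identifiers discussed above.

def extract_layer(nal_units, max_temporal_id, max_dependency_id, max_quality_id):
    return [nal for nal in nal_units
            if nal["temporal_id"] <= max_temporal_id
            and nal["dependency_id"] <= max_dependency_id
            and nal["quality_id"] <= max_quality_id]

# Example: keep the base dependency layer at full frame rate, base quality.
bitstream = [
    {"temporal_id": 0, "dependency_id": 0, "quality_id": 0},
    {"temporal_id": 1, "dependency_id": 0, "quality_id": 0},
    {"temporal_id": 0, "dependency_id": 1, "quality_id": 1},
]
print(extract_layer(bitstream, max_temporal_id=2,
                    max_dependency_id=0, max_quality_id=0))
```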
For simplicity, all the data units (e.g. Network Abstraction Layer units or NAL units in the SVC context) in one access unit having identical value of "dependency_id" are referred to as a dependency unit or a dependency representation. Within one dependency unit, all the data units having identical value of "quality_id" are referred to as a quality unit or layer representation.
A base representation, also known as a decoded base picture, is a decoded picture resulting from decoding the Video Coding Layer (VCL) NAL units of a dependency unit having "quality_id" equal to 0 and for which the "store_ref_base_pic_flag" is set equal to 1. An enhancement representation, also referred to as a decoded picture, results from the regular decoding process in which all the layer representations that are present for the highest dependency representation are decoded.
As mentioned earlier, CGS includes both spatial scalability and SNR scalability. Spatial scalability is initially designed to support representations of video with different resolutions. For each time instance, VCL NAL units are coded in the same access unit and these VCL NAL units can correspond to different resolutions. During the decoding, a low resolution VCL NAL unit provides the motion field and residual which can be optionally inherited by the final decoding and reconstruction of the high resolution picture. When compared to older video compression standards, SVC's spatial scalability has been generalized to enable the base layer to be a cropped and zoomed version of the enhancement layer.
MGS quality layers are indicated with "quality_id" similarly as FGS quality layers. For each dependency unit (with the same "dependency_id"), there is a layer with "quality_id" equal to 0 and there can be other layers with "quality_id" greater than 0. These layers with "quality_id" greater than 0 are either MGS layers or FGS layers, depending on whether the slices are coded as truncatable slices. In the basic form of FGS enhancement layers, only inter-layer prediction is used.
Therefore, FGS enhancement layers can be truncated freely without causing any error propagation in the decoded sequence. However, the basic form of FGS suffers from low compression efficiency. This issue arises because only low-quality pictures are used for inter prediction references. It has therefore been proposed that FGS-enhanced pictures be used as inter prediction references. However, this may cause encoding-decoding mismatch, also referred to as drift, when some FGS data are discarded.
One feature of a draft SVC standard is that the FGS NAL units can be freely dropped or truncated, and a feature of the SVC standard is that MGS NAL units can be freely dropped (but cannot be truncated) without affecting the conformance of the bitstream. As discussed above, when those FGS or MGS data have been used for inter prediction reference during encoding, dropping or truncation of the data would result in a mismatch between the decoded pictures in the decoder side and in the encoder side. This mismatch is also referred to as drift.
To control drift due to the dropping or truncation of FGS or MGS data, SVC applied the following solution: In a certain dependency unit, a base representation (by decoding only the CGS picture with "quality_id" equal to 0 and all the dependent-on lower layer data) is stored in the decoded picture buffer. When encoding a subsequent dependency unit with the same value of "dependency_id," all of the NAL units, including FGS or MGS NAL units, use the base representation for inter prediction reference.
Consequently, all drift due to dropping or truncation of FGS or MGS NAL units in an earlier access unit is stopped at this access unit. For other dependency units with the same value of "dependency_id," all of the NAL units use the decoded pictures for inter prediction reference, for high coding efficiency.
Each NAL unit includes in the NAL unit header a syntax element "use_ref_base_pic_flag." When the value of this element is equal to 1, decoding of the NAL unit uses the base representations of the reference pictures during the inter prediction process. The syntax element "store_ref_base_pic_flag" specifies whether (when equal to 1) or not (when equal to 0) to store the base representation of the current picture for future pictures to use for inter prediction.
NAL units with "quality_id" greater than 0 do not contain syntax elements related to reference picture lists construction and weighted prediction, i.e., the syntax elements "num_ref_idx_lx_active_minus1" (x = 0 or 1), the reference picture list reordering syntax table, and the weighted prediction syntax table are not present. Consequently, the MGS or FGS layers have to inherit these syntax elements from the NAL units with "quality_id" equal to 0 of the same dependency unit when needed.
In SVC, a reference picture list consists of either only base representations (when "use_ref_base_pic_flag" is equal to 1) or only decoded pictures not marked as "base representation" (when "use_ref_base_pic_flag" is equal to 0), but never both at the same time.
In an H.264/AVC bitstream, coded pictures in one coded video sequence use the same sequence parameter set, and at any time instance during the decoding process, only one sequence parameter set is active. In SVC, coded pictures from different scalable layers may use different sequence parameter sets. If different sequence parameter sets are used, then, at any time instant during the decoding process, there may be more than one active sequence parameter set. In the SVC specification, the one for the top layer is denoted as the active sequence parameter set, while the rest are referred to as layer active sequence parameter sets. Any given active sequence parameter set remains unchanged throughout a coded video sequence in the layer in which the active sequence parameter set is referred to.
A scalable nesting SEI message has been specified in SVC. The scalable nesting SEI message provides a mechanism for associating SEI messages with subsets of a bitstream, such as indicated dependency representations or other scalable layers. A scalable nesting SEI message contains one or more SEI messages that are not scalable nesting SEI messages themselves. An SEI message contained in a scalable nesting SEI message is referred to as a nested SEI message. An SEI message not contained in a scalable nesting SEI message is referred to as a non-nested SEI message.
As indicated earlier, MVC is an extension of H.264/AVC.
Many of the definitions, concepts, syntax structures, semantics, and decoding processes of H.264/AVC apply also to MVC as such or with certain generalizations or constraints. Some definitions, concepts, syntax structures, semantics, and decoding processes of MVC are described in the following.
An access unit in MVC is defined to be a set of NAL units that are consecutive in decoding order and contain exactly one primary coded picture consisting of one or more view components. In addition to the primary coded picture, an access unit may also contain one or more redundant coded pictures, one auxiliary coded picture, or other NAL units not containing slices or slice data partitions of a coded picture. The decoding of an access unit results in one decoded picture consisting of one or more decoded view components, when decoding errors, bitstream errors or other errors which may affect the decoding do not occur. In other words, an access unit in MVC contains the view components of the views for one output time instance.
A view component in MVC is referred to as a coded representation of a view in a single access unit. Inter-view prediction may be used in MVC and refers to prediction of a view component from decoded samples of different view components of the same access unit. In MVC, inter-view prediction is realized similarly to inter prediction. For example, inter-view reference pictures are placed in the same reference picture list(s) as reference pictures for inter prediction, and a reference index as well as a motion vector are coded or inferred similarly for inter-view and inter reference pictures.
An anchor picture is a coded picture in which all slices may reference only slices within the same access unit, i.e., inter-view prediction may be used, but no inter prediction is used, and all following coded pictures in output order do not use inter prediction from any picture prior to the coded picture in decoding order. Inter-view prediction may be used for IDR view components that are part of a non-base view. A base view in MVC is a view that has the minimum value of view order index in a coded video sequence. The base view can be decoded independently of other views and does not use inter-view prediction. The base view can be decoded by H.264/AVC decoders supporting only the single-view profiles, such as the Baseline Profile or the High Profile of H.264/AVC.
In the MVC standard, many of the sub-processes of the MVC decoding process use the respective sub-processes of the H.264/AVC standard by replacing the terms "picture", "frame", and "field" in the sub-process specification of the H.264/AVC standard by "view component", "frame view component", and "field view component", respectively.
Likewise, the terms "picture", "frame", and "field" are often used in the following to mean "view component", "frame view component", and "field view component", respectively.
As mentioned earlier, non-base views of MVC bitstreams may refer to a subset sequence parameter set NAL unit. A subset sequence parameter set for MVC includes a base SPS data structure and a sequence parameter set MVC extension data structure. In MVC, coded pictures from different views may use different sequence parameter sets. An SPS in MVC (specifically the sequence parameter set MVC extension part of the SPS in MVC) can contain the view dependency information for inter-view prediction. This may be used for example by signaling-aware media gateways to construct the view dependency tree.
In the context of multiview video coding, view order index may be defined as an index that indicates the decoding or bitstream order of view components in an access unit. In MVC, the inter-view dependency relationships are indicated in a sequence parameter set MVC extension, which is included in a sequence parameter set. According to the MVC standard, all sequence parameter set MVC extensions that are referred to by a coded video sequence are required to be identical. The following excerpt of the sequence parameter set MVC extension provides further details on the way inter-view dependency relationships are indicated in MVC.

    seq_parameter_set_mvc_extension( ) {                             C   Descriptor
        num_views_minus1                                             0   ue(v)
        for( i = 0; i <= num_views_minus1; i++ )
            view_id[ i ]                                             0   ue(v)
        for( i = 1; i <= num_views_minus1; i++ ) {
            num_anchor_refs_l0[ i ]                                  0   ue(v)
            for( j = 0; j < num_anchor_refs_l0[ i ]; j++ )
                anchor_ref_l0[ i ][ j ]                              0   ue(v)
            num_anchor_refs_l1[ i ]                                  0   ue(v)
            for( j = 0; j < num_anchor_refs_l1[ i ]; j++ )
                anchor_ref_l1[ i ][ j ]                              0   ue(v)
        }
        for( i = 1; i <= num_views_minus1; i++ ) {
            num_non_anchor_refs_l0[ i ]                              0   ue(v)
            for( j = 0; j < num_non_anchor_refs_l0[ i ]; j++ )
                non_anchor_ref_l0[ i ][ j ]                          0   ue(v)
            num_non_anchor_refs_l1[ i ]                              0   ue(v)
            for( j = 0; j < num_non_anchor_refs_l1[ i ]; j++ )
                non_anchor_ref_l1[ i ][ j ]                          0   ue(v)
        }
        ...
    }

In the MVC decoding process, the variable VOIdx may represent the view order index of the view identified by view_id (which may be obtained from the MVC NAL unit header of the coded slice being decoded) and may be set equal to the value of i for which the syntax element view_id[ i ] included in the referred subset sequence parameter set is equal to view_id.
The semantics of the sequence parameter set MVC extension may be specified as follows. num_views_minus1 plus 1 specifies the maximum number of coded views in the coded video sequence. The actual number of views in the coded video sequence may be less than num_views_minus1 plus 1. view_id[ i ] specifies the view_id of the view with VOIdx equal to i. num_anchor_refs_l0[ i ] specifies the number of view components for inter-view prediction in the initial reference picture list RefPicList0 in decoding anchor view components with VOIdx equal to i. anchor_ref_l0[ i ][ j ] specifies the view_id of the j-th view component for inter-view prediction in the initial reference picture list RefPicList0 in decoding anchor view components with VOIdx equal to i.
num_anchor_refs_l1[ i ] specifies the number of view components for inter-view prediction in the initial reference picture list RefPicList1 in decoding anchor view components with VOIdx equal to i. anchor_ref_l1[ i ][ j ] specifies the view_id of the j-th view component for inter-view prediction in the initial reference picture list RefPicList1 in decoding an anchor view component with VOIdx equal to i. num_non_anchor_refs_l0[ i ] specifies the number of view components for inter-view prediction in the initial reference picture list RefPicList0 in decoding non-anchor view components with VOIdx equal to i. non_anchor_ref_l0[ i ][ j ] specifies the view_id of the j-th view component for inter-view prediction in the initial reference picture list RefPicList0 in decoding non-anchor view components with VOIdx equal to i. num_non_anchor_refs_l1[ i ] specifies the number of view components for inter-view prediction in the initial reference picture list RefPicList1 in decoding non-anchor view components with VOIdx equal to i.
non_anchor_ref_l1[ i ][ j ] specifies the view_id of the j-th view component for inter-view prediction in the initial reference picture list RefPicList1 in decoding non-anchor view components with VOIdx equal to i. For any particular view with view_id equal to vId1 and VOIdx equal to vOIdx1 and another view with view_id equal to vId2 and VOIdx equal to vOIdx2, when vId2 is equal to the value of one of non_anchor_ref_l0[ vOIdx1 ][ j ] for all j in the range of 0 to num_non_anchor_refs_l0[ vOIdx1 ], exclusive, or one of non_anchor_ref_l1[ vOIdx1 ][ j ] for all j in the range of 0 to num_non_anchor_refs_l1[ vOIdx1 ], exclusive, vId2 is also required to be equal to the value of one of anchor_ref_l0[ vOIdx1 ][ j ] for all j in the range of 0 to num_anchor_refs_l0[ vOIdx1 ], exclusive, or one of anchor_ref_l1[ vOIdx1 ][ j ] for all j in the range of 0 to num_anchor_refs_l1[ vOIdx1 ], exclusive. The inter-view dependency for non-anchor view components is a subset of that for anchor view components.
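The subset constraint stated above can be checked with a small sketch such as the following. The dictionary-based representation of the decoded sequence parameter set MVC extension and the function name are assumptions made for illustration.

```python
# Illustrative check of the MVC constraint that the inter-view references of
# non-anchor view components form a subset of those of anchor view components.
# sps is a dict holding the decoded sequence parameter set MVC extension,
# with one list of reference view_ids per view order index.

def non_anchor_refs_are_subset(sps, vo_idx):
    anchor = set(sps["anchor_ref_l0"][vo_idx]) | set(sps["anchor_ref_l1"][vo_idx])
    non_anchor = (set(sps["non_anchor_ref_l0"][vo_idx]) |
                  set(sps["non_anchor_ref_l1"][vo_idx]))
    return non_anchor.issubset(anchor)
```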
In MVC, an operation point may be defined as follows: An operation point is identified by a temporal_id value representing the target temporal level and a set of view_id values representing the target output views. One operation point is associated with a bitstream subset, which consists of the target output views and all other views the target output views depend on, that is derived using the sub-bitstream extraction process with tIdTarget equal to the temporal_id value and viewIdTargetList consisting of the set of view_id values as inputs. More than one operation point may be associated with the same bitstream subset. When "an operation point is decoded", a bitstream subset corresponding to the operation point may be decoded and subsequently the target output views may be output.
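A rough sketch of deriving the bitstream subset for an operation point is shown below: first resolve the set of views required by the target output views, then filter the NAL units by temporal_id and view_id. The helper names and the dependency representation are assumptions; the normative extraction process contains further details not modelled here.

```python
# Illustrative sub-bitstream extraction for an MVC operation point.
# view_dependencies maps each view_id to the view_ids it directly depends on;
# NAL units are modelled as dicts carrying view_id and temporal_id fields.

def required_views(target_views, view_dependencies):
    needed = set()
    stack = list(target_views)
    while stack:
        v = stack.pop()
        if v not in needed:
            needed.add(v)
            stack.extend(view_dependencies.get(v, ()))
    return needed

def extract_operation_point(nal_units, t_id_target, view_id_target_list,
                            view_dependencies):
    views = required_views(view_id_target_list, view_dependencies)
    return [nal for nal in nal_units
            if nal["temporal_id"] <= t_id_target and nal["view_id"] in views]
```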
In SVC and MVC, a prefix NAL unit may be defined as a NAL unit that immediately precedes in decoding order a VCL NAL unit for base layer/view coded slices. The NAL unit that immediately succeeds the prefix NAL unit in decoding order may be referred to as the associated NAL unit. The prefix NAL unit contains data associated with the associated NAL unit, which may be considered to be part of the associated NAL unit.
The prefix NAL unit may be used to include syntax elements that affect the decoding of the base layer/view coded slices, when the SVC or MVC decoding process is in use. An H.264/AVC base layer/view decoder may omit the prefix NAL unit in its decoding process.
In scalable multiview coding, the same bitstream may contain coded view components of multiple views and at least some coded view components may be coded using quality and/or spatial scalability.
There are ongoing standardization activities for depth-enhanced video coding where both texture views and depth views are coded.
A texture view refers to a view that represents ordinary video content, for example has been captured using an ordinary camera, and is usually suitable for rendering on a display. A texture view typically comprises pictures having three components, one luma component and two chroma components. In the following, a texture picture typically comprises all its component pictures or color components unless otherwise indicated for example with terms luma texture picture and chroma texture picture.
Ranging information for a particular view represents distance information of a texture sample from the camera sensor, disparity or parallax information between a texture sample and a respective texture sample in another view, or similar information.
Ranging information of a real-world 3D scene depends on the content and may vary for example from 0 to infinity. Different types of representation of such ranging information can be utilized. Below, some non-limiting examples of such representations are given.
Depth value. Real-world 3D scene ranging information can be directly represented with a depth value (Z) in a fixed number of bits in a floating point or fixed point arithmetic representation. This representation (type and accuracy) can be content and application specific. The Z value can be converted to a depth map and disparity as shown below.
Depth map value. To represent a real-world depth value with a finite number of bits, e.g. 8 bits, depth values Z may be non-linearly quantized to produce depth map values d as shown below, and the dynamic range of the represented Z is limited with the depth range parameters Znear/Zfar:

    d = (2^N - 1) * ( (1/Z - 1/Zfar) / (1/Znear - 1/Zfar) )

In such a representation, N is the number of bits to represent the quantization levels for the current depth map, and Znear and Zfar are the closest and farthest real-world depth values, corresponding to depth map values (2^N - 1) and 0, respectively. The equation above could be adapted for any number of quantization levels by replacing 2^N with the number of quantization levels. To perform forward and backward conversion between depth and depth map, the depth map parameters (Znear/Zfar, the number of bits N to represent quantization levels) may be needed.
Disparity map value. Every sample of the ranging data can be represented as a disparity value or vector (difference) of a current image sample location between two given stereo views. For conversion from depth to disparity, certain camera setup parameters (namely the focal length f and the translation distance l between the two cameras) may be required:

    D = (f * l) / Z

Disparity D may be calculated out of the depth map value v with the following equation:

    D = (f * l) * ( (v / (2^N - 1)) * (1/Znear - 1/Zfar) + 1/Zfar )

Disparity D may also be calculated out of the depth map value v with the following equation: D = (w * v + o) >> n, where w is a scale factor, o is an offset value, and n is a shift parameter that depends on the required accuracy of the disparity vectors. An independent set of parameters w, o and n may be required for every pair of views.
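The conversions above can be illustrated with the following sketch, assuming an 8-bit depth map; the parameter names mirror those in the equations, and the example values at the end are arbitrary illustrative numbers.

```python
# Illustrative depth/disparity conversions following the equations above.
# n_bits is the bit depth of the depth map, z_near/z_far the depth range
# limits, f the focal length and l the translation (baseline) between the
# two cameras.

def depth_to_depth_map(z, z_near, z_far, n_bits=8):
    levels = (1 << n_bits) - 1
    d = levels * ((1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far))
    return int(round(d))

def depth_map_to_disparity(v, z_near, z_far, f, l, n_bits=8):
    levels = (1 << n_bits) - 1
    return f * l * ((v / levels) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)

def depth_map_to_disparity_fixed(v, w, o, n):
    # Fixed-point variant: D = (w * v + o) >> n
    return (w * v + o) >> n

# Example: an object 2 m away with a 0.5 m to 10 m depth range.
d = depth_to_depth_map(2.0, 0.5, 10.0)
print(d, depth_map_to_disparity(d, 0.5, 10.0, f=0.005, l=0.06))
```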
Other forms of ranging information representation that take into consideration real-world 3D scenery can be deployed.
A depth view refers to a view that represents distance information of a texture sample from the camera sensor, disparity or parallax information between a texture sample and a respective texture sample in another view, or similar information. A depth view may comprise depth pictures (a.k.a. depth maps) having one component, similar to the luma component of texture views. A depth map is an image with per-pixel depth information or similar. For example, each sample in a depth map represents the distance of the respective texture sample or samples from the plane on which the camera lies. In other words, if the z axis is along the shooting axis of the cameras (and hence orthogonal to the plane on which the cameras lie), a sample in a depth map represents the value on the z axis. The semantics of depth map values may for example include the following: Each luma sample value in a coded depth view component represents an inverse of real-world distance (Z) value, i.e. 1/Z, normalized in the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation. The normalization may be done in a manner where the quantization of 1/Z is uniform in terms of disparity.
Each luma sample value in a coded depth view component represents an inverse of real-world distance (Z) value, i.e. 1/Z, which is mapped to the dynamic range of the luma samples, such as to the range of 0 to 255, inclusive, for 8-bit luma representation, using a mapping function f(1/Z) or table, such as a piece-wise linear mapping. In other words, depth map values result from applying the function f(1/Z).
Each luma sample value in a coded depth view component represents a real-world distance (Z) value normalized in the dynamic range of the luma samples such as to the range of 0 to 255, inclusive, for 8-bit luma representation.
Each luma sample value in a coded depth view component represents a disparity or parallax value from the present depth view to another indicated or derived depth view or view position.
The semantics of depth map values may be indicated in the bitstream for example within a video parameter set syntax structure, a sequence parameter set syntax structure, a video usability information syntax structure, a picture parameter set syntax structure, a camera/depth/adaptation parameter set syntax structure, a supplemental enhancement information message, or anything alike.
While phrases such as depth view, depth view component, depth picture and depth map are used to describe various embodiments, it is to be understood that any semantics of depth map values may be used in various embodiments including but not limited to the ones described above. For example, embodiments of the invention may be applied for depth pictures where sample values indicate disparity values.
An encoding system or any other entity creating or modifying a bitstream including coded depth maps may create and include information on the semantics of depth samples and on the quantization scheme of depth samples into the bitstream. Such information on the semantics of depth samples and on the quantization scheme of depth samples may be for example included in a video parameter set structure, in a sequence parameter set structure, or in an SEI message.
Depth-enhanced video refers to texture video having one or more views associated with depth video having one or more depth views. A number of approaches may be used for representing depth-enhanced video, including the use of video plus depth (V+D), multiview video plus depth (MVD), and layered depth video (LDV). In the video plus depth (V+D) representation, a single view of texture and the respective view of depth are represented as sequences of texture pictures and depth pictures, respectively. The MVD representation contains a number of texture views and respective depth views. In the LDV representation, the texture and depth of the central view are represented conventionally, while the texture and depth of the other views are partially represented and cover only the dis-occluded areas required for correct view synthesis of intermediate views.
A texture view component may be defined as a coded representation of the texture of a view in a single access unit. A texture view component in a depth-enhanced video bitstream may be coded in a manner that is compatible with a single-view texture bitstream or a multiview texture bitstream so that a single-view or multi-view decoder can decode the texture views even if it has no capability to decode depth views. For example, an H.264/AVC decoder may decode a single texture view from a depth-enhanced H.264/AVC bitstream. A texture view component may alternatively be coded in a manner that a decoder capable of single-view or multi-view texture decoding, such as an H.264/AVC or MVC decoder, is not able to decode the texture view component for example because it uses depth-based coding tools. A depth view component may be defined as a coded representation of the depth of a view in a single access unit. A view component pair may be defined as a texture view component and a depth view component of the same view within the same access unit.
Depth-enhanced video may be coded in a manner where texture and depth are coded independently of each other. For example, texture views may be coded as one MVC bitstream and depth views may be coded as another MVC bitstream. Depth-enhanced video may also be coded in a manner where texture and depth are jointly coded. In a form of joint coding of texture and depth views, some decoded samples of a texture picture or data elements for decoding of a texture picture are predicted or derived from some decoded samples of a depth picture or data elements obtained in the decoding process of a depth picture. Alternatively or in addition, some decoded samples of a depth picture or data elements for decoding of a depth picture are predicted or derived from some decoded samples of a texture picture or data elements obtained in the decoding process of a texture picture. In another option, coded video data of texture and coded video data of depth are not predicted from each other or one is not coded/decoded on the basis of the other one, but coded texture and depth views may be multiplexed into the same bitstream in the encoding and demultiplexed from the bitstream in the decoding. In yet another option, while coded video data of texture is not predicted from coded video data of depth e.g. below the slice layer, some of the high level coding structures of texture views and depth views may be shared or predicted from each other. For example, a slice header of a coded depth slice may be predicted from a slice header of a coded texture slice. Moreover, some of the parameter sets may be used by both coded texture views and coded depth views.
Depth-enhanced video formats enable generation of virtual views or pictures at camera positions that are not represented by any of the coded views. Generally, any depth image-based rendering (DIBR) algorithm may be used for synthesizing views.
A simplified model of a DIBR-based 3DV system is shown in Figure 8. The input of a 3D video codec comprises a stereoscopic video and corresponding depth information with stereoscopic baseline b0. Then the 3D video codec synthesizes a number of virtual views between two input views with baseline (bi < b0). DIBR algorithms may also enable extrapolation of views that are outside the two input views and not in between them. Similarly, DIBR algorithms may enable view synthesis from a single view of texture and the respective depth view. However, in order to enable DIBR-based multiview rendering, texture data should be available at the decoder side along with the corresponding depth data.
In such a 3DV system, depth information is produced at the encoder side in the form of depth pictures (also known as depth maps) for texture views.
Depth information can be obtained by various means. For example, depth of the 3D scene may be computed from the disparity registered by capturing cameras or colour image sensors. A depth estimation approach, which may also be referred to as stereo matching, takes a stereoscopic view as an input and computes local disparities between the two offset images of the view. Since the two input views represent different viewpoints or perspectives, the parallax creates a disparity between the relative positions of scene points on the imaging planes depending on the distance of the points. A target of stereo matching is to extract those disparities by finding or detecting the corresponding points between the images. Several approaches for stereo matching exist. For example, in a block or template matching approach each image is processed pixel by pixel in overlapping blocks, and for each block of pixels a horizontally localized search for a matching block in the offset image is performed. Once a pixel-wise disparity is computed, the corresponding depth value z is calculated by equation (1):

    z = f * b / (d + Δd)        (1)

where f is the focal length of the camera and b is the baseline distance between cameras, as shown in Figure 9. Further, d may be considered to refer to the disparity observed between the two cameras or the disparity estimated between corresponding pixels in the two cameras. The camera offset Δd may be considered to reflect a possible horizontal misplacement of the optical centers of the two cameras or a possible horizontal cropping in the camera frames due to preprocessing. However, since the algorithm is based on block matching, the quality of a depth-through-disparity estimation is content dependent and very often not accurate. For example, no straightforward solution for depth estimation is possible for image fragments that are featuring very smooth areas with no textures or a large level of noise.
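Equation (1) can be applied per pixel once a disparity map has been estimated; a minimal sketch (guarding against a zero denominator, with illustrative function and parameter names) is given below.

```python
# Illustrative per-pixel conversion from estimated disparity to depth using
# equation (1): z = f * b / (d + delta_d). f is the focal length, b the
# baseline between the cameras and delta_d a possible camera offset.

def disparity_to_depth(disparity_map, f, b, delta_d=0.0):
    depth_map = []
    for row in disparity_map:
        depth_row = []
        for d in row:
            denom = d + delta_d
            depth_row.append(f * b / denom if denom != 0 else float("inf"))
        depth_map.append(depth_row)
    return depth_map

# Example: a 2x2 disparity map in pixels, focal length in pixels, baseline in
# metres -> depth values in metres.
print(disparity_to_depth([[64, 32], [16, 8]], f=1000.0, b=0.1))
```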
Alternatively or in addition to the above-described stereo view depth estimation, the depth value may be obtained using the time-of-flight (TOF) principle for example by using a camera which may be provided with a light source, for example an infrared emitter, for illuminating the scene. Such an illuminator may be arranged to produce an intensity modulated electromagnetic emission for a frequency between e.g. 10-100 MHz, which may require LEDs or laser diodes to be used. Infrared light may be used to make the illumination unobtrusive. The light reflected from objects in the scene is detected by an image sensor, which may be modulated synchronously at the same frequency as the illuminator. The image sensor may be provided with optics; a lens gathering the reflected light and an optical bandpass filter for passing only the light with the same wavelength as the illuminator, thus helping to suppress background light. The image sensor may measure for each pixel the time the light has taken to travel from the illuminator to the object and back. The distance to the object may be represented as a phase shift in the illumination modulation, which can be determined from the sampled data simultaneously for each pixel in the scene.
Alternatively or in addition to the above-described stereo view depth estimation and/or TOF-principle depth sensing, depth values may be obtained using a structured light approach which may operate for example approximately as follows. A light emitter, such as an infrared laser emitter or an infrared LED emitter, may emit light that may have a certain direction in a 3D space (e.g. follow a raster-scan or a pseudo-random scanning order) and/or position within an array of light emitters as well as a certain pattern, e.g. a certain wavelength and/or amplitude pattern. The emitted light is reflected back from objects and may be captured using a sensor, such as an infrared image sensor. The image/signals obtained by the sensor may be processed in relation to the direction of the emitted light as well as the pattern of the emitted light to detect a correspondence between the received signal and the direction/position of the emitted light as well as the pattern of the emitted light for example using a triangulation principle. From this correspondence a distance and a position of a pixel may be concluded.
It is to be understood that the above-described depth estimation and sensing methods are provided as non-limiting examples and embodiments may be realized with the described or any other depth estimation and sensing methods and apparatuses.
Disparity or parallax maps, such as parallax maps specified in ISO/IEC International Standard 23002-3, may be processed similarly to depth maps. Depth and disparity have a straightforward correspondence and they can be computed from each other through a mathematical equation.
Texture views and depth views may be coded into a single bitstream where some of the texture views may be compatible with one or more video standards such as H.264/AVC and/or MVC. In other words, a decoder may be able to decode some of the texture views of such a bitstream and can omit the remaining texture views and depth views.
An amendment has been specified for H.264/AVC for depth map coding. The amendment is called MVC extension for inclusion of depth maps and may be referred to as MVC+D. The MVC+D amendment specifies the encapsulation of texture views and depth views into the same bitstream in a manner that the texture views remain compatible with H.264/AVC and MVC so that an MVC decoder is able to decode all texture views of an MVC+D bitstream and an H.264/AVC decoder is able to decode the base texture view of an MVC+D bitstream. Furthermore, the VCL NAL units of the depth views use identical syntax, semantics, and decoding process to those of texture views below the NAL unit header.
Development of another amendment for H.264/AVC is ongoing at the time of writing this patent application. This amendment, referred to as 3D-AVC, requires at least one texture view to be H.264/AVC compatible while further texture views may be (but need not be) MVC compatible.
An encoder that encodes one or more texture and depth views into a single H.264/AVC and/or MVC compatible bitstream may be called a 3DV-ATM encoder. Bitstreams generated by such an encoder may be referred to as 3DV-ATM bitstreams and may be either MVC+D bitstreams or 3D-AVC bitstreams. The texture views of 3DV-ATM bitstreams are compatible with H.264/AVC (for the base view) and may be compatible with MVC (always in the case of MVC+D bitstreams and as selected by the encoder in 3D-AVC bitstreams). The depth views of 3DV-ATM bitstreams may be compatible with MVC+D (always in the case of MVC+D bitstreams and as selected by the encoder in 3D-AVC bitstreams). 3D-AVC bitstreams can include a selected number of AVC/MVC compatible texture views. Furthermore, 3D-AVC bitstreams can include a selected number of depth views that are coded using the coding tools of the AVC/MVC standard only. The other texture views (a.k.a. enhanced texture views) of a 3D-AVC bitstream may be jointly predicted from the texture and depth views and/or the other depth views of a 3D-AVC bitstream may use depth coding methods not presently included in the AVC/MVC/MVC+D standards. A decoder capable of decoding all views from 3DV-ATM bitstreams may be called a 3DV-ATM decoder.
Inter-component prediction may be defined to comprise prediction of syntax element values, sample values, variable values used in the decoding process, or anything alike from a component picture of one type to a component picture of another type. For example, inter-component prediction may comprise prediction of a texture view component from a depth view component, or vice versa.
Figure 26 shows an example processing flow for depth map coding for example in 3DV-ATM. In the figure, joint texture-depth map filtering within a coding loop of compression algorithms, for example within H.264/AVC or MVC coding loops, is shown. While in Figure 26 inter-component prediction from depth to texture is used for predicting or inferring parameters for an in-loop filter, it needs to be understood that Figure 26 is provided merely as an example and other coding structures and processing flows could use other types of inter-component prediction from depth to texture and/or texture to depth, such as motion information prediction. In a joint texture-depth map filtering approach, the filtering of depth images may be applied in the coding loop. In this approach, edge-preserving structural information extracted from the textural/color information may be used to configure the filtering operations over the depth map data.
The filtered depth images may be stored as reference pictures for inter and inter-view prediction of other depth images. The embodiments may be realized in hybrid video coding schemes, e.g. within H.264/AVC, MVC or a future video coding standard which is based on a hybrid video coding approach, or any other coding approach where inter prediction (also known as motion estimation and motion compensation) is used.
Identical or approximately identical joint texture/depth map filtering may be implemented at the decoder side. Alternatively, the decoder may implement another joint texture/depth map filtering that produces identical or close to identical results compared to the encoder filter. This may prevent propagation of prediction loop error in the depth map images.
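One possible, non-normative realization of such texture-guided depth filtering is a cross-bilateral style filter in which the range weight is computed from the co-located texture (luma) samples, as sketched below. The actual in-loop filter of an encoder or decoder may differ; all names and parameters in the sketch are illustrative assumptions.

#include <math.h>

/* Filter one depth sample at (x, y) using a (2r+1)x(2r+1) window.
 * The spatial weight depends on the distance from (x, y) and the range
 * weight on the similarity of the co-located *texture* (luma) samples,
 * so depth edges that coincide with texture edges tend to be preserved. */
static unsigned char filter_depth_sample(const unsigned char *depth,
                                         const unsigned char *luma,
                                         int width, int height,
                                         int x, int y, int r,
                                         double sigma_s, double sigma_r)
{
    double num = 0.0, den = 0.0;
    int c = luma[y * width + x];                 /* guiding texture sample */
    for (int cy = y - r; cy <= y + r; cy++) {
        for (int cx = x - r; cx <= x + r; cx++) {
            if (cx < 0 || cy < 0 || cx >= width || cy >= height)
                continue;
            double ds = (cx - x) * (cx - x) + (cy - y) * (cy - y);
            double dr = (double)(luma[cy * width + cx] - c);
            double w = exp(-ds / (2.0 * sigma_s * sigma_s))
                     * exp(-(dr * dr) / (2.0 * sigma_r * sigma_r));
            num += w * depth[cy * width + cx];
            den += w;
        }
    }
    return (unsigned char)(den > 0.0 ? num / den + 0.5 : depth[y * width + x]);
}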
Figure 26 comprises a coding loop for encoding textural data and a coding loop for encoding depth map data. In the texture coding loop (top loop), texture video data X is input to the encoder. The texture video data may be multiview video data such as stereo video for 3D video viewing. In the encoder, the video data X is input to a motion compensated prediction block MCP and a motion estimation block ME for use in prediction. Together, these blocks create a prediction X' for the next video frame. Motion estimation information (motion vectors) is also sent to the entropy encoder.
This predicted data X' is subtracted from the input video (block by block), and the residual error is then transformed in block T, for example with a discrete cosine transform, and quantized (block Q). The transformed and quantized residual error is one input to the entropy encoder. The transformed and quantized residual error from point 1011 may be used as an input for controlling the loop filter of depth map coding. The transformed and quantized residual error is then dequantized (block Q^-1) and an inverse transform is applied (block T^-1).
Predicted data X' is then added to this residual error, and a loop filter is applied e.g. to reduce blocking artifacts. The loop filtered image is then given to the ME and MCP blocks as input for the prediction of the next image. The loop filtered image at point 1012 and motion estimation information at point 1014 may be given to the feature extractor as input. In some embodiments, the image 1016 prior to loop filtering may be given to the feature extractor as input in addition to or instead of the loop filtered image.
The entropy encoder encodes the residual error data and the motion estimation data in an efficient manner e.g. by applying variable-length coding. The encoded data may be transmitted to a decoder or stored for playback, for example.
The encoding of depth map data may happen in a coding loop with similar elements to the one for texture video data. The depth map data Z undergoes motion estimation ME and motion compensated prediction MCP, and the residual error is transformed and quantized, dequantized and inverse transformed, and finally loop filtered. The loop filter, or another filter such as a post-filter or a pre-filter, is adapted and/or controlled by using parameters and features from the texture encoding points 1011, 1012 and 1014, and/or others. A feature extractor may be used to extract features. The feature extractor may be a separate block, or it may be a block in the texture coding loop, or it may be a block in the depth map coding loop. The loop filter provides a smoother depth map (after transform T, quantization Q, dequantization Q^-1 and inverse transform T^-1) that may serve as a better basis for motion compensated prediction. The depth map information (residual error and motion estimation information) is sent to the entropy encoder. The encoded data may be transmitted to a decoder or stored for playback, for example.
Feature extraction may be performed after the coding/decoding of a texture picture or after coding/decoding a part of a texture picture, such as a slice or a macroblock. Similarly, joint texture/depth map filtering may be done after coding/decoding a depth picture or a part of it, such as a slice or a macroblock. Feature extraction in smaller units than a picture may enable parallelization of texture picture and depth picture coding/decoding. Joint texture/depth map filtering facilitates parallel processing and may enable the use of filtered picture areas for intra prediction.
In some depth-enhanced video coding schemes and bitstreams, such as MVC+D, depth views may refer to a differently structured sequence parameter set, such as a subset SPS NAL unit, than the sequence parameter set for texture views. For example, a sequence parameter set for depth views may include a sequence parameter set 3D video coding (3DVC) extension. When a different SPS structure is used for depth-enhanced video coding, the SPS may be referred to as a 3D video coding (3DVC) subset SPS or a 3DVC SPS, for example. From the syntax structure point of view, a 3DVC subset SPS may be a superset of an SPS for multiview video coding such as the MVC subset SPS.
A depth-enhanced multiview video bitstream, such as an MVC+D bitstream, may contain two types of operation points: multiview video operation points (e.g. MVC operation points for MVC+D bitstreams) and depth-enhanced operation points.
Multiview video operation points consisting of texture view components only may be specified by an SPS for multiview video, for example a sequence parameter set MVC extension included in an SPS referred to by one or more texture views. Depth-enhanced operation points may be specified by an SPS for depth-enhanced video, for example a sequence parameter set MVC or 3DVC extension included in an SPS referred to by one or more depth views.
A depth-enhanced multiview video bitstream may contain or be associated with multiple sequence parameter sets, e.g. one for the base texture view, another one for the non-base texture views, and a third one for the depth views. For example, an MVC+D bitstream may contain one SPS NAL unit (with an SPS identifier equal to e.g. 0), one MVC subset SPS NAL unit (with an SPS identifier equal to e.g. 1), and one 3DVC subset SPS NAL unit (with an SPS identifier equal to e.g. 2). The first one is distinguished from the other two by NAL unit type, while the latter two have different profiles, i.e., one of them indicates an MVC profile and the other one indicates an MVC+D profile.
The coding and decoding order of texture view components and depth view components may be indicated for example in a sequence parameter set. For example, the following syntax of a sequence parameter set 3DVC extension is used in the draft
3D-AVC specification (MPEG N12732):
seq_parameter_set_3dvc_extension( ) {                                  C    Descriptor
    depth_info_present_flag                                            0    u(1)
    if( depth_info_present_flag ) {
        for( i = 0; i <= num_views_minus1; i++ )
            depth_preceding_texture_flag[ i ]                          0    u(1)
    }
}

The semantics of depth_preceding_texture_flag[ i ] may be specified as follows.
depth_preceding_texture_flag[ i ] specifies the decoding order of depth view components in relation to texture view components. depth_preceding_texture_flag[ i ] equal to 1 indicates that the depth view component of the view with view_idx equal to i precedes the texture view component of the same view in decoding order in each access unit that contains both the texture and depth view components.
depth_preceding_texture_flag[ i ] equal to 0 indicates that the texture view component of the view with view_idx equal to i precedes the depth view component of the same view in decoding order in each access unit that contains both the texture and depth view components.
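A simplified, non-normative parsing sketch of the seq_parameter_set_3dvc_extension( ) syntax shown above is given below. Only the u(1) coded flags are read; the enclosing subset SPS, entropy-coded syntax elements and error handling are omitted, and the bit reader and all names are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *data;   /* RBSP bytes of the subset SPS */
    size_t bit_pos;        /* current bit position         */
} BitReader;

/* Read one bit, most significant bit of each byte first (u(1)). */
static int read_u1(BitReader *br)
{
    size_t byte = br->bit_pos >> 3;
    int shift = 7 - (int)(br->bit_pos & 7);
    br->bit_pos++;
    return (br->data[byte] >> shift) & 1;
}

/* num_views_minus1 is assumed to have been decoded earlier from the
 * subset SPS; depth_preceding_texture_flag must have room for
 * num_views_minus1 + 1 entries. Returns depth_info_present_flag. */
static int parse_sps_3dvc_extension(BitReader *br, int num_views_minus1,
                                    int *depth_preceding_texture_flag)
{
    int depth_info_present_flag = read_u1(br);
    if (depth_info_present_flag) {
        for (int i = 0; i <= num_views_minus1; i++)
            depth_preceding_texture_flag[i] = read_u1(br);
    }
    return depth_info_present_flag;
}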
The depth representation information SEI message of a draft MVC+D standard (JCT3V document JCT2-A1001), presented in the following, may be regarded as an example of how information about depth representation format may be represented. The syntax of the SEI message is as follows:

depth_representation_information( payloadSize ) {                          C    Descriptor
    depth_representation_type                                              5    ue(v)
    all_views_equal_flag                                                   5    u(1)
    if( all_views_equal_flag == 0 ) {
        num_views_minus1                                                   5    ue(v)
        numViews = num_views_minus1 + 1
    } else {
        numViews = 1
    }
    for( i = 0; i < numViews; i++ ) {
        depth_representation_base_view_id[ i ]                             5    ue(v)
    }
    if( depth_representation_type == 3 ) {
        depth_nonlinear_representation_num_minus1                               ue(v)
        depth_nonlinear_representation_num = depth_nonlinear_representation_num_minus1 + 1
        for( i = 1; i <= depth_nonlinear_representation_num; i++ )
            depth_nonlinear_representation_model[ i ]                           ue(v)
    }
}

The semantics of the depth representation SEI message may be specified as follows.
The syntax elements in the depth representation information SEI message specify various depth representations for depth views for the purpose of processing decoded texture and depth view components prior to rendering on a 3D display, such as view synthesis. It is recommended, when present, that the SEI message is associated with an IDR access unit for the purpose of random access. The information signaled in the SEI message applies to all the access units from the access unit the SEI message is associated with to the next access unit, in decoding order, containing an SEI message of the same type, exclusively, or to the end of the coded video sequence, whichever is earlier in decoding order.
Continuing the exemplary semantics of the depth representation SEI message:
depth_representation_type specifies the representation definition of luma pixels in coded frames of depth views as specified in the table below. In the table below, disparity specifies the horizontal displacement between two texture views and Z value specifies the distance from a camera.
depth_representation_type    Interpretation
0                            Each luma pixel value in coded frame of depth views represents an inverse of Z value normalized in range from 0 to 255
1                            Each luma pixel value in coded frame of depth views represents disparity normalized in range from 0 to 255
2                            Each luma pixel value in coded frame of depth views represents Z value normalized in range from 0 to 255
3                            Each luma pixel value in coded frame of depth views represents nonlinearly mapped disparity, normalized in range from 0 to 255
Continuing the exemplary semantics of the depth representation SEI message, all_views_equal_flag equal to 0 specifies that the depth representation base view may not be identical to the respective values for each view in the target views. all_views_equal_flag equal to 1 specifies that the depth representation base views are identical to the respective values for all target views. depth_representation_base_view_id[ i ] specifies the view identifier for the NAL unit of either the base view from which the disparity for the coded depth frame of the i-th view_id is derived (depth_representation_type equal to 1 or 3) or the base view whose optical axis is defined as the Z-axis for the coded depth frame of the i-th view_id (depth_representation_type equal to 0 or 2).
depth_nonlinear_representation_num_minus1 + 2 specifies the number of piecewise linear segments for mapping of depth values to a scale that is uniformly quantized in terms of disparity. depth_nonlinear_representation_model[ i ] specifies the piecewise linear segments for mapping of depth values to a scale that is uniformly quantized in terms of disparity. When depth_representation_type is equal to 3, the depth view component contains nonlinearly transformed depth samples. The variable DepthLUT[ i ], as specified below, is used to transform coded depth sample values from the nonlinear representation to the linear representation, i.e. disparity normalized in range from 0 to 255.
The shape of this transform is defined by means of line-segment approximation in two-dimensional linear-disparity-to-nonlinear-disparity space. The first (0, 0) and the last (255, 255) nodes of the curve are predefined. Positions of additional nodes are transmitted in the form of deviations (depth_nonlinear_representation_model[ i ]) from the straight-line curve. These deviations are uniformly distributed along the whole range of 0 to 255, inclusive, with spacing depending on the value of depth_nonlinear_representation_num.
Variable DepthLUT[ i ] for i in the range of 0 to 255, inclusive, is specified as follows.
depth_nonlinear_representation_model[ 0 ] = 0
depth_nonlinear_representation_model[ depth_nonlinear_representation_num + 1 ] = 0
for( k = 0; k <= depth_nonlinear_representation_num; ++k ) {
    pos1 = ( 255 * k ) / ( depth_nonlinear_representation_num + 1 )
    dev1 = depth_nonlinear_representation_model[ k ]
    pos2 = ( 255 * ( k + 1 ) ) / ( depth_nonlinear_representation_num + 1 )
    dev2 = depth_nonlinear_representation_model[ k + 1 ]
    x1 = pos1 - dev1
    y1 = pos1 + dev1
    x2 = pos2 - dev2
    y2 = pos2 + dev2
    for( x = max( x1, 0 ); x < min( x2, 255 ); ++x )
        DepthLUT[ x ] = Clip3( 0, 255, Round( ( ( x - x1 ) * ( y2 - y1 ) ) / ( x2 - x1 ) + y1 ) )
}

In a scheme referred to as unpaired multiview video-plus-depth (MVD), there may be an unequal number of texture and depth views, and/or some of the texture views might not have a co-located depth view, and/or some of the depth views might not have a co-located texture view, some of the depth view components might not be temporally coinciding with texture view components or vice versa, co-located texture and depth views might cover a different spatial area, and/or there may be more than one type of depth view components. Encoding, decoding, and/or processing of an unpaired MVD signal may be facilitated by a depth-enhanced video coding, decoding, and/or processing scheme.
Terms co-located, collocated, and overlapping may be used interchangeably to indicate that a certain sample or area in a texture view component represents the same physical objects or fragments of a 3D scene as a certain co-located/collocated/overlapping sample or area in a depth view component. In some embodiments, the sampling grid of a texture view component may be the same as the sampling grid of a depth view component, i.e. one sample of a component image, such as a luma image, of a texture view component corresponds to one sample of a depth view component, i.e. the physical dimensions of a sample match between a component image, such as a luma image, of a texture view component and the corresponding depth view component. In some embodiments, sample dimensions (twidth x theight) of a sampling grid of a component image, such as a luma image, of a texture view component may be an integer multiple of sample dimensions (dwidth x dheight) of a sampling grid of a depth view component, i.e. twidth = m x dwidth and theight = n x dheight, where m and n are positive integers. In some embodiments, dwidth = m x twidth and dheight = n x theight, where m and n are positive integers. In some embodiments, twidth = m x dwidth and theight = n x dheight, or alternatively dwidth = m x twidth and dheight = n x theight, where m and n are positive values and may be non-integer. In these embodiments, an interpolation scheme may be used in the encoder and in the decoder and in the view synthesis process and other processes to derive co-located sample values between texture and depth. In some embodiments, the physical position of a sampling grid of a component image, such as a luma image, of a texture view component may match that of the corresponding depth view and the sample dimensions of a component image, such as a luma image, of the texture view component may be an integer multiple of sample dimensions (dwidth x dheight) of a sampling grid of the depth view component (or vice versa) - then, the texture view component and the depth view component may be considered to be co-located and represent the same viewpoint. In some embodiments, the position of a sampling grid of a component image, such as a luma image, of a texture view component may have an integer-sample offset relative to the sampling grid position of a depth view component, or vice versa. In other words, a top-left sample of a sampling grid of a component image, such as a luma image, of a texture view component may correspond to the sample at position (x, y) in the sampling grid of a depth view component, or vice versa, where x and y are non-negative integers in a two-dimensional Cartesian coordinate system with non-negative values only and origin in the top-left corner. In some embodiments, the values of x and/or y may be non-integer and consequently an interpolation scheme may be used in the encoder and in the decoder and in the view synthesis process and other processes to derive co-located sample values between texture and depth. In some embodiments, the sampling grid of a component image, such as a luma image, of a texture view component may have unequal extents compared to those of the sampling grid of a depth view component.
In other words, the number of samples in horizontal and/or vertical direction in a sampling grid of a component image, such as a luma image, of a texture view component may differ from the number of samples in horizontal and/or vertical direction, respectively, in a sampling grid of a depth view component, and/or the physical width and/or height of a sampling grid of a component image, such as a luma image, of a texture view component may differ from the physical width and/or height, respectively, of a sampling grid of a depth view component. In some embodiments, non-uniform and/or non-matching sample grids can be utilized for texture and/or depth components. A sample grid of a depth view component is non-matching with the sample grid of a texture view component when the sampling grid of a component image, such as a luma image, of the texture view component is not an integer multiple of sample dimensions (dwidth x dheight) of a sampling grid of the depth view component, or the sampling grid position of a component image, such as a luma image, of the texture view component has a non-integer offset compared to the sampling grid position of the depth view component, or the sampling grids of the depth view component and the texture view component are not aligned/rectified. This could happen for example on purpose to reduce redundancy of data in one of the components or due to inaccuracy of the calibration/rectification process between a depth sensor and a colour image sensor.
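The relationship between texture and depth sampling grids described above may be illustrated with the following non-normative sketch, which maps a texture luma sample position to the corresponding position in a depth sampling grid given scale factors m and n and a grid offset. All structure and parameter names are hypothetical; non-integer results would call for interpolation as noted above.

typedef struct {
    double m, n;          /* twidth = m * dwidth, theight = n * dheight   */
    double off_x, off_y;  /* offset of the depth grid relative to texture,
                             expressed in depth samples                   */
} GridMapping;

/* Map a texture luma sample position (tx, ty) to the co-located position
 * (dx, dy) in the depth sampling grid. */
static void texture_to_depth_pos(const GridMapping *g,
                                 double tx, double ty,
                                 double *dx, double *dy)
{
    *dx = tx / g->m - g->off_x;
    *dy = ty / g->n - g->off_y;
}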
A coded depth-enhanced video bitstream, such as an MVC+D bitstream or a 3D-AVC bitstream, may be considered to include different types of operation points: texture video operation points, such as MVC operation points, texture-plus-depth operation points including both texture views and depth views, and depth video operation points including only depth views. An MVC operation point comprises texture view components as specified by the SPS MVC extension. The texture-plus-depth operation points may be paired or unpaired. In paired texture-plus-depth operation points, each view contains both a texture view and a depth view (if both are defined in the 3DVC subset SPS by the same syntax structure as that used in the SPS MVC extension, originally present in the bitstream). In unpaired texture-plus-depth operation points, it is specified whether a texture view or a depth view or both are present in the operation point for a particular view.
The coding and/or decoding order of texture view components and depth view components may determine the presence of syntax elements related to inter-component prediction and the allowed values of syntax elements related to inter-component prediction.
In the case of joint coding of texture and depth for depth-enhanced video, view synthesis can be utilized in the loop of the codec, thus providing view synthesis prediction (VSP). In VSP, a prediction signal, such as a VSP reference picture, is formed using a DIBR or view synthesis algorithm, utilizing texture and depth information. For example, a synthesized picture (i.e., VSP reference picture) may be introduced in the reference picture list in a similar way as is done with inter-view reference pictures and inter-view only reference pictures. Alternatively or in addition, a specific VSP prediction mode for certain prediction blocks may be determined by the encoder, indicated in the bitstream by the encoder, and used as concluded from the bitstream by the decoder. In MVC, both inter prediction and inter-view prediction use a similar motion-compensated prediction process. Inter-view reference pictures and inter-view only reference pictures are essentially treated as long-term reference pictures in the different prediction processes. Similarly, view synthesis prediction may be realized in such a manner that it uses essentially the same motion-compensated prediction process as inter prediction and inter-view prediction. To differentiate from motion-compensated prediction taking place only within a single view without any VSP, motion-compensated prediction that includes and is capable of flexibly selecting and mixing inter prediction, inter-view prediction, and/or view synthesis prediction is herein referred to as mixed-direction motion-compensated prediction.
As reference picture lists in MVC and an envisioned coding scheme for MVD such as 3DV-ATM and in similar coding schemes may contain more than one type of reference pictures, i.e. inter reference pictures (also known as intra-view reference pictures), inter-view reference pictures, inter-view only reference pictures, and VSP reference pictures, a term prediction direction may be defined to indicate the use of intra-view reference pictures (temporal prediction), inter-view prediction, or VSP. For example, an encoder may choose for a specific block a reference index that points to an inter-view reference picture, thus the prediction direction of the block is inter-view.
A VSP reference picture may also be referred to as a synthetic reference component, which may be defined to contain samples that may be used for view synthesis prediction. A synthetic reference component may be used as a reference picture for view synthesis prediction but may not be output or displayed. A view synthesis picture may be generated for the same camera location assuming the same camera parameters as for the picture being coded or decoded. A view-synthesized picture may be introduced in the reference picture list in a similar way as is done with inter-view reference pictures. Signaling and operations with the reference picture list in the case of view synthesis prediction may remain identical or similar to those specified in H.264/AVC or HEVC.
A synthesized picture resulting from VSP may be included in the initial reference picture lists List0 and List1 for example following temporal and inter-view reference frames.
However, the reference picture list modification syntax (i.e., RPLR commands) may be extended to support VSP reference pictures, thus the encoder can order the reference picture lists in any order and indicate the final order with RPLR commands in the bitstream, causing the decoder to reconstruct the reference picture lists having the same final order.
Processes for predicting from a view synthesis reference picture, such as motion information derivation, may remain identical or similar to processes specified for inter, inter-layer, and inter-view prediction of H.264/AVC or HEVC. Alternatively or in addition, specific coding modes for the view synthesis prediction may be specified and signaled by the encoder in the bitstream. In other words, VSP may alternatively or also be used in some encoding and decoding arrangements as a separate mode from intra, inter, inter-view and other coding modes. For example, in a VSP skip/direct mode the motion vector difference (de)coding and the (de)coding of the residual prediction error for example using transform-based coding may also be omitted. For example, if a macroblock may be indicated within the bitstream to be coded using a skip/direct mode, it may further be indicated within the bitstream whether a VSP frame is used as a reference. Alternatively or in addition, view-synthesized reference blocks, rather than or in addition to complete view synthesis reference pictures, may be generated by the encoder and/or the decoder and used as prediction reference for various prediction processes.
To enable view synthesis prediction for the coding of the current texture view component, the previously coded texture and depth view components of the same access unit may be used for the view synthesis. Such a view synthesis that uses the previously coded texture and depth view components of the same access unit may be referred to as a forward view synthesis or forward-projected view synthesis, and similarly view synthesis prediction using such view synthesis may be referred to as forward view synthesis prediction or forward-projected view synthesis prediction.
Forward View Synthesis Prediction (VSP) may be performed as follows. View synthesis may be implemented through depth map (d) to disparity (D) conversion with the following mapping of pixels of source picture s(x,y) to a new pixel location in the synthesized target image t(x+D,y).
t( x + D( s(x,y) ), y ) = s( x, y )

where

D( s(x,y) ) = f * b / z( s(x,y) ),   1 / z( s(x,y) ) = ( d( s(x,y) ) / 255 ) * ( 1/Znear - 1/Zfar ) + 1/Zfar    (2)

In the case of projection of a texture picture, s(x,y) is a sample of the texture image, and d(s(x,y)) is the depth map value associated with s(x,y).
In the case of projection of depth map values, s(x,y) = d(x,y) and this sample is projected using its own value, i.e. d(s(x,y)) = d(x,y).
The forward view synthesis process may comprise two conceptual steps: forward warping and hole filling. In forward warping, each pixel of the reference image is mapped to a synthesized image. When multiple pixels from a reference frame are mapped to the same sample location in the synthesized view, the pixel associated with a larger depth value (closer to the camera) may be selected in the mapping competition.
After warping all pixels, there may be some hole pixels left with no sample values mapped from the reference frame, and these hole pixels may be filled in for example with a line-based directional hole filling, in which a "hole" is defined as consecutive hole pixels in a horizontal line between two non-hole pixels. Hole pixels may be filled by one of the two adjacent non-hole pixels which has a smaller depth sample value (farther from the camera).
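A minimal, non-normative sketch of the forward warping and line-based hole filling steps for a single image row is given below, assuming 8-bit samples, integer disparities, a row width of at most 4096 samples, and the simple d-to-1/z-to-D mapping of equations (2) and (3). A real implementation would additionally handle sub-sample accuracy, chroma components and picture boundaries; all names are illustrative.

#include <string.h>

/* Disparity from a depth sample following the d -> 1/z -> D mapping of
 * equations (2)/(3): a larger depth sample value means closer to the camera. */
static int disparity_from_depth(int d, double f, double b,
                                double z_near, double z_far)
{
    double inv_z = (d / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far;
    return (int)(f * b * inv_z + 0.5);
}

static void forward_warp_row(const unsigned char *src_tex,
                             const unsigned char *src_depth,
                             unsigned char *dst_tex, int width,
                             double f, double b,
                             double z_near, double z_far)
{
    int best_depth[4096];                       /* per-row z-buffer; width <= 4096 assumed */
    memset(dst_tex, 0, (size_t)width);
    memset(best_depth, -1, sizeof(int) * (size_t)width);   /* -1 marks a hole */

    /* Forward warping: in a mapping competition the closer pixel
     * (larger depth sample) wins. */
    for (int x = 0; x < width; x++) {
        int D = disparity_from_depth(src_depth[x], f, b, z_near, z_far);
        int tx = x + D;
        if (tx < 0 || tx >= width)
            continue;
        if ((int)src_depth[x] > best_depth[tx]) {
            best_depth[tx] = src_depth[x];
            dst_tex[tx] = src_tex[x];
        }
    }

    /* Line-based hole filling: copy from the adjacent non-hole pixel that is
     * farther from the camera (smaller depth sample value). */
    for (int x = 0; x < width; x++) {
        if (best_depth[x] >= 0)
            continue;
        int l = x - 1, r = x;
        while (r < width && best_depth[r] < 0)
            r++;
        unsigned char fill;
        if (l < 0)           fill = (r < width) ? dst_tex[r] : 0;
        else if (r >= width) fill = dst_tex[l];
        else                 fill = (best_depth[l] <= best_depth[r]) ? dst_tex[l] : dst_tex[r];
        while (x < r)
            dst_tex[x++] = fill;
    }
}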
In a scheme referred to as a backward view synthesis or backward-projected view synthesis, the depth map co-located with the synthesized view is used in the view synthesis process. View synthesis prediction using such backward view synthesis may be referred to as backward view synthesis prediction or backward-projected view synthesis prediction or B-VSP. To enable backward view synthesis prediction for the coding of the current texture view component, the depth view component of the currently coded/decoded texture view component is required to be available. In other words, when the coding/decoding order of a depth view component precedes the coding/decoding order of the respective texture view component, backward view synthesis prediction may be used in the coding/decoding of the texture view component.
With B-VSP, texture pixels of a dependent view can be predicted not from a synthesized VSP frame, but directly from the texture pixels of the base or reference view. Displacement vectors required for this process may be produced from the depth map data of the dependent view, i.e. the depth view component corresponding to the texture view component currently being coded/decoded.
The concept of B-VSP may be explained with reference to Figure 20 as follows. Let us assume that the following coding order is utilized: (T0, D0, D1, T1). Texture component T0 is a base view and T1 is a dependent view coded/decoded using B-VSP as one prediction tool. Depth map components D0 and D1 are the respective depth maps associated with T0 and T1, respectively. In dependent view T1, sample values of a currently coded block Cb may be predicted from a reference area R(Cb) that consists of sample values of the base view T0. The displacement vector (motion vector) between coded and reference samples may be found as a disparity between T1 and T0 from a depth map value associated with a currently coded texture sample.
The process of conversion of depth (1/z) representation to disparity may be performed for example with the following equations:

D( Cb(j,i) ) = f * b / z( Cb(j,i) ),   1 / z( Cb(j,i) ) = ( d( Cb(j,i) ) / 255 ) * ( 1/Znear - 1/Zfar ) + 1/Zfar    (3)

where j and i are local spatial coordinates within Cb, d(Cb(j,i)) is a depth map value in the depth map image of view #1, Z is its actual depth value, and D is a disparity to a particular view #0. The parameters f, b, Znear and Zfar are parameters specifying the camera setup, i.e. the used focal length (f), the camera separation (b) between view #1 and view #0 and the depth range (Znear, Zfar) representing parameters of the depth map conversion.
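As a non-normative illustration of equation (3), the following sketch derives a horizontal displacement (disparity) vector for a coded block Cb from its associated depth block. Taking the maximum of the four corner depth samples is only one possible simplification and is not mandated by the description above; all names are illustrative.

/* bx, by: top-left position of the block inside the depth picture,
 * bw, bh: block width and height, stride: depth picture stride. */
static int bvsp_block_disparity(const unsigned char *depth, int stride,
                                int bx, int by, int bw, int bh,
                                double f, double b,
                                double z_near, double z_far)
{
    const unsigned char *blk = depth + by * stride + bx;

    /* One possible reduction of the depth block: maximum of the corners,
     * i.e. the sample closest to the camera. */
    int d = blk[0];
    if (blk[bw - 1] > d)                     d = blk[bw - 1];
    if (blk[(bh - 1) * stride] > d)          d = blk[(bh - 1) * stride];
    if (blk[(bh - 1) * stride + bw - 1] > d) d = blk[(bh - 1) * stride + bw - 1];

    /* Depth sample -> 1/z -> disparity, as in equation (3). */
    double inv_z = (d / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far;
    return (int)(f * b * inv_z + 0.5);       /* horizontal displacement */
}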
A coding scheme for unpaired MVD may for example include one or more of the following aspects: Encoding one or more indications of which ones of the input texture and depth views are encoded, inter-view prediction hierarchy of texture views and depth views, and/or AU view component order into a bitstream.
As a response to a depth view being required as a reference or input for prediction (such as view synthesis prediction, inter-view prediction, inter-component prediction, and/or alike) and/or for view synthesis performed as post-processing for decoding, and the depth view not being input to the encoder or being determined not to be coded, performing the following: Deriving the depth view, one or more depth view components for the depth view, or parts of one or more depth view components for the depth view on the basis of coded depth views and/or coded texture views and/or reconstructed depth views and/or reconstructed texture views or parts of them. The derivation may be based on view synthesis or DIBR, for example.
Using the derived depth view as a reference or input for prediction (such as view synthesis prediction, inter-view prediction, inter-component prediction, and/or alike) and/or for view synthesis performed as post-processing for decoding.
Inferring the use of one or more coding tools, modes of coding tools, and/or coding parameters for coding a texture view based on the presence or absence of a respective coded depth view and/or the presence or absence of a respective derived depth view. In some embodiments, when a depth view is required as a reference or input for prediction (such as view synthesis prediction, inter-view prediction, inter-component prediction, and/or alike) but is not encoded, the encoder may derive the depth view; or infer that coding tools causing a depth view to be required as a reference or input for prediction are turned off; or select one of the above adaptively and encode the chosen option and related parameter values, if any, as one or more indications into the bitstream.
Forming an inter-component prediction signal or prediction block or alike from a depth view component (or, generally from one or more depth view components) to a texture view component (or, generally to one or more texture view components) for a subset of predicted blocks in a texture view component on the basis of availability of co-located samples or blocks in a depth view component. Similarly, forming an inter-component prediction signal or a prediction block or alike from a texture view component (or, generally from one or more texture view components) to a depth view component (or, generally to one or more depth view components) for a subset of predicted blocks in a depth view component on the basis of availability of co-located samples or blocks in a texture view component.
Forming a view synthesis prediction signal or a prediction block or alike for a texture block on the basis of availability of co-located depth samples.
A decoding scheme for unpaired MVD may for example include one or more of the following aspects: Receiving and decoding one or more indications of coded texture and depth views, the inter-view prediction hierarchy of texture views and depth views, and/or the AU view component order from a bitstream.
When a depth view is required as a reference or input for prediction (such as view synthesis prediction, inter-view prediction, inter-component prediction, and/or alike) but is not included in the received bitstream, deriving the depth view; or inferring that coding tools causing a depth view to be required as a reference or input for prediction are turned off; or selecting one of the above based on one or more indications received and decoded from the bitstream.
Inferring the use of one or more coding tools, modes of coding tools, and/or coding parameters for decoding a texture view based on the presence or absence of a respective coded depth view and/or the presence or absence of a respective derived depth view.
Forming an inter-component prediction signal or prediction block or alike from a depth view component (or, generally from one or more depth view components) to a texture view component (or, generally to one or more texture view components) for a subset of predicted blocks in a texture view component on the basis of availability of co-located samples or blocks in a depth view component. Similarly, forming an inter-component prediction signal or prediction block or alike from a texture view component (or, generally from one or more texture view components) to a depth view component (or, generally to one or more depth view components) for a subset of predicted blocks in a depth view component on the basis of availability of co-located samples or blocks in a texture view component.
Forming a view synthesis prediction signal or prediction block or alike on the basis of availability of co-located depth samples.
When a depth view is required as a reference or input for view synthesis performed as post-processing, deriving the depth view.
Determining view components that are not needed for decoding or output on the basis of the mentioned signalling and configuring the decoder to avoid decoding these unnecessary coded view components.
Coded and/or decoded depth view components may be used for example for one or more of the following purposes: i) as prediction reference for other depth view components, ii) as prediction reference for texture view components for example through view synthesis prediction, iii) as input to a DIBR or view synthesis process performed as post-processing for decoding or pre-processing for rendering/displaying.
In many cases, a distortion in the depth map causes an impact in a view synthesis process, which may be used for view synthesis prediction and/or view synthesis done as post-processing for decoding. Thus, in many cases a depth distortion may be considered to have an indirect impact on the visual quality/fidelity of rendered views and/or on the quality/fidelity of the prediction signal. Decoded depth maps themselves might not be used in applications as such, e.g. they might not be displayed for end-users. The above-mentioned properties of depth maps and their impact may be used for rate-distortion-optimized encoder control. Rate-distortion-optimized mode and parameter selection for depth pictures may be made based on the estimated or derived quality or fidelity of a synthesized view component. Moreover, the resulting rate-distortion performance of the texture view component (due to depth-based prediction and coding tools) may be taken into account in the mode and parameter selection for depth pictures. Several methods for rate-distortion optimization of depth-enhanced video coding have been presented that take into account the view synthesis fidelity. These methods may be referred to as view synthesis optimization (VSO) methods.
Video compression is commonly achieved by removing spatial, frequency, and/or temporal redundancies. Different types of prediction and quantization of transform-domain prediction residuals may be used to exploit both spatial and temporal redundancies. In addition, as coding schemes have a practical limit in the redundancy that can be removed, the spatial and temporal sampling frequency as well as the bit depth of samples can be selected in such a manner that the subjective quality is degraded as little as possible.
One potential way for obtaining compression improvement in stereoscopic video is an asymmetric stereoscopic video coding, in which there is a quality difference between two coded views. This is attributed to the widely believed assumption of the binocular suppression theory that the Human Visual System (HVS) fuses the stereoscopic image pair such that the perceived quality is close to that of the higher quality view.
Asymmetry between the two views can be achieved e.g. by one or more of the following methods: Mixed-resolution (MR) stereoscopic video coding, which may also be referred to as resolution-asymmetric stereoscopic video coding, in which one of the views is low-pass filtered and hence has a smaller amount of spatial details or a lower spatial resolution. Furthermore, the low-pass filtered view may be sampled with a coarser sampling grid, i.e., represented by fewer pixels.
Mixed-resolution chroma sampling, in which the chroma pictures of one view are represented by fewer samples than the respective chroma pictures of the other view.
Asymmetric sample-domain quantization, in which the sample values of the two views are quantized with a different step size. For example, the luma samples of one view may be represented with the range of 0 to 255 (i.e., 8 bits per sample) while the range may be scaled e.g. to the range of 0 to 159 for the second view (see the sketch after this list). Thanks to fewer quantization steps, the second view can be compressed with a higher ratio compared to the first view. Different quantization step sizes may be used for luma and chroma samples. As a special case of asymmetric sample-domain quantization, one can refer to bit-depth-asymmetric stereoscopic video when the number of quantization steps in each view matches a power of two. Asymmetric transform-domain quantization, in which the transform coefficients of the two views are quantized with a different step size. As a result, one of the views has a lower fidelity and may be subject to a greater amount of visible coding artifacts, such as blocking and ringing.
A combination of different encoding techniques above may also be used.
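The asymmetric sample-domain quantization mentioned in the list above may be sketched, in a non-normative way, as a simple rescaling of the luma range of the second view, e.g. from [0, 255] to [0, 159], before encoding and back after decoding. The function names and the example range are illustrative assumptions.

/* Map a sample from the full 8-bit range [0, 255] to [0, new_max],
 * reducing the number of sample-domain quantization steps. */
static unsigned char asym_quantize_sample(unsigned char s, int new_max)
{
    return (unsigned char)((s * new_max + 127) / 255);
}

/* Inverse mapping applied after decoding the second view. */
static unsigned char asym_dequantize_sample(unsigned char q, int new_max)
{
    return (unsigned char)((q * 255 + new_max / 2) / new_max);
}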
The aforementioned types of asymmetric stereoscopic video coding are illustrated in Figure 12. The first row (12a) presents the higher quality view which is only transform-coded. The remaining rows (12b-12e) present several encoding combinations which have been investigated to create the lower quality view using different steps, namely, downsampling, sample-domain quantization, and transform-based coding. It can be observed from the figure that downsampling or sample-domain quantization can be applied or skipped regardless of how other steps in the processing chain are applied.
Likewise, the quantization step in the transform-domain coding step can be selected independently of the other steps. Thus, practical realizations of asymmetric stereoscopic video coding may use appropriate techniques for achieving asymmetry in a combined manner as illustrated in Fig. 12e.
In addition to the aforementioned types of asymmetric stereoscopic video coding, mixed temporal resolution (i.e., different picture rate) between views may also be used.
In multiview video coding, motion vectors of different views may be quite correlated as the views are captured from cameras that are slightly apart from each other. Therefore, utilizing motion data of one view for coding the other view may improve the coding efficiency of a multiview video coder.
Multiview video coding may be realized in many ways. For example, multiview video coding may be realized by only introducing high-level syntax changes to a single-layer video coder without any changes below the macroblock (or coding tree block) layer. In this high-level-only multiview video coder, the decoded pictures from different views may be placed in the decoded picture buffer (DPB) of other views and treated as a regular reference picture.
A temporal motion vector prediction process may be used to exploit the redundancy of motion data between different layers. This may be done as follows: when the base layer is upsampled, the motion data of the base layer is also mapped to the resolution of an enhancement layer. If the enhancement layer picture utilizes temporal motion vector prediction from the base layer picture, the corresponding motion vector predictor is originated from the mapped base layer motion field. This way the correlation between the motion data of different layers may be exploited to improve the coding efficiency of a scalable video coder.
This kind of motion mapping process may be useful for mapping motion fields between layers of different resolutions, but may not work for multiview video coding.
In an inter-view motion skip or prediction mode for multiview video coding, correlations of motion data existing between different views may be exploited. If this mode is enabled, motion data of the corresponding block may be calculated using the motion information from a different view. This calculation may involve first finding the corresponding motion blocks in another view due to disparity, and performing a pre-defined operation on the corresponding motion blocks. Due to the new mode, this approach may not be suitable for a high-level syntax only multiview video coder.
It may also be possible to use motion of one view to predict motion of another view by establishing a correspondence between a block in one view and a block in a reference view. This may be done by estimating a depth map, either based on already transmitted depth data or by using transmitted disparity vectors. Establishing the correspondence may be implemented in a high-level syntax only coder.
Different measures may be derived from a block of depth samples cb_d, some of which are presented in the following. The depth/disparity information can be aggregatively presented through average depth/disparity values for cb_d and deviation (e.g. variance) of cb_d. The average depth/disparity value Av(cb_d) for a block of depth information cb_d may be computed as:

Av(cb_d) = sum(cb_d(x,y))/num_pixels    (4)

where x and y are coordinates of the pixels in cb_d, num_pixels is the number of pixels within cb_d, and function sum adds up all the sample/pixel values in the given block, i.e. function sum(block(x,y)) computes a sum of sample values within the given block for all values of x and y corresponding to the horizontal and vertical extents of the block.
The deviation Dev(cb_d) of the depth/disparity values within a block of depth information cb_d can be computed as:

Dev(cb_d) = sum(abs(cb_d(x,y) - Av(cb_d)))/num_pixels    (5)

where function abs returns the absolute value of the value given as input.
The following may be used to determine if a block of depth data cb_d represents homogenous data:

If Dev(cb_d) <= T1, cb_d = homogenous data    (6)

where T1 may be an application-specific predefined threshold and/or may be indicated by the encoder in the bitstream. In other words, if the deviation of the depth/disparity values within a block of depth information cb_d is less than or equal to the threshold T1, such a cb_d block can be considered as homogenous.
The similarity of two depth blocks (of the same shape and number of pixels), cb_d and nb_d, may be compared for example in one or more of the following ways. One way is to compute an average pixel-wise deviation (difference) for example as follows:

nsad(cb_d, nb_d) = sum(abs(cb_d(x,y) - nb_d(x,y)))/num_pixels    (7)

where x and y are coordinates of the pixels in cb_d and nb_d, num_pixels is the number of pixels within cb_d, and functions sum and abs are defined above. This equation may also be regarded as a sum of absolute differences (SAD) between the given depth blocks normalized by the number of pixels in the block.
In another example of a similarity or distortion metric, a sum of squared differences (SSD) normalized by the number of pixels may be used as computed below:

nsse(cb_d, nb_d) = sum( (cb_d(x,y) - nb_d(x,y))^2 ) / num_pixels    (8)

where x and y are coordinates of the pixels in cb_d and in its neighboring depth/disparity block (nb_d), num_pixels is the number of pixels within cb_d, notation ^2 indicates a power of two, and function sum is defined above.
In another example, a sum of transformed differences (SATD) may be used as a similarity or distortion metric. Both the current depth/disparity block cb_d and a neighbouring depth/disparity block nb_d are transformed using for example the DCT or a variant thereof, herein marked as function T(). Let tcb_d be equal to T(cb_d) and tnb_d be equal to T(nb_d). Then, either the sum of absolute or squared differences is calculated and may be normalized by the number of pixels/samples, num_pixels, in cb_d or nb_d, which is also equal to the number of transform coefficients in tcb_d or tnb_d. In the following equation, a version of the sum of transformed differences using the sum of absolute differences is given:

nsatd(cb_d, nb_d) = sum( abs(tcb_d(x,y) - tnb_d(x,y)) ) / num_pixels    (9)

Other distortion metrics, such as the structural similarity index (SSIM), may also be used for the derivation of the similarity of two depth blocks.
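The block measures of equations (4), (5), (7) and (8) may be transcribed, for illustration only, as in the following sketch, where blocks are assumed to be stored row by row with a given stride and all function names are hypothetical.

#include <math.h>
#include <stdlib.h>

/* Av(cb_d), equation (4): average of the depth/disparity samples. */
static double block_average(const unsigned char *blk, int stride, int w, int h)
{
    double sum = 0.0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += blk[y * stride + x];
    return sum / (w * h);
}

/* Dev(cb_d), equation (5): mean absolute deviation from the average. */
static double block_deviation(const unsigned char *blk, int stride, int w, int h)
{
    double av = block_average(blk, stride, w, h), sum = 0.0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += fabs(blk[y * stride + x] - av);
    return sum / (w * h);
}

/* nsad(cb_d, nb_d), equation (7): SAD normalized by the number of pixels. */
static double block_nsad(const unsigned char *a, const unsigned char *b,
                         int stride, int w, int h)
{
    double sum = 0.0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += abs(a[y * stride + x] - b[y * stride + x]);
    return sum / (w * h);
}

/* nsse(cb_d, nb_d), equation (8): SSD normalized by the number of pixels. */
static double block_nsse(const unsigned char *a, const unsigned char *b,
                         int stride, int w, int h)
{
    double sum = 0.0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            double d = (double)a[y * stride + x] - b[y * stride + x];
            sum += d * d;
        }
    return sum / (w * h);
}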
The similarity or distortion metric might not be performed for all sample locations of cb_d and nb_d but only for selected sample locations, such as the four corner samples, and/or cb_d and nb_d may be downsampled before performing the similarity or distortion metric computation.
Function diff(cb_d, nb_d) may be defined as follows to enable access to any similarity or distortion metric:

diff(cb_d, nb_d) = nsad(cb_d, nb_d), if the sum of absolute differences is used
                   nsse(cb_d, nb_d), if the sum of squared differences is used
                   nsatd(cb_d, nb_d), if the sum of transformed absolute differences is used    (10)

Any similarity/distortion metric could be added to the definition of function diff(cb_d, nb_d). In some embodiments, the used similarity/distortion metric is pre-defined and therefore stays the same in both the encoder and the decoder. In some embodiments, the used similarity/distortion metric is determined by the encoder, for example using rate-distortion optimization, and encoded in the bitstream as one or more indications.
The indication(s) of the used similarity/distortion metric may be included for example in a sequence parameter set, a picture parameter set, a slice parameter set, a picture header, a slice header, within a macroblock syntax structure, and/or anything alike. In some embodiments, the indicated similarity/distortion metric may be used in pre-determined operations in both the encoding and the decoding loop, such as depth/disparity based motion vector prediction. In some embodiments, the decoding processes for which the indicated similarity/distortion metric is indicated are also indicated in the bitstream for example in a sequence parameter set, a picture parameter set, a slice parameter set, a picture header, a slice header, within a macroblock syntax structure, or anything alike. In some embodiments, it is possible to have more than one pair of indications for the depth/disparity metric and the decoding processes the metric is applied to in the bitstream having the same persistence for the decoding process, i.e. applicable to decoding of the same access units. The encoder may select which similarity/distortion metric is used for each particular decoding process where a similarity/distortion based selection or other processing is used, such as depth/disparity based motion vector prediction, and encode respective indications of the selected disparity/distortion metrics and the decoding processes they apply to into the bitstream. When the similarity of disparity blocks is compared, the viewpoints of the blocks may be normalized, e.g. so that the disparity values are scaled to result from the same camera separation in both compared blocks.

A scalable video coding and/or decoding scheme may use multi-loop coding and/or decoding, which may be characterized as follows. In the encoding/decoding, a base layer picture may be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as a reference for inter-layer (or inter-view or inter-component) prediction. The reconstructed/decoded base layer picture may be stored in the DPB. An enhancement layer picture may likewise be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as a reference for inter-layer (or inter-view or inter-component) prediction for higher enhancement layers, if any. In addition to reconstructed/decoded sample values, syntax element values of the base/reference layer or variables derived from the syntax element values of the base/reference layer may be used in the inter-layer/inter-component/inter-view prediction.
A scalable video encoder for quality scalability (also known as Signal-to-Noise or SNR) and/or spatial scalability may be implemented as follows. For a base layer, a conventional non-scalable video encoder and decoder may be used. The reconstructed/decoded pictures of the base layer are included in the reference picture buffer and/or reference picture lists for an enhancement layer. In case of spatial scalability, the reconstructed/decoded base-layer picture may be upsampled prior to its insertion into the reference picture lists for an enhancement-layer picture. The base layer decoded pictures may be inserted into a reference picture list(s) for coding/decoding of an enhancement layer picture similarly to the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as an inter prediction reference and indicate its use with a reference picture index in the coded bitstream. The decoder decodes from the bitstream, for example from a reference picture index, that a base-layer picture is used as an inter prediction reference for the enhancement layer. When a decoded base-layer picture is used as the prediction reference for an enhancement layer, it is referred to as an inter-layer reference picture.
While the previous paragraph described a scalable video codec with two scalability layers with an enhancement layer and a base layer, it needs to be understood that the description can be generalized to any two layers in a scalability hierarchy with more than two layers. In this case, a second enhancement layer may depend on a first enhancement layer in encoding and/or decoding processes, and the first enhancement layer may therefore be regarded as the base layer for the encoding and/or decoding of the second enhancement layer. Furthermore, it needs to be understood that there may be inter-layer reference pictures from more than one layer in a reference picture buffer or reference picture lists of an enhancement layer, and each of these inter-layer reference pictures may be considered to reside in a base layer or a reference layer for the enhancement layer being encoded and/or decoded.
In scalable multiview coding, the same bitstream may contain coded view components of multiple views and at least some coded view components may be coded using quality and/or spatial scalability.
Work is ongoing to specify scalable and multiview extensions to the HEVC standard.
The multiview extension of HEVC, referred to as MV-HEVC, is similar to the MVC extension of H.264/AVC. Similarly to MVC, in MV-HEVC, inter-view reference pictures can be included in the reference picture list(s) of the current picture being coded or decoded. The scalable extension of HEVC, referred to as SHVC, is planned to be specified so that it uses multi-loop decoding operation (unlike the SVC extension of H.264/AVC). Currently, two designs to realize scalability are investigated for SHVC.
One is reference index based, where an inter-layer reference picture can be included in one or more reference picture lists of the current picture being coded or decoded (as described above). Another may be referred to as IntraBL or TextureRL, where a specific coding mode, e.g. at CU level, is used for using decoded/reconstructed sample values of a reference layer picture for prediction in an enhancement layer picture. The SHVC development has concentrated on development of spatial and coarse grain quality scalability.
It is possible to use many of the same syntax structures, semantics, and decoding processes for MV-HEVC and reference-index-based SHVC. Furthermore, it is possible to use the same syntax structures, semantics, and decoding processes for depth coding too. Hereafter, the term scalable multiview extension of HEVC (SMV-HEVC) is used to refer to a coding process, a decoding process, syntax, and semantics where largely the same (de)coding tools are used regardless of the scalability type and where the reference index based approach without changes in the syntax, semantics, or decoding process below the slice header is used. SMV-HEVC might not be limited to multiview, spatial, and coarse grain quality scalability but may also support other types of scalability, such as depth-enhanced video.
For the enhancement layer coding, the same concepts and coding tools of HEVC may be used in SHVC, MV-HEVC, and/or SMV-HEVC. However, the additional inter-layer prediction tools, which employ already coded data (including reconstructed picture samples and motion parameters a.k.a. motion information) in a reference layer for efficiently coding an enhancement layer, may be integrated to the SHVC, MV-HEVC, and/or SMV-HEVC codec.
In MV-HEVC, SMV-HEVC, and the reference index based SHVC solution, the block level syntax and decoding process are not changed for supporting inter-layer texture prediction. Only the high-level syntax has been modified (compared to that of HEVC) so that reconstructed pictures (upsampled if necessary) from a reference layer of the same access unit can be used as the reference pictures for coding the current enhancement layer picture. The inter-layer reference pictures as well as the temporal reference pictures are included in the reference picture lists. The signalled reference picture index is used to indicate whether the current Prediction Unit (PU) is predicted from a temporal reference picture or an inter-layer reference picture. The use of this feature may be controlled by the encoder and indicated in the bitstream for example in a video parameter set, a sequence parameter set, a picture parameter, and/or a slice header.
The indication(s) may be specific to an enhancement layer, a reference layer, a pair of an enhancement layer and a reference layer, specific TemporalId values, specific picture types (e.g. RAP pictures), specific slice types (e.g. P and B slices but not I slices), pictures of a specific POC value, and/or specific access units, for example. The scope and/or persistence of the indication(s) may be indicated along with the indication(s) themselves and/or may be inferred.
The reference list(s) in MV-HEVC, SMV-HEVC, and a reference index based SHVC solution may be initialized using a specific process in which the inter-layer reference picture(s), if any, may be included in the initial reference picture list(s), which may be constructed as follows. For example, the temporal references may first be added into the reference lists (L0, L1) in the same manner as in the reference list construction in HEVC. After that, the inter-layer references may be added after the temporal references. The inter-layer reference pictures may be for example concluded from the layer dependency information, such as the RefLayerId[ i ] variable derived from the VPS extension as described above. The inter-layer reference pictures may be added to the initial reference picture list L0 if the current enhancement-layer slice is a P-slice, and may be added to both initial reference picture lists L0 and L1 if the current enhancement-layer slice is a B-slice. The inter-layer reference pictures may be added to the reference picture lists in a specific order, which can but need not be the same for both reference picture lists. For example, an opposite order of adding inter-layer reference pictures into the initial reference picture list 1 may be used compared to that of the initial reference picture list 0. For example, inter-layer reference pictures may be inserted into the initial reference picture list 0 in an ascending order of nuh_layer_id, while an opposite order may be used to initialize the initial reference picture list 1.
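The following non-normative sketch (in Python) illustrates one possible reading of the list construction described above, with temporal references first and inter-layer references appended afterwards; the data structures and function names are illustrative assumptions, not part of any specification.

```python
# Illustrative sketch: building initial reference picture lists for an
# enhancement-layer slice, temporal references first, inter-layer references
# appended afterwards. All names are hypothetical.

def build_initial_ref_lists(temporal_refs_l0, temporal_refs_l1,
                            inter_layer_refs, slice_type):
    """temporal_refs_l0/l1: temporal reference pictures already ordered as in HEVC.
    inter_layer_refs: inter-layer reference pictures, e.g. concluded from
    RefLayerId[ i ], each represented here as a dict with a 'nuh_layer_id' key."""
    il_ascending = sorted(inter_layer_refs, key=lambda pic: pic["nuh_layer_id"])

    # List 0: temporal references followed by inter-layer references
    # in ascending nuh_layer_id order.
    ref_list_0 = list(temporal_refs_l0) + il_ascending

    if slice_type == "P":
        return ref_list_0, None

    # List 1 (B-slice): inter-layer references appended in the opposite order.
    ref_list_1 = list(temporal_refs_l1) + list(reversed(il_ascending))
    return ref_list_0, ref_list_1
```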
In the coding and/or decoding process, the inter-layer reference pictures may be treated as long-term reference pictures.
In SMV-HEVC and a reference index based SHVC solution, inter-layer motion parameter prediction may be performed by setting the inter-layer reference picture as the collocated reference picture for TMVP derivation. A motion field mapping process between two layers may be performed for example to avoid block level decoding process modification in TMVP derivation. A motion field mapping could also be performed for multiview coding, but a present draft of MV-HEVC does not include such a process. The use of the motion field mapping feature may be controlled by the encoder and indicated in the bitstream for example in a video parameter set, a sequence parameter set, a picture parameter set, and/or a slice header. The indication(s) may be specific to an enhancement layer, a reference layer, a pair of an enhancement layer and a reference layer, specific TemporalId values, specific picture types (e.g. RAP pictures), specific slice types (e.g. P and B slices but not I slices), pictures of a specific POC value, and/or specific access units, for example. The scope and/or persistence of the indication(s) may be indicated along with the indication(s) themselves and/or may be inferred.
In a motion field mapping process for spatial scalability, the motion field of the upsampled inter-layer reference picture is attained based on the motion field of the respective reference layer picture. The motion parameters (which may e.g. include a horizontal and/or vertical motion vector value and a reference index) and/or a prediction mode for each block of the upsampled inter-layer reference picture may be derived from the corresponding motion parameters and/or prediction mode of the collocated block in the reference layer picture. The block size used for the derivation of the motion parameters and/or prediction mode in the upsampled inter-layer reference picture may be for example 16x16. The 16x16 block size is the same as in the HEVC TMVP derivation process where a compressed motion field of the reference picture is used.
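A hedged sketch of this mapping idea is given below: for each 16x16 block of the upsampled inter-layer reference picture, motion parameters are taken from the collocated reference-layer block. The centre-sample selection and the scaling of the motion vector by the resolution ratio are assumptions made for illustration, not a statement of the normative process.

```python
# Hedged sketch of motion field mapping for spatial scalability.
# scale_x/scale_y: enhancement-to-reference-layer resolution ratios (e.g. 2.0).

def map_motion_field(ref_layer_motion, scale_x, scale_y, el_width, el_height):
    """ref_layer_motion(x, y) -> dict with 'mv' (mvx, mvy), 'ref_idx', 'mode'
    for the reference-layer block covering luma position (x, y)."""
    mapped = {}
    for by in range(0, el_height, 16):
        for bx in range(0, el_width, 16):
            # Collocated position in the reference-layer picture (block centre).
            rx = int((bx + 8) / scale_x)
            ry = int((by + 8) / scale_y)
            src = ref_layer_motion(rx, ry)
            mvx, mvy = src["mv"]
            mapped[(bx, by)] = {
                "mv": (int(mvx * scale_x), int(mvy * scale_y)),  # assumed MV scaling
                "ref_idx": src["ref_idx"],
                "mode": src["mode"],
            }
    return mapped
```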
The TMVP process of HEVC is limited to one target picture per slice in the merge mode and one collocated picture (per slice). When applying the reference index based scalability on top of HEVC, the TMVP process of HEVC has limited applicability as explained in the following in the case of the merge mode. In the example, the target reference picture (with index 0 in the reference picture list) is a short-term reference picture. The motion vector in the collocated PU, if referring to a short-term (ST) reference picture, is scaled to form a merge candidate of the current PU (PU0), as shown in Figure 11a, wherein MV0 is scaled during the merge mode process. However, if the collocated PU has a motion vector (MV1) referring to an inter-view reference picture, marked as long-term, the motion vector is not used to predict the current PU (PU1).
There might be a significant amount of collocated PUs (in the collocated picture) which contain motion vectors referring to an inter-view reference picture while the target reference index (being equal to 0) indicates a short-term reference picture. Therefore, disabling prediction from those motion vectors makes the merge mode less efficient.
There have been proposals to overcome this issue, some of which are explained in the following paragraphs.
An additional target reference index may be indicated by the encoder in the bitstream and decoded by the decoder from the bitstream and/or inferred by the encoder and/or the decoder. As shown in Figure 11b, MV1 of the co-located block of PU1 can be used to form a disparity motion vector merging candidate. In general, when the reference index equal to 0 represents a short-term reference picture, the additional target reference index is used to represent a long-term reference picture. When the reference index equal to 0 represents a long-term reference picture, the additional target reference index is used to represent a short-term reference picture.
The methods to indicate or infer the additional reference index include but are not limited to the following: Indicating the additional target reference index in the bitstream, for example within the slice segment header syntax structure.
Deriving the changed target reference index to be equal to the smallest reference index which has a different marking (as used for short-term or long-term reference) from that of reference index 0.
In the case the co-located PU points to a reference picture having a different layer identifier (equal to layerA) than that for reference index 0, deriving the changed target reference index to be equal to the smallest reference index that has layer identifier equal to layerA.
In the merge mode process the default target picture (with reference index 0) is used when its marking as short-term or long-term reference picture is the same as that of the reference picture of the collocated block. Otherwise (i.e., when the marking of the reference picture corresponding to the additional reference index as short-term or long-term reference picture is the same as that of the reference picture of the collocated block), the target picture identified by the additional reference index is used.
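An illustrative sketch of one of the derivation options and of the resulting target picture selection is shown below; the dictionary-based reference list representation and function names are assumptions for the purpose of the example.

```python
# Sketch: choosing between the default target reference index (0) and an
# additional target reference index in the merge mode, based on the
# short-term/long-term marking of the collocated block's reference picture.

def derive_additional_ref_idx(ref_list):
    """One derivation option: the smallest reference index whose
    short-term/long-term marking differs from that of reference index 0."""
    default_is_long_term = ref_list[0]["long_term"]
    for idx, ref in enumerate(ref_list[1:], start=1):
        if ref["long_term"] != default_is_long_term:
            return idx
    return None  # no suitable index found

def select_target_ref_idx(ref_list, additional_ref_idx, collocated_ref_is_long_term):
    """ref_list[i]['long_term'] tells whether reference index i is marked
    as a long-term reference picture."""
    if ref_list[0]["long_term"] == collocated_ref_is_long_term:
        return 0                      # default target picture can be used
    # Otherwise use the additional target reference index, whose marking
    # differs from that of reference index 0.
    return additional_ref_idx
```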
In a textureRL based SHVC solution, the inter-layer texture prediction may be performed at CU level, for which a new prediction mode, named the textureRL mode, is introduced. The collocated upsampled base layer block is used as the prediction for the enhancement layer CU coded in textureRL mode. For an input CU of the enhancement layer encoder, the CU mode may be determined among intra, inter and textureRL modes, for example. The use of the textureRL feature may be controlled by the encoder and indicated in the bitstream for example in a video parameter set, a sequence parameter set, a picture parameter set, and/or a slice header. The indication(s) may be specific to an enhancement layer, a reference layer, a pair of an enhancement layer and a reference layer, specific TemporalId values, specific picture types (e.g. RAP pictures), specific slice types (e.g. P and B slices but not I slices), pictures of a specific POC value, and/or specific access units, for example. The scope and/or persistence of the indication(s) may be indicated along with the indication(s) themselves and/or may be inferred. Furthermore, the textureRL mode may be selected by the encoder at CU level and may be indicated in the bitstream per each CU for example using a CU level flag (texture_rl_flag) which may be entropy-coded e.g. using context adaptive arithmetic coding (e.g. CABAC).
The residue of a textureRL-predicted CU may be coded as follows. The transform process of a textureRL-predicted CU may be the same as that for an intra-predicted CU, where a discrete sine transform (DST) is applied to a TU of the luma component having 4x4 size and a discrete cosine transform (DCT) is applied to the other types of TUs. Transform coefficient coding of a textureRL-predicted CU may be the same as that of an inter-predicted CU, where a no-residual flag may be used to indicate whether the coefficients of the whole CU are skipped.
In a textureRL based SHVC solution, in addition to spatially and temporally neighboring PUs, the motion parameters of the collocated reference-layer block may also be used to form the merge candidate list. The base layer merge candidate may be derived at a location collocated to the central position of the current PU and may be inserted in a particular location of the merge list, such as the first candidate in the merge list. In the case of spatial scalability, the reference-layer motion vector may be scaled according to the spatial resolution ratio between the two layers. The pruning (duplicated candidates check) may be performed for each spatially neighboring candidate with the collocated base layer candidate. For the collocated base layer merge candidate and spatial merge candidate derivation, a certain maximum number of merge candidates may be used; for example four merge candidates may be selected among candidates that are located in six different positions. The temporal merge candidate may be derived in the same manner as done for the HEVC merge list. When the number of candidates does not reach the maximum number of merge candidates (which may be determined by the encoder and may be indicated in the bitstream and may be assigned to the variable MaxNumMergeCand) the additional candidates, including combined bi-predictive candidates and zero merge candidates, may be generated and added at the end of the merge list, similarly or identically to HEVC merge list construction.
In some coding and/or decoding arrangements, a reference index based scalability and a block-level scalability approach, such as a textureRL based approach, may be combined.
For example, multiview-video-plus-depth coding and/or decoding may be performed as follows. A textureRL approach may be used between the components of the same view.
For example, a depth view component may be inter-layer predicted using a textureRL approach from a texture view component of the same view. A reference index based approach may be used for inter-view prediction, and in some embodiments inter-view prediction may be applied only between view components of the same component type.
Work is also ongoing to specify depth-enhanced video coding extensions to the HEVC standard, which may be referred to as 3D-HEVC, in which texture views and depth views may be coded into a single bitstream where some of the texture views may be compatible with HEVC. In other words, an HEVC decoder may be able to decode some of the texture views of such a bitstream and can omit the remaining texture views and depth views.
Other types of scalability and scalable video coding include bit-depth scalability, where base layer pictures are coded at lower bit-depth (e.g. 8 bits) per luma and/or chroma sample than enhancement layer pictures (e.g. 10 or 12 bits), chroma format scalability, where enhancement layer pictures provide higher fidelity and/or higher spatial resolution in chroma (e.g. coded in 4:4:4 chroma format) than base layer pictures (e.g. 4:2:0 format), and color gamut scalability, where the enhancement layer pictures have a richer/broader color representation range than that of the base layer pictures; for example the enhancement layer may have the UHDTV (ITU-R BT.2020) color gamut and the base layer may have the ITU-R BT.709 color gamut. Any number of such other types of scalability may be realized for example with a reference index based approach or a block-based approach e.g. as described above.

An access unit and a coded picture may be defined for example in one of the following ways in various HEVC extensions: A coded picture may be defined as a coded representation of a picture comprising VCL NAL units with a particular value of nuh_layer_id and containing all coding tree units of the picture. An access unit may be defined as a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture.
A coded picture may be defined as a coded representation of a picture comprising VCL NAL units with a particular value of nuh_layer_id and containing all coding tree units of the picture. An access unit may be defined to comprise a coded picture with nuh_layer_id equal to 0 and zero or more coded pictures with nonzero nuh_layer_id.
A coded picture may be defined to comprise VCL NAL units with nuh_layer_id equal to 0 (only), and a layer picture may be defined to comprise VCL NAL units of a particular nonzero nuh_layer_id. An access unit may be defined to comprise a coded picture and zero or more layer pictures.
The constraints on NAL unit order may need to be specified using different phrasing depending on which option to define an access unit and a coded picture is used.
Furthermore, the hypothetical reference decoder (HRD) may need to be specified using different phrasing depending on which option to define an access unit and a coded picture is used. It is anyhow possible to specify identical NAL unit order constraints and HRD operation for all options. Moreover, a majority of decoding processes is specified for coded pictures and parts thereof (e.g. coded slices) and hence the decision on which option to define an access unit and a coded picture is used has only a small or non-existing impact on the way the decoding processes are specified. In some embodiments, the first option above may be used but it should be understood that some embodiments may be similarly described using the other definitions.
Assuming the first option to define an access unit and a coded picture, a coded video sequence may be defined as a sequence of access units that consists, in decoding order, of a CRA access unit with nuh_layer_id equal to 0 that is the first access unit in the bitstream, an IDR access unit with nuh_layer_id equal to 0 or a BLA access unit with nuh_layer_id equal to 0, followed by zero or more access units none of which is an IDR access unit with nuh_layer_id equal to 0 nor a BLA access unit with nuh_layer_id equal to 0, up to but not including any subsequent IDR or BLA access unit with nuh_layer_id equal to 0.
The term temporal instant or time instant or time entity may be defined to represent a same capturing time or output time or output order. For example, if a first view component of a first view is captured at the same time as a second view component in a second view, these two view components may be considered to be of the same time instant. An access unit may be defined to contain pictures (or view components) of the same time instant, and hence in this case pictures residing in an access unit may be considered to be of the same time instant. Pictures of the same time instant may be indicated (e.g. by the encoder) using multiple means and may be identified (e.g. by the decoder) using multiple means, such as a picture order count (POC) value or a timestamp (e.g. an output timestamp).
Many video encoders utilize the Lagrangian cost function to find rate-distortion optimal coding modes, for example the desired macroblock mode and associated motion vectors. This type of cost function uses a weighting factor λ (lambda) to tie together the exact or estimated image distortion due to lossy coding methods and the exact or estimated amount of information required to represent the pixel/sample values in an image area.
The Lagrangian cost function may be represented by the equation: C = D + λR (11), where C is the Lagrangian cost to be minimised, D is the image distortion (for example, the mean-squared error between the pixel/sample values in the original image block and in the coded image block) with the mode and motion vectors currently considered, λ is a Lagrangian coefficient and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
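A minimal, non-normative sketch of mode selection with equation (11) is given below; the candidate modes, distortion values and bit counts are placeholder numbers used only to demonstrate the cost comparison.

```python
# Sketch of Lagrangian mode selection using equation (11): C = D + lambda * R.

def lagrangian_cost(distortion, rate_bits, lmbda):
    return distortion + lmbda * rate_bits

def choose_mode(candidates, lmbda):
    """candidates: iterable of (mode, distortion, rate_bits) tuples,
    e.g. produced by trial encoding of a block."""
    best_mode, best_cost = None, float("inf")
    for mode, distortion, rate_bits in candidates:
        cost = lagrangian_cost(distortion, rate_bits, lmbda)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Example: pick among three hypothetical coding modes of a block.
modes = [("intra", 1500.0, 120), ("inter", 900.0, 180), ("merge", 1000.0, 95)]
print(choose_mode(modes, lmbda=5.0))  # -> ('merge', 1475.0)
```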
In the following, the term layer is used in the context of any type of scalability, including view scalability and depth enhancements. An enhancement layer refers to any type of an enhancement, such as SNR, spatial, multiview, depth, bit-depth, chroma format, and/or color gamut enhancement. A base layer also refers to any type of a base operation point, such as a base view, a base layer for SNR/spatial scalability, or a texture base view for depth-enhanced video coding.
There are ongoing standardization activities to specify a multiview extension of HEVC (which may be referred to as MV-HEVC), a depth-enhanced multiview extension of HEVC (which may be referred to as 3D-HEVC), and a scalable extension of HEVC (which may be referred to as SHVC). A multi-loop decoding operation has been envisioned to be used in all these specifications.
In scalable video coding schemes utilizing multi-loop (de)coding, decoded reference pictures for each (de)coded layer may be maintained in a decoded picture buffer (DPB).
The memory consumption for the DPB may therefore be significantly higher than that for scalable video coding schemes with a single-loop (de)coding operation. However, multi-loop (de)coding may have other advantages, such as relatively few additional parts compared to single-layer coding.
In scalable video coding with multi-loop decoding, enhancement layers may be predicted from pictures that have already been decoded in the base (reference) layer. Such pictures may be stored in the DPB of the base layer and may be marked as used for reference. In certain circumstances, a picture marked as used for reference may be stored in fast memory, in order to provide fast random access to its samples, and may remain stored after the picture is supposed to be displayed in order to be used as a reference for prediction. This imposes requirements on memory organization. In order to relax such memory requirements, a conventional design in multi-loop multi-layer video coding schemes (such as MVC) assumes restricted utilization of inter-layer predictions.
Inter-layer/inter-view prediction for an enhanced view is allowed from a decoded picture of the base view located at the same access unit, in other words representing the scene at the same time entity. In such designs, the number of reference pictures available for predicting enhanced views is increased by 1 for each reference view.
It has been proposed that in scalable video coding with multi-loop (de)coding operation pictures marked as used for reference need not originate from the same access units in all layers. For example, a smaller number of reference pictures may be maintained in an enhancement layer compared to the base layer. In some embodiments a temporal inter-layer prediction, which may also be referred to as a diagonal inter-layer prediction or diagonal prediction, can be used to improve compression efficiency in such coding scenarios.
Another, complementary way of categorizing different types of prediction is to consider across which domains or scalability types the prediction crosses. This categorization may lead into one or more of the following types of prediction, which may also sometimes be referred to as prediction directions: Temporal prediction, e.g. of sample values or motion vectors from an earlier picture, usually of the same scalability layer, view and component type (texture or depth).
Inter-view prediction (which may be also referred to as cross-view prediction) referring to prediction taking place between view components usually of the same time instant or access unit and the same component type.
Inter-layer prediction referring to prediction taking place between layers usually of the same time instant, of the same component type, and of the same view.
Inter-component prediction may be defined to comprise prediction of syntax element values, sample values, variable values used in the decoding process, or anything alike from a component picture of one type to a component picture of another type. For example, inter-component prediction may comprise prediction of a texture view component from a depth view component, or vice versa.
Prediction approaches using image information from a previously coded image can also be called inter prediction methods. Inter prediction may sometimes be considered to only include motion-compensated temporal prediction, while it may sometimes be considered to include all types of prediction where a reconstructed/decoded block of samples is used as a prediction source, therefore including conventional inter-view prediction for example. Inter prediction may be considered to comprise only sample prediction but it may alternatively be considered to comprise both sample and syntax prediction. As a result of syntax and sample prediction, a predicted block of pixels or samples may be obtained.
If the prediction, such as predicted variable values and/or prediction blocks, is not refined by the encoder using any form of prediction error or residual coding, prediction may be referred to as inheritance. For example, in the merge mode of HEVC, the prediction motion information is not refined e.g. by (de)coding motion vector differences, and hence the merge mode may be considered as an example of motion information inheritance.
Video coding schemes may utilize a prediction scheme between pictures. Prediction may be performed in the encoder for example through a process of block partitioning and block matching between a currently coded block (Cb) in the current picture and a reference block (Rb) in the picture which is selected as reference. Therefore, parameters of such a prediction can be defined as motion information (MI) comprising for example one or more of the following: spatial coordinates of the Cb (e.g. coordinates of the top-left pixel of the Cb), a reference index refIdx which specifies the picture in the reference picture list which is selected as the reference picture, a motion vector (MV) specifying the displacement between the spatial coordinates of the Cb and Rb in the reference picture, and the size and shape of the motion partition (the size and shape of the matching block).
For example, the motion information for Cb can be defined as follows: MI(Cb) = {coordinates(Cb), refIdx(Cb), MV(Cb), sizes(Cb)} (12). In equation (12), the terms are defined as follows. Reference index refIdx(Cb) specifies the reference picture in the reference picture list which is utilized for Cb prediction and contains the reference block Rb. Motion vector MV(Cb) = {mvx, mvy} specifies the displacement of the spatial coordinates of the currently coded block Cb and its reference block Rb. Spatial coordinates coordinates(Cb) = {x, y} specify the location of the top-left pixel of the Cb block in the currently coded picture. sizes(Cb) = {height, width} specify the dimensions of the current Cb in horizontal and vertical directions, for example in terms of luma samples. A reference block Rb = R(Cb) which is selected for motion prediction of the currently coded block (Cb) may be obtained by applying motion information MI(Cb) to the currently coded block Cb.
In some embodiments another set of motion parameters than that listed above in Equation (12) may be selected for the motion information MI. Some motion parameters have been listed earlier.
In some embodiments, MI may include information of the prediction type (e.g. intra, uni-prediction, bi-prediction). In the case of bi-prediction, MI may include two reference indexes and two motion vectors.
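For illustration only, the motion information MI(Cb) of equation (12), extended with a prediction type and the optional second reference index and motion vector for bi-prediction, could be represented by a simple data structure such as the following; the field names are assumptions, not terminology from any standard.

```python
# Sketch of MI(Cb) as a data structure (field names are illustrative).

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MotionInfo:
    coordinates: Tuple[int, int]               # (x, y) of the top-left pixel of Cb
    sizes: Tuple[int, int]                     # (height, width) in luma samples
    pred_type: str                             # e.g. "intra", "uni", "bi"
    ref_idx_l0: Optional[int] = None           # refIdx into reference picture list 0
    mv_l0: Optional[Tuple[int, int]] = None    # (mvx, mvy) displacement to Rb
    ref_idx_l1: Optional[int] = None           # second reference index (bi-prediction)
    mv_l1: Optional[Tuple[int, int]] = None    # second motion vector (bi-prediction)

# Example: a uni-predicted 16x16 block at (64, 32) referring to list-0 picture 0.
mi_cb = MotionInfo(coordinates=(64, 32), sizes=(16, 16),
                   pred_type="uni", ref_idx_l0=0, mv_l0=(-3, 1))
```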
In some embodiments, motion information that may be utilized for coding of a current block (Cb), for example for motion vector prediction, may be obtained from a block located in the spatial and/or temporal neighborhood of the Cb. This block serves as a source of motion information and is named a source block (Sb).
Alternatively or in addition, motion information that may be utilized for coding of a current block (Cb), for example for motion vector prediction, may be obtained from a block obtained through a process of motion compensated prediction (MCP), disparity compensated prediction (DCP), view synthesis based prediction (VSP), and/or inter-layer prediction, and therefore may be located in a picture that belongs to a different layer or in a picture derived as part of the (de)coding process, such as a view synthesis reference picture, which is not coded and is intended for (de)coding operations only.
Alternatively or in addition, motion information that may be utilized for coding of a current block (Cb), for example for motion vector prediction, may be obtained from a block obtained through the process of motion compensated prediction (MCP), disparity compensated prediction (DCP), view synthesis based prediction (VSP), and/or inter-layer prediction, and therefore may be located in a picture that belongs to a different layer and/or represents a different time entity than that of the current picture.
In some embodiments, motion information of the source block MI(Sb) may be utilized for prediction of motion information MI(Cb) of the current block. Said utilization can be conducted in a form of motion information inheritance, and/or motion information prediction, and/or through other derivatives, e.g. a non-linear restriction.
In some embodiments, motion information of the source block MIS = MI(Sb) may be adjusted in order to be utilized for motion information prediction of the current block Cb.
Said adjustment may be performed in the form of scaling (e.g. multiplying by a factor) and/or offsetting (e.g. adding an offset value) particular parameters of MIS, and this process results in producing MIS adjusted (MISA). Parameters of motion information adjustment may be for example a scale factor scaleX and horizontal and vertical displacement offsetsX = {offset_x, offset_y}, where X may represent a parameter within MIS.
Said adjustment of MIS to MISA can be performed for a complete set of motion information MI or for a fraction of it as follows: MISA = {coordinates(MIS)*scale1 + offset1, MV(MIS)*scale2 + offset2, sizes(MIS)*scale3 + offset3}.

Derivation of Disparity Vectors

The concept of inter-view motion prediction may require a disparity vector to locate a corresponding block of the current PU/CU in an already coded picture of the same time instance. Therefore, an encoder and/or a decoder may provide e.g. four possibilities to derive a disparity vector.
In a first method the disparity vector is derived from a depth map belonging to a view coded prior to the current view. Therefore, the complete depth map may be warped to the current view. In a second method a complete low resolution depth map is estimated for the current picture without utilizing any coded depth maps. In a third method coding of a depth map is not required either. However, in contrast to the second method an estimation of a complete depth map may not be performed. The disparity may be derived from spatial and temporal neighbouring blocks which are using inter-view prediction or from motion vectors which may be obtained by inter-view prediction. A fourth method combines the first and third methods. In contrast to the first method where forward warping is utilized, in the fourth method a disparity vector may be derived first as may be done in the third method. This disparity vector may then be utilized to identify a depth block in an already coded depth view to perform backward warping.
The first and fourth methods may need the transmission of depth data as part of the bitstream, and by using one of these methods a decoder may decode the depth maps of previously coded views for decoding dependent views. The second and third methods may also be applicable if depth maps are not coded inside the bitstream, and when depth maps are coded, the decoding of the video pictures may still be independent of the depth maps.
In the following, all the above-mentioned four methods are described in more detail, by which a suitable disparity vector for the current block can be derived based on already transmitted information. All of the methods or any combination of them may have been integrated in the encoder and/or the decoder, and one of the methods may be chosen by configuring the encoder/decoder accordingly.
In the first method the depth data may be transmitted as a part of the bitstream, and a decoder using this method decodes the depth maps of previously coded views for decoding dependent views. In other words, the depth map estimate can be based on an already coded depth map. If the depth map for a reference view is coded before the current picture, the reconstructed depth map can be mapped into the coordinate system of the current picture for obtaining a suitable depth map estimate for the current picture.
In Figure 13, such a mapping is illustrated for a simple depth map, which consists of a square foreground object and background with constant depth. For each sample of the given depth map, the depth sample value may be converted into a sample-accurate disparity vector. Then, each sample of the depth map may be displaced by the disparity vector. If two or more samples are displaced to the same sample location, the sample value that represents the minimal distance from the camera (i.e., the sample with the larger value in some embodiments) may be chosen, for example. In general, the described mapping leads to sample locations in the target view to which no depth sample value is assigned. These sample locations are depicted as a black area in the middle of the picture of Figure 13. These areas represent parts of the background that are uncovered due to the movement of the camera and can be filled using surrounding background sample values. A simple hole filling algorithm may be used which processes the converted depth map line by line. Each line segment that consists of successive sample locations to which no value has been assigned is filled with that depth value of the two neighbouring samples which represents the larger distance to the camera (i.e., the smaller depth value in some embodiments).
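A hedged sketch of such line-by-line hole filling is shown below; the representation of holes as None values and the convention that a smaller depth value means a larger distance to the camera are assumptions for the purpose of this example.

```python
# Sketch of line-by-line hole filling: unassigned locations (None) are filled
# with the neighbouring depth value that represents the larger distance to the
# camera, i.e. the smaller depth value under the convention assumed here.

def fill_holes_line(depth_line):
    """depth_line: list of depth values with None marking holes; returns a filled copy."""
    filled = list(depth_line)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:   # find the extent of the hole segment
                j += 1
            left = filled[i - 1] if i > 0 else None
            right = filled[j] if j < n else None
            neighbours = [v for v in (left, right) if v is not None]
            if neighbours:
                fill_value = min(neighbours)     # smaller depth = farther background
                for k in range(i, j):
                    filled[k] = fill_value
            i = j
        else:
            i += 1
    return filled

# Example: a hole between a background (20) and a foreground (200) region.
print(fill_holes_line([20, 20, None, None, 200, 200]))  # -> [20, 20, 20, 20, 200, 200]
```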
The disparity vector used for inter-view motion or residual prediction of a block of the current picture may be derived based on the maximum value within the associated depth block.
The left part of Figure 13 illustrates the original depth map; the middle part illustrates the converted depth map after displacing the original samples; and the right part illustrates the final converted depth map after filling the holes.
The above described method may only be applicable if depth maps are included in the bitstream, and by using this method, the video pictures (except the base view) cannot be decoded independently of the depth maps.
In the second example method the depth map estimate is based on data that are available in the coded representations of the video pictures, e.g. coded disparity and motion vectors. When using this method, one depth sample may be derived for e.g. a 4x4 block of luma samples. Consequently, the estimated depth maps have 1/4th of the horizontal and vertical resolution of the luma components. The disparity vector used for inter-view motion or residual prediction of a block of the current picture may be derived based on the maximum value within the associated depth block.
In random access units, all blocks of the base view picture are intra-coded. In the pictures of dependent views, most blocks may be coded using disparity-compensated prediction (DCP, also known as inter-view prediction) and the remaining blocks may be intra-coded. When coding the first dependent view in a random access unit, no depth or disparity information is available. Hence, candidate disparity vectors can only be derived using a local neighbourhood, i.e., by conventional motion vector prediction. But after coding the first dependent view in a random access unit, the transmitted disparity vectors can be used for deriving a depth map estimate, as illustrated in Figure 14.
Therefore, the disparity vectors used for disparity-compensated prediction are converted into depth values and all depth samples of a disparity-compensated block are set equal to the derived depth value. The depth samples of intra-coded blocks are derived based on the depth samples of neighbouring blocks; the used algorithm is similar to spatial intra prediction. If more than two views are coded, the obtained depth map can be mapped into other views using the method described above and used as a depth map estimate for deriving candidate disparity vectors.
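As an illustration of this block-wise assignment, the following sketch converts a disparity vector into a depth sample value and fills the whole block with it; the linear disparity-to-depth-sample mapping over an assumed disparity range is only one possible convention (actual implementations may instead use camera parameters such as znear/zfar), and all function and parameter names are hypothetical.

```python
# Hedged sketch: set all depth samples of a disparity-compensated block to a
# value derived from its disparity vector (assumed linear mapping).

def disparity_to_depth_sample(disp_x, disp_min, disp_max, max_depth_value=255):
    """Assumes disp_max > disp_min; clips disparity into the assumed range."""
    disp_x = max(disp_min, min(disp_max, disp_x))
    return int(round((disp_x - disp_min) / (disp_max - disp_min) * max_depth_value))

def update_depth_estimate(depth_est, block, disp_x, disp_min, disp_max):
    """depth_est: 2-D list (depth-map resolution); block: (x, y, width, height)."""
    value = disparity_to_depth_sample(disp_x, disp_min, disp_max)
    x, y, w, h = block
    for row in range(y, y + h):
        for col in range(x, x + w):
            depth_est[row][col] = value
    return depth_est
```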
The depth map estimate for the picture of the first dependent view in a random access unit is used for deriving a depth map for the next picture of the first dependent view. The basic principle of the algorithm is illustrated in Figure 15. After coding the picture of the first dependent view in a random access unit, the derived depth map is mapped into the base view and stored together with the reconstructed picture. The next picture of the base view may be inter-coded, for example. For each block that is coded using motion compensated prediction (MCP), the associated motion parameters are applied to the depth map estimate. A corresponding block of depth map samples is obtained by motion compensated prediction with the same motion parameters as for the associated texture block; instead of a reconstructed video picture the associated depth map estimate is used as a reference picture. In order to simplify the motion compensation and avoid the generation of new depth map values, the motion compensated prediction for a depth block may not involve any interpolation. The motion vectors may be rounded to sample precision before they are used. The depth map samples of intra-coded blocks are again determined on the basis of neighbouring depth map samples. Finally, the depth map estimate for the first dependent view, which is used for the inter-view prediction of motion parameters, is derived by mapping the obtained depth map estimate for the base view into the first dependent view.
After coding the second picture of the first dependent view, the estimate of the depth map is updated based on actually coded motion and disparity parameters, as illustrated in Figure 16. For blocks that are coded using disparity-compensated prediction, the depth map samples are obtained by converting the disparity vector into a depth value. The depth map samples for blocks that are coded using motion compensated prediction can be obtained by motion compensated prediction of the previously estimated depth maps, similarly as for the base view. In order to account for potential depth changes, a mechanism by which new depth values are determined by adding a depth correction may be used. The depth correction is derived by converting the difference between the motion vectors for the current block and the corresponding reference block of the base view into a depth difference. The depth values for intra-coded blocks are again determined by spatial prediction. The updated depth map is mapped into the base view and stored together with the reconstructed picture. It can also be used for deriving a depth map estimate for other views in the same access unit.
For the following pictures, the described process is repeated. After coding the base view picture, a depth map estimate for the base view picture is determined by motion compensated prediction using the transmitted motion parameters. This estimate is mapped into the second view and used for the inter-view prediction of motion parameters. After coding the picture of the second view, the depth map estimate is updated using the actually used coding parameters. At the next random access unit, the inter-view motion parameter prediction is not used, and after decoding the first dependent view of the random access unit, the depth map may be re-initialized as described above.
In the third example method the disparity vector is derived from a motion vector of a spatial or temporal DCP neighbouring block or from a disparity vector associated with an MCP neighbouring block. Once a disparity motion vector is found, the whole disparity vector derivation process may terminate. First, temporal DCP neighbouring blocks may be evaluated as follows.
Up to two reference pictures from the current view may be treated as candidate pictures for temporal neighbours. The first candidate picture is the co-located picture as used for Temporal Motion Vector Prediction (TMVP) in HEVC without the low delay check. The co-located picture may be indicated in a slice header. The second picture is derived in the reference picture lists with the ascending order of reference picture indices, and added into the candidate list, as follows.
A random access point (RAP) is searched in the reference picture lists. If found, the random access point is placed into the candidate list for the second picture and the derivation process is completed. In a case that the random access point is not available for the current picture, the following may be applied.
A picture with the lowest TemporalId (TID) is searched out and placed into the candidate list of the temporal pictures as the second entry.
If multiple pictures with the same lowest TemporalId exist, a picture with a smaller picture order count difference from the current picture may be chosen.
As shown in the above description, the second temporal candidate picture may be chosen in a way that disparity motion vectors can have more chance to be present in the picture. The derivation process of the second candidate picture can be done at the slice level and be invoked only once per slice. For each candidate picture, up to two temporal neighbouring blocks, bottom-right (BR) and Centre, as depicted in Figure 18, are searched. The search order may be as follows: first the BR block and then the Centre block. The BR block may not be considered when it is located below the lower CTU row of the current CTU, as e.g. depicted in Figure 20.
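An illustrative, non-normative sketch of deriving the second temporal candidate picture as described above is given below; the dictionary fields used to represent a reference picture are assumptions.

```python
# Sketch: derive the second temporal candidate picture. Prefer a random access
# point (RAP); otherwise pick the picture with the lowest TemporalId, breaking
# ties by the smallest picture order count (POC) difference to the current picture.

def derive_second_candidate_picture(reference_pictures, current_poc):
    """reference_pictures: pictures from the reference picture lists, traversed
    in ascending order of reference picture indices."""
    for pic in reference_pictures:
        if pic["is_rap"]:
            return pic                       # RAP found: derivation completed
    # No RAP available: lowest TemporalId, then smallest POC difference.
    lowest_tid = min(pic["temporal_id"] for pic in reference_pictures)
    candidates = [p for p in reference_pictures if p["temporal_id"] == lowest_tid]
    return min(candidates, key=lambda p: abs(p["poc"] - current_poc))
```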
When the search of the two neighbouring blocks BR and Centre has been performed, a check of the spatial DCP neighbours may next be performed e.g. as follows. Five spatial neighbouring blocks may be used for the disparity vector derivation. They may be the below-left, left, above-right, above and above-left blocks of the current prediction unit (PU), denoted by A0, A1, B0, B1 or B2, as defined in Figure 17. To enable the disparity vector derivation process to be performed in a parallel way, two constraints on the searched blocks may be applied. The first constraint is that disparity vectors are not derived from neighbouring blocks in the same CU, when the CU contains two PUs. Figure 19 shows an example where for the second PU block A1 is not used for disparity vector derivation.
After checking the spatial DCP neighbours, MCP coded neighbour blocks may be searched e.g. as follows. In addition to the DCP coded blocks, blocks coded by motion compensated prediction (MCP) are also used for the disparity derivation process. When a neighbour block is an MCP coded block and its motion is predicted by the inter-view motion prediction, as shown in Figure 21, the disparity vector used for the inter-view motion prediction represents a motion correspondence between the current and the inter-view reference picture. This type of motion vector may be referred to as an inter-view predicted motion vector (IvpMv) and the blocks may be referred to as DV-MCP blocks in the sequel. The motion correspondence may be used for the disparity derivation process as explained in the following.
To indicate whether a block is a DV-MCP block or not and to save the disparity vector used for the inter-view motion prediction, two variables may be used: IvpMvFlag and IvpMvDisparityX.
The block whose motion vector is inter-view predicted is identified when the 0th motion parameter candidate of MERGE/SKIP mode is selected. In that case, the IvpMvFlag and IvpMvDisparityX corresponding to the location of the current PU may be set to 1 and to the horizontal component of the disparity vector used for the inter-view motion prediction, respectively.
The disparity vector is derived from SKIP coded DV-MCP blocks. When a block is coded by skip mode, neither mvd (motion vector difference) data nor residual data is signalled, which implies that the disparity vector used for a SKIP coded DV-MCP block may better describe the motion correspondence than the disparity vector used for DV-MCP blocks that are not SKIP coded.
If a DCP coded block is not found in the spatial and temporal neighbour blocks, then the disparity derivation process may scan the spatial neighbour blocks for DV-MCP compensated blocks in the following order: A0, A1, B0, B1, B2. If a neighbour block is a SKIP coded DV-MCP block, then the value of IvpMvDisparityX at the neighbour block may be returned as the derived disparity. The vertical component of the disparity vector is set equal to zero.
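A sketch of this fall-back scan is shown below; the block labels follow the order given above, while the dictionary fields and the zero-vector default (described two paragraphs further on) are illustrative assumptions.

```python
# Sketch: fall-back scan over spatial neighbours for SKIP coded DV-MCP blocks,
# performed when no DCP coded block is found among spatial/temporal neighbours.

def derive_disparity_from_dvmcp(neighbours):
    """neighbours: dict mapping block labels ('A0', 'A1', 'B0', 'B1', 'B2')
    to block info dicts, or None if a neighbour is not available."""
    for label in ("A0", "A1", "B0", "B1", "B2"):
        block = neighbours.get(label)
        if block and block.get("is_skip_dvmcp") and block.get("IvpMvFlag"):
            # Horizontal component taken from IvpMvDisparityX, vertical set to 0.
            return (block["IvpMvDisparityX"], 0)
    return (0, 0)  # zero disparity vector when nothing is found
```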
To reduce the amount of memory required for the derivation of the disparity from DV-MCP blocks, blocks B0, B1 and B2 may only be utilized when they are located in the current CTU. An example of this can be seen in Figure 22. Here only spatial neighbour blocks A1 and A0 are utilized.
When no disparity motion vector is found from the neighbouring blocks, a zero disparity vector may be used for inter-view motion prediction.
In the fourth example method the disparity vector may be derived from a depth map of a different view component. While coding the texture of a dependent view, the decoded depth of the base view is already available. So the disparity derivation needed for the coding of the texture of the dependent view may be improved by utilizing the depth map of the base view. A disparity vector, which might be a better estimate than a disparity vector derived with the third method, may be extracted by the following steps.
A disparity vector is derived by the third method and the disparity vector is used to locate the corresponding block in the coded depth of the base view. The depth in the corresponding block in the base depth is assumed to be the "virtual depth block" of the current block in the dependent view. The maximum depth value of the virtual depth block (or alternatively of the centre and edge samples of the virtual depth block) may be retrieved. The maximum depth value may be converted to disparity. An example is depicted in Figure 23. The coded depth map in view 0 is denoted as Coded D0. The texture to be coded is T1. For the current block (CB) a depth block in the coded D0 may be derived using the disparity vector estimated by the third method.
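The refinement step of the fourth method could be sketched as below, under stated assumptions: a depth_to_disparity() conversion (e.g. based on camera parameters) is taken as given, clipping of the virtual depth block to the picture boundaries is omitted for brevity, and the vertical component is kept at zero for illustration.

```python
# Hedged sketch: refine an initial disparity vector using the "virtual depth
# block" fetched from the coded base-view depth map.

def refine_disparity(initial_dv, base_depth_map, cb_pos, cb_size, depth_to_disparity):
    """initial_dv: (dx, dy) from the third method; base_depth_map: 2-D list;
    cb_pos/cb_size: position and size of the current block in depth-map units."""
    x, y = cb_pos
    w, h = cb_size
    dx, dy = initial_dv
    vx, vy = x + dx, y + dy                      # locate the virtual depth block
    block = [base_depth_map[row][vx:vx + w] for row in range(vy, vy + h)]
    max_depth = max(max(row) for row in block)   # maximum depth value in the block
    return (depth_to_disparity(max_depth), 0)    # vertical component assumed zero
```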
In a nearest block disparity vector derivation (NBDV) process neighbouring blocks are checked in order, and, once a neighbouring block contains a disparity motion vector (thus it is inter-view predicted), the motion vector is identified to be a disparity vector, and the whole process terminates. Temporal neighbouring blocks are checked first and the spatial neighbouring blocks are checked afterwards. The identified disparity vector is the result of the NBDV process.
Disparity derivation and disparity vector(s) resulting from the disparity derivation may be used in various (de)coding processes or algorithms including but not limited to one or more of the following:
- usage of the disparity vector in motion vector prediction;
- usage of the disparity vector in view synthesis prediction or disparity-compensated inter-view prediction;
- usage of the disparity vector in inter-view residual prediction;
- usage of the disparity vector in inter-view disparity-compensated parameter prediction.
The generation of the (de)coded depth map or depth data can in some circumstances lead to depth data which is considered to be not valid and/or available. In other words, the depth data is unable to produce good quality results in the decoder. The (de)coded depth data can be considered to be unavailable (or invalid or similar) for various reasons including but not limited to the following: The (de)coded depth data is used for disparity vector derivation or alike in the texture video (de)coding but is of insufficient quality and/or resolution for other purposes such as synthesizing views for display.
The (de)coded data for a certain area may have been determined by the encoder as unnecessary for view synthesis, for example in the view synthesis optimization process or similar of the encoder.
The original uncompressed depth data may have been unavailable for the encoder. For example, the depth estimation process may have failed to produce depth data for a certain spatial area, such as an area where camera views do not overlap.
The required bitrate to encode a certain area of a depth map (e.g. an area with narrow stripes with different depth values) may be so high that the RDO decides it is not worth encoding. The quality of depth maps using depth estimation (based on available decoded texture views) might be high enough/comparable to the quality of decoded depth maps.
Therefore the concept reflected in embodiments as described herein is to determine at least one of the following and indicate to the decoder when it occurs: whether or not the decoded depth data is available or valid; which parts of the decoded depth data are available or valid (and which not); for which purpose(s) the depth data is available or valid (and for which purposes it is not available or valid); and whether the depth estimation succeeds to estimate depth maps (from available decoded texture views) of enough or comparable quality to decoded depth maps.
Figure 4b shows an example depth data validity determiner apparatus as employed in the encoder according to some embodiments. Furthermore, with respect to Figure 24 the operation of the example depth data validity determiner apparatus as employed in the encoder according to some embodiments is shown. In some embodiments the depth data validity determiner apparatus comprises a depth data validity determiner 4001.
The depth data validity determiner 4001 can be configured to receive the decoded depth data and perform a validity check on the data. For example in some embodiments the depth data validity determiner 4001 can be configured to determine whether the depth data is available (or valid) at all. For example the depth estimation process may have failed to produce any depth data due to a failure in the image capture or encoding process.
In some embodiments the depth data validity determiner 4001 can be configured to determine which parts of the decoded depth data are available (or valid) and consequently which parts of the decoded depth data are not available (or not valid). For example some parts may be not available where the depth estimation process has failed to produce depth data for a spatial area such as an area where camera views do not overlap.
In some embodiments the depth data validity determiner 4001 can be configured to determine where the depth data is valid or available for specific purposes or specific processing operations. A processing operation may take place as a part of the encoding and/or decoding process and/or a processing operation may take place outside the encoding and/or decoding process, for example as a post-processing operation. For example in some embodiments the encoder could have determined that the depth data is determined as being unnecessary or not available for view synthesis because of the operation of a view synthesis optimisation process or similar in the encoder. Said view synthesis may be used for view synthesis prediction or alike within the (de)coding loop and/or the encoder may expect or assume or have knowledge that the decoder side uses said view synthesis subsequent to decoding for example for display purposes, e.g. to generate additional views for a multi-view auto-stereoscopic display. Additionally or alternatively, in some embodiments the encoder could have determined that the depth data is determined as being unnecessary or not available for disparity derivation and/or any (de)coding process using the disparity derivation.
Furthermore in some embodiments the depth data validity determiner 4001 can perform a check on the depth data to check the depth data quality. For example the required bitrate to include a certain area of a depth map may be so high that the RDO decides it is not worth encoding. Furthermore in some circumstances the depth data validity determiner determines that the quality of the depth estimation is high enough or comparable to the quality of the depth data. In some embodiments a combination of these determinations can be made, for example a quality determination to determine which use of the depth data can be permitted. For example in some embodiments the depth data is determined to be of sufficient quality to be used for disparity vector derivation or alike in the texture video (de)coding but determined to be of insufficient quality and/or resolution for other purposes such as synthesizing views for display. Thus it would be understood that the determination can be any suitable selection or combination of determinations.
The operation of performing a validity check on the depth data, such as whether it is available, partially available, for which purposes it is available, and the quality of the data, is shown in Figure 24 by step 4101.
In some embodiments the depth data validity determiner apparatus comprises a depth data validity indicator inserter 4003. The depth data validity indicator inserter 4003 can be configured to generate a suitable indication to be inserted into the datastream or the bitstream indicating the availability and validity of the depth data. Thus for example the indicator can be one indicating whether the depth data is totally valid, partially valid (and which parts are valid), the purposes for which the data can be used, and the quality of the data.
It would be understood that in some embodiments any combination or selection of the types of indicator can be used.
The encoder may encode indication(s) in the bitstream and consequently the decoder may decode indication(s) from the bitstream controlling whether or not an associated depth view and/or associated depth view components are to be output from the decoding process. The indication(s) may reside for example in a sequence parameter set, a picture parameter set, a supplemental enhancement information (SEI) message, or a slice header, or any other syntax structure.
The syntax structure may determine the scope or validity or persistence of the indication(s). For example, if the indication resides in a sequence parameter set, it can in some embodiments be valid for the coded video sequence for which the sequence parameter set is active. Likewise, if the indication resides in a picture parameter set, it may be valid for the picture or view component for which the picture parameter set is active. Alternatively, the scope/validity/persistence of the indication(s) may be included in the indication(s) or other syntax elements associated with the indication(s).
In some embodiments the depth data validity indicator inserter 4003 can be configured to insert the indication(s) into a sequence parameter set, for example in the context of 3D-AVC, such as a sequence parameter set syntax structure including a depth_output_flag syntax element.
In some embodiments the depth_output_flag may affect the decoded picture output and removal processes for depth view components, which may be specified within the reference decoding process for example as a part of the HRD specification or similar.
The depth_output_flag or similar may control whether the associated depth picture(s) are initially marked as "needed for output" or similar, i.e. whether the associated depth picture(s) are to be output by the decoder. The associated depth picture(s) may be for example those for which the sequence parameter set including the depth_output_flag is an active sequence parameter set or an active layer or view sequence parameter set.
Furthermore in some embodiments the depth_output_flag is valid for all depth views.
The depth data validity indicator inserter 4003 can be configured in some embodiments to specify that for a decoded depth view component of each target output view, OutputFlag is set equal to depth_output_flag (of the active view 3D-AVC sequence parameter set for the decoded depth view component). The OutputFlag variable or similar may control whether the associated picture is initially marked as "needed for output" or similar, i.e. whether the associated picture is to be output by the decoder.
In some embodiments the depth data validity indicator inserter 4003 can be configured to insert the depth_output_flag in a picture parameter set or a slice header. In such embodiments the depth_output_flag or indicator can be valid for the depth view component or the depth picture referring to the picture parameter set or the depth view component or the depth picture containing the slice header. In such embodiments the depth data validity indicator inserter 4003 can be configured to specify that for the associated depth view component or the associated depth picture, OutputFlag is set equal to depth_output_flag.
As described herein the encoder (or the depth data validity indicator inserter 4003) can be configured to encode in the bitstream and the decoder may decode from the bitstream an indicator or a map of indicators defining which areas of depth data are available and which not. A map of indicators may for example be represented by a quadtree structure.
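A minimal sketch of one possible quadtree representation of such an availability map is shown below; the data structure and its encoding are illustrative only, not a normative syntax.

```python
# Sketch: represent a map of depth-availability indicators with a quadtree.
# A region becomes a single leaf flag if it is uniformly available or
# unavailable; otherwise it is split into four quadrants.

def build_quadtree(avail, x, y, size):
    """avail(x, y) -> True/False availability of the depth sample at (x, y);
    size is assumed to be a power of two."""
    values = {avail(x + dx, y + dy) for dy in range(size) for dx in range(size)}
    if len(values) == 1 or size == 1:
        return {"leaf": True, "available": values.pop()}
    half = size // 2
    return {"leaf": False, "children": [
        build_quadtree(avail, x,        y,        half),
        build_quadtree(avail, x + half, y,        half),
        build_quadtree(avail, x,        y + half, half),
        build_quadtree(avail, x + half, y + half, half),
    ]}

# Example: a 4x4 area where only the top-left 2x2 quadrant is unavailable.
tree = build_quadtree(lambda x, y: not (x < 2 and y < 2), 0, 0, 4)
```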
Furthermore the encoder (the depth data validity indicator inserter 4003) can generate a suitable indication (which is decoded by the decoder) indicating for which purpose(s) the depth data is available and/or for which purpose(s) the depth data is unavailable. For example in some embodiments, the encoder (the depth data validity indicator inserter 4003) can be configured to encode the map as an SEI message and the decoder may decode the map from an SEI message.
In some embodiments, the encoder (the depth data validity indicator inserter 4003) can be configured to encode the map as part of coded slice data. For example in some embodiments the indicator or map of indicators is encoded as part of coding unit data (and the decoder may decode the map from the coded slice data). In some embodiments the indicator or map of indicators can be (de)coded using context adaptive entropy (de)coding, such as CABAC, with the context taking into account the map values of the coding units above and on the left of the current coding unit.
In some embodiments, the encoder (the depth data validity indicator inserter 4003) can be configured to replace or control the action of replacing the areas indicated to be unavailable. The replacement can for example be done in some embodiments after reconstructing or decoding the area indicated to be unavailable. In some embodiments the replacement can be performed after decoding the depth view component containing the area. In some embodiments the replacement can be performed prior to using the depth view component containing the area as a reference for prediction. Furthermore in some embodiments the replacement can be performed prior to outputting the depth view component containing the area.
In some embodiments, the encoder (the depth data validity indicator inserter 4003) can be configured to indicate in the bitstream at which point the encoder/decoder replaces the areas indicated to be unavailable. Furthermore in some embodiments, the encoder (the depth data validity indicator inserter 4003) and/or the decoder infer when the encoder/decoder replaces the areas indicated to be unavailable.
In some embodiments, the areas indicated to be unavailable may be replaced by a global disparity value indicated by the encoder and/or inferred by the encoder and/or the decoder. Embodiments related to the derivation and/or indication of a global disparity value are presented further below.
In some embodiments, the areas indicated to be unavailable may be replaced using a nearest block disparity vector derivation process or similar from spatial and/or temporal neighbours.
In some embodiments, the areas indicated to be unavailable are filled with estimated depth values from texture views. For example, a stereo matching algorithm may be applied to decoded or reconstructed texture views.
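A toy, non-normative illustration of this idea is given below: a sum-of-absolute-differences block matcher estimates a disparity between two decoded texture views for an area whose depth is unavailable, and a simple pinhole-model conversion turns that disparity into a depth value. Real stereo matching algorithms and depth-to-disparity conversions depend on camera parameters and are considerably more elaborate; the 4-pixel synthetic shift and all constants are purely illustrative.

```python
# Toy illustration only: estimating disparity for an "unavailable" depth block by
# SAD block matching between two decoded texture views, then converting the
# disparity to a depth value with an assumed pinhole model (depth = f * b / d).

import numpy as np

def estimate_disparity(left, right, y, x, block=8, max_disp=32):
    ref = left[y:y + block, x:x + block].astype(np.int32)
    best_d, best_cost = 0, np.inf
    for d in range(0, min(max_disp, x) + 1):
        cand = right[y:y + block, x - d:x - d + block].astype(np.int32)
        cost = np.abs(ref - cand).sum()
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def disparity_to_depth(d, focal_length, baseline):
    # Assumed pinhole model; guards against division by a zero disparity.
    return focal_length * baseline / max(d, 1e-3)

# Usage with a synthetic 4-pixel horizontal disparity between the views.
rng = np.random.default_rng(0)
left = rng.integers(0, 256, (64, 64), dtype=np.uint8)
right = np.roll(left, -4, axis=1)
d = estimate_disparity(left, right, y=16, x=32)        # expected: 4
depth = disparity_to_depth(d, focal_length=1000.0, baseline=0.1)
```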
In some embodiments, the areas indicated to be unavailable are filled using DIBR algorithms applied to available depth maps. For example, a decoded or reconstructed depth picture is warped or mapped, using DIBR or view synthesis, to the view represented by the depth picture for which the areas are indicated to be unavailable.
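The following sketch, given under simplifying assumptions (a single row of samples, an integer depth-to-disparity mapping supplied by the caller, and larger depth values treated as closer to the camera), illustrates the forward-warping step of such a DIBR-style replacement; disoccluded positions are left as holes (None) for a subsequent hole filling process.

```python
# Illustrative sketch: forward-warping a decoded depth row to the viewpoint of the
# view whose depth areas are unavailable. Each sample is shifted by its disparity;
# conflicts are resolved by a simple z-buffer and disocclusions remain as holes.

def warp_depth(depth_row, depth_to_disparity):
    """depth_row: list of depth samples; depth_to_disparity: callable depth -> int shift."""
    warped = [None] * len(depth_row)
    for x, depth in enumerate(depth_row):
        target = x + depth_to_disparity(depth)
        if 0 <= target < len(depth_row):
            # keep the closer sample (assumed: larger depth value = closer) on conflicts
            if warped[target] is None or depth > warped[target]:
                warped[target] = depth
    return warped

# Usage with a toy linear depth-to-disparity mapping.
row = [10, 10, 200, 200, 10, 10]
print(warp_depth(row, lambda d: d // 100))   # [10, 10, None, None, 200, 200]
```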
In some embodiments, the areas indicated to be unavailable are not used as reference for prediction or source for disparity vector derivation or the like in the encoder and/or the decoder.
In some embodiments, an indicator or a map of indicators defining which areas of depth data are available and which are not is output by the encoder and/or the decoder along with the associated depth picture. In some embodiments, additionally or alternatively, an indicator or a map of indicators indicating for which purpose(s) the depth data is available and/or for which purpose(s) the depth data is unavailable is output by the encoder and/or the decoder along with the associated depth picture.
In some embodiments, the decoder side avoids using the areas indicated to be unavailable in a depth-image based rendering or view synthesis process.
The operation of inserting or generating a suitable indicator is shown in Figure 24 by step 4103.
Figure 5b shows an example depth data validity determiner apparatus as employed in the decoder according to some embodiments. Furthermore with respect to Figure 25 the operation of the example depth data validity determiner apparatus as employed in the decoder according to some embodiments is shown in further detail. In some embodiments the depth data validity determiner comprises a depth data validity indicator detector 5001. The depth data validity indicator detector 5001 is configured to receive the datastream and decode any depth data indicators embedded within the datastream or the bitstream. It would be understood that any of the example indicator insertion methods can be detected by a suitable detector configured to detect the indicator as described herein.
The depth data validity indicator detector 5001 can then pass the indicator to a depth data decoder controller 5003.
The operation of decoding the depth data indicator is shown in Figure 25 by step 5101.
In some embodiments the depth data validity determiner apparatus as employed in the decoder comprises a depth data decoder controller 5003. The depth data decoder controller 5003 can be configured to receive the decoded indicator and, based on the depth data validity indicator, control the use of the depth data.
Thus for example the depth data decoder controller can control the use of the depth data such that the indicator can control whether the depth data is omitted, replaced, used partially, or used for specific purposes only.
Thus for example the impact of OutputFlag in the example defined herein can be specified as follows. When OutputFlag is equal to 0, the depth view component is marked as "not needed for output". When OutputFlag is equal to 1, the depth view component is marked as "needed for output" until it is output (and then marked as "not needed for output"). A depth view component may be removed from the decoded picture buffer (in other words the picture storage buffer containing the depth view component may be emptied and used for another view component) when it is marked as "not needed for output" and "unused for reference".
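A non-normative sketch of this decoded picture buffer behaviour is shown below; the DpbEntry class and the output_and_prune function are illustrative stand-ins for the codec's actual DPB management.

```python
# Sketch (not normative DPB management): OutputFlag drives the "needed for output"
# marking, and a picture storage buffer is emptied once its depth view component is
# both "not needed for output" and "unused for reference".

class DpbEntry:
    def __init__(self, view_component, output_flag, used_for_reference):
        self.view_component = view_component
        self.needed_for_output = output_flag          # OutputFlag == 1
        self.used_for_reference = used_for_reference

def output_and_prune(dpb):
    outputs = []
    for entry in list(dpb):
        if entry.needed_for_output:
            outputs.append(entry.view_component)
            entry.needed_for_output = False           # marked after being output
        if not entry.needed_for_output and not entry.used_for_reference:
            dpb.remove(entry)                         # buffer can be reused
    return outputs

dpb = [DpbEntry("depth_view_0", output_flag=False, used_for_reference=False),
       DpbEntry("depth_view_1", output_flag=True, used_for_reference=True)]
print(output_and_prune(dpb))   # ['depth_view_1']; depth_view_0 is removed without output
print(len(dpb))                # 1 (depth_view_1 kept: still used for reference)
```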
In some embodiments, the depth data decoder controller 5003 can be configured to replace, or control the action of replacing, the areas indicated to be unavailable. The replacement can for example be done in some embodiments after decoding the area indicated to be unavailable. In some embodiments the replacement can be performed after decoding the depth view component containing the area. In some embodiments the replacement can be performed prior to using the depth view component containing the area as reference for prediction. Furthermore in some embodiments the replacement can be prior to outputting the depth view component containing the area.
In some embodiments, the depth data decoder controller 5003 can be configured to decode from the bitstream at which point the decoder replaces the areas indicated to be unavailable. Furthermore in some embodiments, the depth data decoder controller 5003 infers when the decoder replaces the areas indicated to be unavailable.
In some embodiments, the areas indicated to be unavailable may be replaced by a global disparity value decoded from the bitstream or inferred by the decoder. Embodiments related to the derivation and/or indication of a global disparity value are presented further below.
In some embodiments, the areas indicated to be unavailable can be replaced using a nearest block disparity vector derivation process or similar from spatial and/or temporal neighbours.
In some embodiments, the areas indicated to be unavailable are filled with estimated depth values from texture views. For example, a stereo matching algorithm may be applied to decoded or reconstructed texture views.
In some embodiments, the areas indicated to be unavailable are filled using DIBR algorithms applied to available depth maps. For example, a decoded or reconstructed depth picture is warped or mapped, using DIBR or view synthesis, to the view represented by the depth picture for which the areas are indicated to be unavailable. In some embodiments, the areas indicated to be unavailable are not used as reference for prediction or source for disparity vector derivation or the like in the decoder.
In some embodiments, the decoder side avoids using the areas indicated to be unavailable in a depth-image based rendering or view synthesis process.
In some embodiments, the encoder and/or the decoder may comprise derivation of a disparity vector, which may for example be referred to as a global disparity vector, for example using one of the following ways: a median of all registered disparity vectors; an average of all registered disparity vectors; a maximum of all registered disparity vectors; or, if a depth picture for the same time instant is available, a global disparity vector may be derived by applying a depth-to-disparity conversion to the depth picture, by quantizing the resulting disparity vectors to a certain accuracy, and by selecting the most used quantized disparity vector as the global disparity vector.
Registered disparity vectors may comprise for example all the inter-view motion vectors of the picture, which may be scaled for example to represent disparity between the current picture and the base view. In some embodiments, the registered disparity vectors may comprise all the inter-view motion vectors of the picture up to the current block (exclusive) and the inter-view motion vectors may be scaled for example to represent disparity between the current picture and the base view.
In some embodiments, the encoder and/or the decoder may derive more than one global disparity vector. For example, referring to the above, the encoder may select the N most used quantized disparity vectors, where N > 1.
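The following sketch illustrates, under assumed rounding and quantization rules, how one or more global disparity vectors could be derived from the registered disparity vectors using the options listed above (median, average, maximum, or the N most used quantized vectors); none of the constants are normative.

```python
# Illustrative derivation of global disparity vectors from the registered
# (inter-view) disparity vectors of a picture. The quantization step and the
# restriction to the horizontal component are assumptions for this example.

from collections import Counter
from statistics import median, mean

def global_disparity(registered, method="median"):
    horiz = [d[0] for d in registered]            # typically only the horizontal component
    if method == "median":
        return int(median(horiz))
    if method == "average":
        return int(round(mean(horiz)))
    if method == "max":
        return max(horiz)
    raise ValueError(method)

def most_used_quantized(registered, step=4, n=2):
    quantized = [round(d[0] / step) * step for d in registered]
    return [value for value, _ in Counter(quantized).most_common(n)]

vectors = [(14, 0), (15, 0), (17, 0), (30, 0), (16, 0)]
print(global_disparity(vectors, "median"))        # 16
print(most_used_quantized(vectors, step=4, n=2))  # [16, 32]
```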
The one or more global disparity vectors may be valid or applicable for a particular unit, which may have spatial, temporal, and view-wise dimensions. For example the unit may be a slice, a picture, an access unit, a sequence of pictures within a view, or a sequence of access units. Likewise, the one or more global disparity vectors may be derived by the encoder for said particular unit, included in the bitstream to be valid or applicable for said particular unit, and decoded by the decoder for said particular unit.
The unit may be pre-determined for example in a coding standard. Alternatively, the unit may be determined by the encoder and indicated by the encoder in the bitstream and decoded by the decoder from the bitstream. The unit may be explicitly indicated for example using a syntax element or may be implicitly indicated for example on the basis of the coding layer or hierarchy of the syntax structure containing the syntax elements for indicating the one or more global disparity vectors.
In some embodiments, the encoder encodes one or more global disparity vectors into a syntax structure within a bitstream. The syntax structure may, for example, be the slice header, the picture parameter set, the sequence parameter set, and/or the video parameter set. When more than one global disparity vector is present in the syntax structure(s), they may be indexed for example based on their order in the syntax structure(s). In some embodiments, the encoder and/or the decoder derive one or more global disparity vectors for example as described above.
The encoder may encode an index into the bitstream to be used for the explicit disparity stage in the disparity derivation, for example for a slice, for an explicitly indicated quadtree structure within a slice, or for a block (such as a CU or a PU), when earlier stages in the disparity derivation process have not resulted in a disparity vector. The decoder may decode the index from the bitstream and use the index to select which global disparity vector is used as output of the explicit disparity stage.
In some embodiments, the encoder may indicate in the bitstream and the decoder may decode from the bitstream a global disparity vector that indicates "no disparity vector available" or "NA". When a global disparity vector equal to NA is selected as an output of the explicit disparity stage for disparity derivation of a texture block, there may be no good disparity vector available for the current block. The encoding and decoding may then be adjusted accordingly to use (de)coding modes that do not make use of inter-view correlation for the texture block.
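An illustrative sketch of this explicit disparity stage is given below; the list of global disparity vectors, the decoded index and the NA sentinel are hypothetical stand-ins for the signalled values described above.

```python
# Sketch of the explicit disparity stage: when earlier derivation stages fail to
# produce a disparity vector for a block, a decoded index selects one of the
# signalled global disparity vectors. NA models "no disparity vector available".

NA = None

def explicit_disparity_stage(earlier_stage_result, global_disparity_vectors, decoded_index):
    if earlier_stage_result is not None:
        return earlier_stage_result                  # earlier stages succeeded
    return global_disparity_vectors[decoded_index]   # may be NA: no vector available

gdvs = [(12, 0), (24, 0), NA]
print(explicit_disparity_stage(None, gdvs, decoded_index=1))   # (24, 0)
print(explicit_disparity_stage(None, gdvs, decoded_index=2))   # None -> disable inter-view tools
```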
In some embodiments, only the horizontal component is coded for the global disparity vector(s), while the vertical component may be inferred to be equal to 0 or may be inferred from the camera parameters. In some embodiments, the precision of the horizontal component may be pre-defined, for example, in a coding standard or indicated by the encoder in the bitstream and decoded by the decoder from the bitstream. For example, the precision may be an integer luma sample precision. The precision need not be the same as the precision to indicate or derive motion vectors in the same coding scheme.
The operation of controlling the depth data use is shown in Figure 25 by step 5103.
Figure 4a shows a block diagram of a video encoder suitable for employing embodiments of the invention. Figure 4a presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly extended to encode more than two layers. Figure 4a illustrates an embodiment of a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer. Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures.
The encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404. Figure 4a also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406, an intra-predictor 308, 408, a mode selector 310, 410, a filter 316, 416, and a reference frame memory 318, 418. The pixel predictor 302 of the first encoder section 500 receives 300 base layer images of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame 318) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The output of both the inter-predictor and the intra-predictor are passed to the mode selector 310. The intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310. The mode selector 310 also receives a copy of the base layer picture 300. Correspondingly, the pixel predictor 402 of the second encoder section 502 receives 400 enhancement layer images of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame 418) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The output of both the inter-predictor and the intra-predictor are passed to the mode selector 410. The intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410. The mode selector 410 also receives a copy of the enhancement layer picture 400.
The mode selector 310 may use, in the cost evaluator block 382, for example Lagrangian cost functions to choose between coding modes and their parameter values, such as motion vectors, reference indexes, and intra prediction direction, typically on a block basis. This kind of cost function may use a weighting factor lambda to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: C = D + lambda * R, where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and their parameters, and R the number of bits needed to represent the required data to reconstruct the image block in the decoder (e.g. including the amount of data to represent the candidate motion vectors).
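For illustration only, the following fragment selects among candidate coding modes by minimising C = D + lambda * R; the candidate mode names, distortions, rates and the lambda value are invented for the example.

```python
# Minimal sketch of Lagrangian mode selection: each candidate mode reports its
# distortion D (e.g. mean squared error) and rate R (bits); the mode minimising
# C = D + lambda * R is chosen. All values below are made up for illustration.

def select_mode(candidates, lam):
    """candidates: iterable of (mode_name, distortion, rate_bits)."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

modes = [("intra_dc", 120.0, 40),
         ("inter_skip", 200.0, 2),
         ("inter_2Nx2N", 90.0, 95)]
print(select_mode(modes, lam=5.0))   # ('inter_skip', 200.0, 2): the low-rate mode wins at this lambda
```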
Depending on which encoding mode is selected to encode the current block, the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictor 302, 402 from the base layer picture 300/enhancement layer picture 400 to produce a first prediction error signal 320, 420 which is input to the prediction error encoder 303, 403.
The pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 and the output 338, 438 of the prediction error decoder 304, 404. The preliminary reconstructed image 314, 414 may be passed to the intra-predictor 308, 408 and to a filter 316, 416.
The filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 which may be saved in a reference frame memory 318, 418. The reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer picture 300 is compared in inter-prediction operations. Subject to the base layer being selected and indicated to be the source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer according to some embodiments, the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer picture 400 is compared in inter-prediction operations.
Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be the source for predicting the filtering parameters of the enhancement layer according to some embodiments.
The prediction error encoder 303, 403 comprises a transform unit 342, 442 and a quantizer 344, 444. The transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain. The transform is, for example, the DCT transform. The quantizer 344, 444 quantizes the transform domain signal, e.g. the DCT coefficients, to form quantized coefficients.
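A minimal, non-normative sketch of the quantization step applied to such transform coefficients is given below; the uniform quantization step size is an assumption and does not correspond to any standard-defined table.

```python
# Simple sketch of uniform quantization and dequantization of transform (e.g. DCT)
# coefficients; the step size of 10 is an illustrative assumption.

def quantize(coefficients, step):
    return [int(round(c / step)) for c in coefficients]

def dequantize(levels, step):
    return [level * step for level in levels]

coeffs = [310.2, -48.7, 12.1, -3.9, 0.4]
levels = quantize(coeffs, step=10)      # [31, -5, 1, 0, 0]
recon = dequantize(levels, step=10)     # [310, -50, 10, 0, 0] -> lossy reconstruction
```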
The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414. The prediction error decoder may be considered to comprise a dequantizer 361, 461, which dequantizes the quantized coefficient values, e.g. DCT coefficients, to reconstruct the transform signal, and an inverse transformation unit 363, 463, which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 363, 463 contains reconstructed block(s). The prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.
The entropy encoder 330, 430 receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide error detection and correction capability. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream e.g. by a multiplexer 508. For completeness a suitable decoder is hereafter described. However, some decoders may not be able to process enhancement layer data wherein they may not be able to decode all received images.
At the decoder side similar operations may be performed to reconstruct the image blocks. Figure 5a shows a block diagram of a video decoder 550 suitable for employing embodiments of the invention. In this embodiment the video decoder 550 comprises a first decoder section 552 for base view components and a second decoder section 554 for non-base view components. Block 556 illustrates a demultiplexer for delivering information regarding base view components to the first decoder section 552 and for delivering information regarding non-base view components to the second decoder section 554. The decoder shows an entropy decoder 700, 800 which performs an entropy decoding (E-1) on the received signal. The entropy decoder thus performs the inverse operation to the entropy encoder 330, 430 of the encoder described above. The entropy decoder 700, 800 outputs the results of the entropy decoding to a prediction error decoder 701, 801 and pixel predictor 704, 804. Reference P' stands for a predicted representation of an image block. Reference D' stands for a reconstructed prediction error signal. Blocks 705, 805 illustrate preliminary reconstructed images or image blocks (I'). Reference R' stands for a final reconstructed image or image block.
Blocks 703, 803 illustrate inverse transform (T-1). Blocks 702, 802 illustrate inverse quantization (Q-1). Blocks 706, 806 illustrate a reference frame memory (RFM). Blocks 707, 807 illustrate prediction (P) (either inter prediction or intra prediction). Blocks 708, 808 illustrate filtering (F). Blocks 709, 809 may be used to combine decoded prediction error information with predicted base view/non-base view components to obtain the preliminary reconstructed images (I'). Preliminary reconstructed and filtered base view images may be output 710 from the first decoder section 552 and preliminary reconstructed and filtered non-base view images may be output 810 from the second decoder section 554.
The pixel predictor 704, 804 receives the output of the entropy decoder 700, 800. The output of the entropy decoder 700, 800 may include an indication on the prediction mode used in encoding the current block. A predictor selector 707, 807 within the pixel predictor 704, 804 may determine that the current block to be decoded is an enhancement layer block. Hence, the predictor selector 707, 807 may select to use information from a corresponding block on another layer such as the base layer to filter the base layer prediction block while decoding the current enhancement layer block. An indication that the base layer prediction block has been filtered before use in the enhancement layer prediction by the encoder may have been received by the decoder, wherein the pixel predictor 704, 804 may use the indication to provide the reconstructed base layer block values to the filter 708, 808 and to determine which kind of filter has been used, e.g. the SAO filter and/or the adaptive loop filter, or there may be other ways to determine whether or not the modified decoding mode should be used.
The predictor selector may output a predicted representation of an image block P' to a first combiner 709. The predicted representation of the image block is used in conjunction with the reconstructed prediction error signal D' to generate a preliminary reconstructed image I'. The preliminary reconstructed image may be used in the predictor 704, 804 or may be passed to a filter 708, 808. The filter applies a filtering which outputs a final reconstructed signal R'. The final reconstructed signal R' may be stored in a reference frame memory 706, 806, the reference frame memory 706, 806 further being connected to the predictor 707, 807 for prediction operations.
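The per-block reconstruction flow just described can be sketched, purely for illustration, as follows; the 8-bit clipping range and the identity stand-in for the loop filter are assumptions.

```python
# Sketch of the reconstruction flow: the predicted representation P' is combined
# with the reconstructed prediction error D' to form a preliminary image I', which
# is filtered into the final reconstruction R' and stored in the reference frame memory.

def reconstruct_block(predicted, prediction_error, loop_filter, reference_frames):
    preliminary = [max(0, min(255, p + e))            # I' = P' + D', clipped to 8 bits
                   for p, e in zip(predicted, prediction_error)]
    final = loop_filter(preliminary)                  # R' = F(I')
    reference_frames.append(final)                    # kept for later prediction
    return final

rfm = []
identity_filter = lambda block: block                 # stand-in for deblocking/SAO
print(reconstruct_block([100, 120, 130], [5, -7, 200], identity_filter, rfm))
# [105, 113, 255]
```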
The prediction error decoder 702, 802 receives the output of the entropy decoder 700, 800. A dequantizer 702, 802 of the prediction error decoder 702, 802 may dequantize the output of the entropy decoder 700, 800 and the inverse transform block 703, 803 may perform an inverse transform operation to the dequantized signal output by the dequantizer 702, 802. The output of the entropy decoder 700, 800 may also indicate that the prediction error signal is not to be applied and in this case the prediction error decoder produces an all zero output signal.
It should be understood that for various blocks in Figure 5a inter-layer prediction may be applied, even if it is not illustrated in Figure 5a. Inter-layer prediction may include sample prediction and/or syntax/parameter prediction. For example, a reference picture from one decoder section (e.g. RFM 706) may be used for sample prediction of the other decoder section (e.g. block 807). In another example, syntax elements or parameters from one decoder section (e.g. filter parameters from block 708) may be used for syntax/parameter prediction of the other decoder section (e.g. block 808).
In some embodiments the views may be coded with a standard other than H.264/AVC or HEVC.
Fig. 1 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an exemplary apparatus or electronic device 50, which may incorporate a codec according to an embodiment of the invention.
Fig. 2 shows a layout of an apparatus according to an example embodiment. The elements of Figs. 1 and 2 will be explained next.
The electronic device 50 may for example be a mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require encoding and decoding or encoding or decoding video images.
The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as a solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. In some embodiments the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The controller 56 may be connected to memory 58 which in embodiments of the invention may store both data in the form of image and audio data and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC and UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
In some embodiments of the invention, the apparatus 50 comprises a camera capable of recording or detecting individual frames which are then passed to the codec 54 or controller for processing. In some embodiments of the invention, the apparatus may receive the video image data for processing from another device prior to transmission and/or storage. In some embodiments of the invention, the apparatus 50 may receive either wirelessly or by a wired connection the image for coding/decoding.
Fig. 3 shows an arrangement for video coding comprising a plurality of apparatuses, networks and network elements according to an example embodiment. With respect to Figure 3, an example of a system within which embodiments of the present invention can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to, a wireless cellular telephone network (such as a GSM, UMTS, CDMA network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
The system 10 may include both wired and wireless communication devices or apparatus 50 suitable for implementing embodiments of the invention. For example, the system shown in Figure 3 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, and a notebook computer 22. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
Some or further apparatuses may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.
The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP/IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11 and any similar wireless communication technology. A communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.
In the above, some embodiments have been described in relation to particular types of parameter sets. It needs to be understood, however, that embodiments could be realized with any type of parameter set or other syntax structure in the bitstream.
In the above, some embodiments have been described in relation to encoding indications, syntax elements, and/or syntax structures into a bitstream or into a coded video sequence and/or decoding indications, syntax elements, and/or syntax structures from a bitstream or from a coded video sequence. It needs to be understood, however, that embodiments could be realized when encoding indications, syntax elements, and/or syntax structures into a syntax structure or a data unit that is external from a bitstream or a coded video sequence comprising video coding layer data, such as coded slices, and/or decoding indications, syntax elements, and/or syntax structures from a syntax structure or a data unit that is external from a bitstream or a coded video sequence comprising video coding layer data, such as coded slices. For example, in some embodiments, an indication according to any embodiment above may be coded into a video parameter set or a sequence parameter set, which is conveyed externally from a coded video sequence for example using a control protocol, such as SDP. Continuing the same example, a receiver may obtain the video parameter set or the sequence parameter set, for example using the control protocol, and provide the video parameter set or the sequence parameter set for decoding.
In the above, the example embodiments have been described with the help of syntax of the bitstream. It needs to be understood, however, that the corresponding structure and/or computer program may reside at the encoder for generating the bitstream and/or at the decoder for decoding the bitstream. Likewise, where the example embodiments have been described with reference to an encoder, it needs to be understood that the resulting bitstream and the decoder have corresponding elements in them. Likewise, where the example embodiments have been described with reference to a decoder, it needs to be understood that the encoder has structure and/or computer program for generating the bitstream to be decoded by the decoder.
In the above, some embodiments have been described with reference to a depth view component. It needs to be understood that the term depth picture could have been used instead of the term depth view component. Likewise, in the above, some embodiments have been described with reference to a depth picture. It needs to be understood that the term depth view component could have been used instead of the term depth picture.
In the above, some embodiments have been described with reference to an enhancement view and a base view. It needs to be understood that the base view may as well be any other view as long as it is a reference view for the enhancement view. It also needs to be understood that the term enhancement view may indicate any non-base view and need not indicate an enhancement of picture or video quality of the enhancement view when compared to the picture/video quality of the base/reference view. It also needs to be understood that the encoder may generate more than two views into a bitstream and the decoder may decode more than two views from the bitstream. Embodiments could be realized with any pair of an enhancement view and its reference view. Likewise, many embodiments could be realized with consideration of more than two views.
In the above, some embodiments have been described with reference to an enhancement layer and a reference layer, where the reference layer may be for example a base layer.
In the above, some embodiments have been described with reference to an enhancement view and a reference view, where the reference view may be for example a base view. In the above, some embodiments have been described with reference to motion information prediction. It needs to be understood that embodiments could be realized by applying motion information inheritance rather than motion information prediction.
In the above, some embodiments have been described with reference to a block or blocks, where the blocks may be selected in various ways. For example, the block may be a unit for motion prediction, i.e. a block that has its own motion information associated with it, such as a prediction unit (PU) in HEVC. In another example, the block may be a unit for storing motion information for a decoded reference picture.
Embodiments may be realized with different selection of the unit for a block. Moreover, within an embodiment a different selection of the unit for a block may be applied for different blocks that the embodiment refers to.
In the above, some embodiments have been described for multiview video coding. It needs to be understood that embodiments may similarly be applicable to other types of layered coding, for example for quality scalability and for multiview video plus depth coding.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer, with one example of a computer described and depicted in Figures 1 and 2.
A computer-readable medium may comprise a computer-readable storage medium that may be any media or means that can contain or store the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above described functions may be optional or may be combined.
Although the above examples describe embodiments of the invention operating within a codec within an electronic device, it would be appreciated that the invention as described below may be implemented as part of any video codec. Thus, for example, embodiments of the invention may be implemented in a video codec which may implement video coding over fixed or wired communication paths.
Thus, user equipment may comprise a video codec such as those described in embodiments of the invention above. It shall be appreciated that the term user equipment is intended to cover any suitable type of wireless user equipment, such as mobile telephones, portable data processing devices or portable web browsers.
Furthermore elements of a public land mobile network (PLMN) may also comprise video codecs as described above.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatuses, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as for example DVDs and the data variants thereof, and CDs.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a terminal device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the terminal device to carry out the features of an embodiment. Yet further, a network device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys Inc., of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well-established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication. The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention.
However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims (27)

  1. Claims: 1. A method comprising: determining ranging information for a set of views; determining a unit associated with the set of views; determining the validity of the ranging information for the unit; and generating at least one indicator based on the validity of the ranging information.
  2. 2. The method as claimed in claim 1 wherein, determining the unit associated with the set of views comprises determining at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
  3. 3. The method as claimed in claims 1 and 2, wherein determining the validity of the ranging information for the unit further comprises at least one of: determining the ranging information for the associated unit is not output from a decoding process; determining the ranging information for the associated unit is marked unavailable or invalid; determining a hole filling process is to be employed for the ranging information for the associated unit; determining the ranging information for the associated unit is not to be used for depth-image based rendering; determining the ranging information for the associated unit can be replaced with an output of a depth estimation algorithm from decoded texture views in a decoder section; determining the ranging information for the associated unit is unavailable or invalid; determining the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; determining the first portion of the views for which ranging information for the associated unit is partially unavailable or invalid; determining the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; determining the second portion of the views for which ranging information for the associated unit is partially available or valid; determining the ranging information for the associated unit is unavailable or invalid for a first at least one processing operation; determining the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; determining the ranging information for the associated unit is available or valid for a second at least one processing operation; determining the second at least one processing operation for which the ranging information for the associated unit is available or valid; and determining the quality of the ranging information for the associated unit is below a determined threshold.
  4. 4. The method as claimed in claims 1 to 3, wherein generating at least one indicator based on the validity of the ranging information comprises at least one of: generating at least one indicator based on the validity of the ranging information within a sequence parameter set; generating at least one indicator based on the validity of the ranging information within a picture parameter set; generating at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and generating at least one indicator based on the validity of the ranging information within a slice header.
  5. 5. The method as claimed in claim 4, wherein generating at least one indicator based on the validity of the ranging information within a sequence parameter set comprises generating a sequence parameter set syntax structure comprising a depth output flag, wherein the depth output flag is the at least one indicator.
  6. 6. The method as claimed in claims 1 to 5, wherein generating at least one indicator based on the validity of the ranging information comprises at least one of: generating the at least one indicator based on a scope of the validity of the ranging information; and generating the at least one indicator based on a scope of the persistence of the ranging information.
  7. 7. The method as claimed in claims 1 to 6, wherein generating at least one indicator based on the validity of the ranging information comprises generating a map of indicators defining which ranging information areas are available and which areas are not.
  8. 8. The method as claimed in claim 7, wherein generating a map of indicators comprises generating a quadtree structure map of indicators.
  9. 9. The method as claimed in claims 1 to 8, further comprising controlling the encoding of the ranging information based on the validity of the ranging information.
  10. 10. The method as claimed in claim 9, wherein generating at least one indicator based on the validity of the ranging information comprises at least one indicator associated with controlling the encoding of the ranging information based on the validity of the ranging information.
  11. 11. A method comprising: receiving a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and controlling a utilization of ranging information based on the at least one indicator.
  12. 12. A method as claimed in claim 11, wherein the signal comprises the ranging information for a set of views and the method further comprises: determining or obtaining a unit associated with the set of views; determining the ranging information validity for the unit based on the at least one indicator.
  13. 13. A method as claimed in claim 12, wherein said controlling the utilization of ranging information comprises at least one of: omitting the output of the ranging information for the unit from a decoding process; outputting the ranging information for the unit from the decoding process; omitting a first at least one processing operation using the ranging information for the unit; and performing a second at least one processing operation using the ranging information for the unit.
  14. 14. The method as claimed in claims 12 and 13, wherein the ranging information for a set of views comprises invalid or unavailable ranging information, and wherein the method further comprises replacing the invalid or unavailable ranging information based on the at least one indicator.
  15. 15. The method as claimed in claim 14, wherein the at least one indicator comprises at least one of: at least one indicator that the ranging information for the associated unit is partially unavailable or invalid with respect to a first portion of the views; at least one indicator of the first portion for which ranging information for the associated unit is partially unavailable or invalid; at least one indicator that the ranging information for the associated unit is partially available or valid with respect to a second portion of the views; and at least one indicator of the second portion for which ranging information for the associated unit is partially available or valid.
  16. 16. The method as claimed in claims 12 to 15, wherein determining or obtaining the unit associated with the set of views comprises determining at least one of: a subset of views as the unit; a set of pictures within each view in the set of views as the unit; and a spatial region within at least one picture within the set of pictures as the unit.
  17. 17. The method as claimed in claims 11 to 16, wherein controlling a utilization of ranging information based on the at least one indicator comprises performing at least one processing operation on the at least partial ranging information for a set of views for at least one processing operation based on the at least one indicator, wherein the at least one indicator comprises at least one of: at least one indicator that the ranging information for the associated unit is unavailable or invalid for at least one processing operation; at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator that the ranging information for the associated unit is available or valid for at least one processing operation; and at least one indicator indicating the at least one processing operation for which the ranging information for the associated unit is available or valid.
  18. 18. The method as claimed in claims 11 to 17, wherein the at least one indicator comprises at least one of: at least one indicator indicating the ranging information is not output from a decoding process; at least one indicator indicating the ranging information is marked unavailable or invalid; at least one indicator indicating a hole filling process is to be employed for the ranging information; at least one indicator indicating the ranging information is not to be used for depth-image based rendering; at least one indicator indicating the ranging information can be replaced with an output of a depth estimation algorithm from decoded texture views; at least one indicator indicating the ranging information is unavailable or invalid; at least one indicator indicating the ranging information is partially unavailable or invalid with respect to a first portion of the views; at least one indicator indicating the first portion for which ranging information is partially unavailable or invalid; at least one indicator indicating the ranging information is partially available or valid with respect to a second portion of the views; at least one indicator indicating the second portion for which ranging information for the associated unit is partially available or valid; at least one indicator indicating the ranging information is unavailable or invalid for a first at least one processing operation; at least one indicator indicating the first at least one processing operation for which the ranging information for the associated unit is unavailable or invalid; at least one indicator indicating the ranging information is available or valid for a second at least one processing operation; at least one indicator indicating the second at least one processing operation for which the ranging information for the associated unit is available or valid; and at least one indicator indicating the quality of the ranging information is below a determined threshold.
  19. 19. The method as claimed in claims 11 to 18, wherein determining a signal comprising at least one indicator comprises determining at least one of: determining at least one indicator based on the validity of the ranging information within a sequence parameter set; determining at least one indicator based on the validity of the ranging information within a picture parameter set; determining at least one indicator based on the validity of the ranging information within a supplemental enhancement information (SEI) message; and determining at least one indicator based on the validity of the ranging information within a slice header.
  20. 20. The method as claimed in claims 11 to 19, wherein controlling a utilization of ranging information based on the at least one indicator comprises at least one of: controlling a utilization of ranging information based on a scope of the validity of the ranging information; and controlling a utilization of ranging information based on a scope of the persistence of the ranging information.
  21. 21. The method as claimed in claims 11 to 20, wherein determining a signal comprising at least one indicator comprises determining a map of indicators defining which ranging information areas are available and which areas are not.
  22. 22. An apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to: determine ranging information for a set of views; determine a unit associated with the set of views; determine the validity of the ranging information for the unit; and generate at least one indicator based on the validity of the ranging information.
  23. 23. An apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to: receive a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and control a utilization of ranging information based on the at least one indicator.
  24. 24. An apparatus comprising: means for determining ranging information for a set of views; means for determining a unit associated with the set of views; means for determining the validity of the ranging information for the unit; and means for generating at least one indicator based on the validity of the ranging information.
  25. 25. An apparatus comprising: means for receiving a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and means for controlling a utilization of ranging information based on the at least one indicator.
  26. 26. An apparatus comprising: a depth data generator configured to determine ranging information for a set of views; a unit determiner configured to determine a unit associated with the set of views; a depth data validity determiner configured to determine the validity of the ranging information for the unit; and an indicator inserter configured to generate at least one indicator based on the validity of the ranging information.
  27. 27. An apparatus comprising: an indicator detector configured to receive a signal comprising at least one indicator wherein the at least one indicator is based on ranging information validity; and a decoder controller configured to control a utilization of ranging information based on the at least one indicator.
GB1312320.3A 2013-07-09 2013-07-09 Method and apparatus for video coding and decoding Withdrawn GB2516223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1312320.3A GB2516223A (en) 2013-07-09 2013-07-09 Method and apparatus for video coding and decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1312320.3A GB2516223A (en) 2013-07-09 2013-07-09 Method and apparatus for video coding and decoding

Publications (2)

Publication Number Publication Date
GB201312320D0 GB201312320D0 (en) 2013-08-21
GB2516223A true GB2516223A (en) 2015-01-21

Family

ID=49033568

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1312320.3A Withdrawn GB2516223A (en) 2013-07-09 2013-07-09 Method and apparatus for video coding and decoding

Country Status (1)

Country Link
GB (1) GB2516223A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2701961C2 (en) * 2015-04-17 2019-10-02 Квэлкомм Инкорпорейтед Dynamic range adjustment for video encoding with extended dynamic range and wide colour gamma

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2348732A2 (en) * 2008-11-10 2011-07-27 LG Electronics Inc. Method and device for processing a video signal using inter-view prediction
WO2013074162A1 (en) * 2011-11-18 2013-05-23 Qualcomm Incorporated Signaling depth ranges for three-dimensional video coding
EP2688304A1 (en) * 2012-03-01 2014-01-22 Sony Corporation Transmitter, transmission method and receiver


Also Published As

Publication number Publication date
GB201312320D0 (en) 2013-08-21

Similar Documents

Publication Publication Date Title
US10904543B2 (en) Method and apparatus for video coding and decoding
AU2017204114B2 (en) Method and apparatus for video coding
US10397610B2 (en) Method and apparatus for video coding
KR101664758B1 (en) Method and apparatus for video coding
US10863170B2 (en) Apparatus, a method and a computer program for video coding and decoding on the basis of a motion vector
US20140218473A1 (en) Method and apparatus for video coding and decoding
EP2904797B1 (en) Method and apparatus for scalable video coding
CA2870067C (en) Video coding and decoding using multiple parameter sets which are identified in video unit headers
US20140301463A1 (en) Method and apparatus for video coding and decoding
US20150245063A1 (en) Method and apparatus for video coding
CN105027569B (en) Apparatus and method for video encoding and decoding
US20140098883A1 (en) Method and apparatus for video coding
US20130287093A1 (en) Method and apparatus for video coding
US20140085415A1 (en) Method and apparatus for video coding
EP3092806A1 (en) Method and apparatus for video coding and decoding
GB2516223A (en) Method and apparatus for video coding and decoding

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)