WO2017158236A2 - A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video - Google Patents

A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video

Info

Publication number
WO2017158236A2
WO2017158236A2 (PCT/FI2017/050167)
Authority
WO
WIPO (PCT)
Prior art keywords
picture
effective
boundary
picture area
rectangular
Prior art date
Application number
PCT/FI2017/050167
Other languages
French (fr)
Other versions
WO2017158236A3 (en)
Inventor
Miska Hannuksela
Alireza Aminlou
Ramin GHAZNAVI YOUVALARI
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2017158236A2
Publication of WO2017158236A3

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/85 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N 19/88 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving rearrangement of data among different coding units, e.g. shuffling, interleaving, scrambling or permutation of pixel data or permutation of transform coefficient data among different blocks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/132 - Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/134 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N 19/167 - Position within a video image, e.g. region of interest [ROI]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N 19/176 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/182 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 - Control of cameras or camera modules
    • H04N 23/698 - Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture

Definitions

  • the present embodiments relate to coding of 360-degree panoramic images and video.
  • BACKGROUND: This section is intended to provide a background or context to the invention that is recited in the claims.
  • the description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
  • 360-degree panoramic images and video cover horizontally the full 360-degree field-of-view around the capturing position.
  • 360-degree panoramic video content can be acquired e.g. by stitching pictures of more than one camera sensor to a single 360-degree panoramic image.
  • A single image sensor can be used with an optical arrangement to generate a 360-degree panoramic image.
  • Some embodiments provide a method and an apparatus for implementing the method for encoding and decoding 360-degree panoramic images and video.
  • A method comprising: determining an effective picture area representing a 360-degree panorama picture, the effective picture area being non-rectangular; obtaining a mapped 360-degree panorama picture covering the effective picture area; obtaining a reshaped picture from the mapped 360-degree panorama picture by applying a first processing to at least one boundary block containing a boundary of the effective picture area, the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following: removing zero or more samples of the boundary block and within the effective picture area; setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; applying a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; and encoding the rectangular picture into a bitstream.
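For illustration only, the following is a minimal sketch of the encoder-side first processing described in the preceding item, under simplifying assumptions that are not part of the claim: the non-rectangular effective picture area is given as a per-row horizontal extent (as for a pseudo-cylindrical panorama), the block grid is 8x8, the picture width is a multiple of the block size, and only the "copying from an opposite-side boundary region" option is shown. The function and variable names (`reshape_for_coding`, `row_extents`) are illustrative.

```python
import numpy as np

BLOCK = 8  # assumed block size for block alignment

def reshape_for_coding(picture, row_extents):
    """Sketch: make the non-rectangular effective area block-aligned on each row.

    picture:     H x W (or H x W x C) sample array; samples outside the
                 effective area may hold arbitrary values.
    row_extents: per-row (left, right) columns bounding the effective area.

    For every boundary block crossed by the effective-area boundary, samples
    outside the effective area are filled by copying sample values from the
    opposite-side boundary region, so that the area becomes block-aligned.
    """
    out = picture.copy()
    for y in range(picture.shape[0]):
        left, right = row_extents[y]
        # Left boundary block: align down to the block grid and fill the gap
        # with samples copied from just inside the right-side boundary.
        aligned_left = (left // BLOCK) * BLOCK
        gap = left - aligned_left
        if gap:
            out[y, aligned_left:left] = picture[y, right - gap:right]
        # Right boundary block: align up to the block grid and fill the gap
        # with samples copied from just inside the left-side boundary.
        aligned_right = -(-right // BLOCK) * BLOCK  # ceiling to the block grid
        gap = aligned_right - right
        if gap:
            out[y, right:aligned_right] = picture[y, left:left + gap]
    return out
```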
  • the effective picture area is determined by a mapping applied to a source picture.
  • said source picture is an equirectangular 360-degree picture.
  • said source picture is a spherical picture.
  • said mapping is a pseudo-cylindrical mapping. According to an embodiment, said mapping is specified as a mathematical function.
  • the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
  • the method further comprises encoding one or more indications into the bitstream, the one or more indications indicative of one or more of the following: the second processing; the first processing; the effective picture area; the encoded rectangular picture representing a 360-degree panorama picture.
  • said second processing comprises: identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture; setting sample values of samples adjacent to the second boundary block and outside the block-aligned non-rectangular picture in one of the following ways: extrapolating boundary sample values of the second boundary block; and deriving sample values at least partially from the second boundary block.
  • said extrapolating matches with an intra prediction.
  • the intra prediction is a horizontal intra prediction.
  • A method comprising: determining a first effective picture area representing a 360-degree panorama picture, the first effective picture area being non-rectangular; determining a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decoding a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtaining a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using a first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using the first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
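Correspondingly, a minimal sketch of the decoder-side step that restores the non-rectangular panorama from the decoded rectangular picture by setting samples outside the first effective picture area to a determined value (black here). As in the encoder sketch above, the per-row `row_extents` description of the effective area and the function name are assumptions made for illustration.

```python
import numpy as np

def restore_panorama(decoded, row_extents, determined_value=0):
    """Sketch: keep decoded samples inside the (non-rectangular) first
    effective picture area and set samples outside it to a determined value."""
    out = np.full_like(decoded, determined_value)
    for y, (left, right) in enumerate(row_extents):
        out[y, left:right] = decoded[y, left:right]
    return out
```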
  • the determined value is a value representing black.
  • the first effective picture area is determined by a mapping applied to a source picture.
  • the source picture is an equirectangular 360-degree picture.
  • the source picture is a spherical picture.
  • the mapping is a pseudo-cylindrical mapping.
  • said mapping is specified as a mathematical function.
  • the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
  • the method further comprises decoding one or more indications from the bitstream, the one or more indications indicative of one or more of the following: the first processing; the first effective picture area; the second effective picture area; the rectangular picture representing a 360-degree panorama picture.
  • An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine an effective picture area representing a 360-degree panorama picture, the effective picture area being non-rectangular; obtain a mapped 360-degree panorama picture covering the effective picture area; obtain a reshaped picture from the mapped 360-degree panorama picture by applying a first processing to at least one boundary block containing a boundary of the effective picture area, the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following: removing zero or more samples of the boundary block and within the effective picture area; setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; apply a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; and encode the rectangular picture into a bitstream.
  • the effective picture area is determined by a mapping applied to a source picture.
  • said source picture is an equirectangular 360-degree picture.
  • said source picture is a spherical picture.
  • said mapping is a pseudo-cylindrical mapping.
  • said mapping is specified as a mathematical function.
  • the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
  • the apparatus further comprises computer program code to cause the apparatus to encode one or more indications into the bitstream, the one or more indications indicative of one or more of the following: the second processing; the first processing; the effective picture area; the encoded rectangular picture representing a 360-degree panorama picture.
  • said second processing comprises: identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture; setting sample values of samples adjacent to the second boundary block and outside the block-aligned non-rectangular picture in one of the following ways: extrapolating boundary sample values of the second boundary block; and deriving sample values at least partially from the second boundary block.
  • said extrapolating matches with an intra prediction.
  • the intra prediction is a horizontal intra prediction.
  • An apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine a first effective picture area representing a 360-degree panorama picture, the first effective picture area being non-rectangular; determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using a first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using the first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
  • the determined value is a value representing black color.
  • the first effective picture area is determined by a mapping applied to a source picture.
  • the source picture is an equirectangular 360-degree picture.
  • the source picture is a spherical picture.
  • the mapping is a pseudo-cylindrical mapping. According to an embodiment, said mapping is specified as a mathematical function.
  • the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
  • the apparatus further comprises computer program code to cause the apparatus to decode one or more indications from the bitstream, the one or more indications indicative of one or more of the following: the first processing; the first effective picture area; the second effective picture area; the rectangular picture representing a 360-degree panorama picture.
  • A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: determine an effective picture area representing a 360-degree panorama picture, the effective picture area being non-rectangular; obtain a mapped 360-degree panorama picture covering the effective picture area; obtain a reshaped picture from the mapped 360-degree panorama picture by applying a first processing to at least one boundary block containing a boundary of the effective picture area, the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following: removing zero or more samples of the boundary block and within the effective picture area; setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; apply a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; and encode the rectangular picture into a bitstream.
  • A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: determine a first effective picture area representing a 360-degree panorama picture, the first effective picture area being non-rectangular; determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using a first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using the first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
  • Figure 1 illustrates a block diagram of a video coding system according to an embodiment
  • Figure 2 illustrates a layout of an apparatus according to an embodiment
  • Figure 3 illustrates an arrangement for video coding comprising a plurality of apparatuses, networks and network elements
  • Figure 4 illustrates a block diagram of a video encoder according to an embodiment
  • Figure 5 illustrates a block diagram of a video decoder according to an embodiment
  • Figure 6 illustrates an example of a pseudo-cylindrical spherical image on a rectangular block grid
  • Figure 7 illustrates an example of selecting a boundary block for boundary block removal
  • Figure 8 illustrates an example of a reshaped picture resulting from the removal of the boundary block of Figure 7;
  • Figure 9 illustrates an example of a reshaped picture resulting from the copying of the right side boundary area to fill the left-side boundary block
  • Figure 10 illustrates an example of a reshaped picture resulting from the moving of the right-side boundary area to fill in a left-side boundary block
  • Figure 11 illustrates an example of a reshaped picture resulting from the moving of the right side boundary area so that it becomes block aligned
  • Figure 12 is a flowchart illustrating an encoding method according to an embodiment.
  • Figure 13 is a flowchart illustrating a decoding method according to another embodiment.
  • The present application relates to 360-degree panoramic video content, the amount of which is rapidly increasing due to dedicated devices and software for capturing and/or creating such content.
  • An embodiment of an apparatus for capturing and/or creating 360-degree panoramic video content is illustrated in Figures 1 and 2.
  • The apparatus 50 is an electronic device, for example a mobile terminal, a user equipment of a wireless communication system, or a camera device.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 further may comprise a display 32, for example, a liquid crystal display or any other display technology capable of displaying images and/or videos.
  • the apparatus 50 may further comprise a keypad 34. According to another embodiment, any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device, which may be any of the following: an earpiece 38, a speaker or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery (according to another embodiment, the device may be powered by any suitable mobile energy device, such as solar cell, fuel cell or clockwork generator).
  • the apparatus may comprise a camera 42 capable of recording or capturing images and/or video, or may be connected to one.
  • the camera 42 may be capable of capturing a 360-degree field-of-view horizontally and/or vertically for example by using a parabolic mirror arrangement with a conventional two-dimensional color image sensor or by using several wide field-of-view lenses and/or several color image sensors.
  • the camera 42 or the camera to which the apparatus is connected may in essence comprise several cameras.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired solution.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus.
  • the controller 56 may be connected to memory 58 which, according to an embodiment, may store both data in the form of image and audio data and/or may also store instructions for implementation on the controller 56.
  • the controller 56 may further be connected to video codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in encoding and/or decoding carried out by the controller 56.
  • a video codec circuitry 54 may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that is able to decompress the compressed video representation back into a viewable form.
  • the encoder may discard some information in the original video sequence in order to represent the video in more compact form (i.e. at lower bitrate).
  • Figure 4 illustrates an example of a video encoder, where In: Image to be encoded; P'n: Predicted representation of an image block; Dn: Prediction error signal; D'n: Reconstructed prediction error signal; I'n: Preliminary reconstructed image; R'n: Final reconstructed image; T, T^-1: Transform and inverse transform; Q, Q^-1: Quantization and inverse quantization; E: Entropy encoding; RFM: Reference frame memory; Pinter: Inter prediction; Pintra: Intra prediction; MS: Mode selection; F: Filtering.
  • Figure 5 illustrates a block diagram of a video decoder, where P'n: Predicted representation of an image block; D'n: Reconstructed prediction error signal; I'n: Preliminary reconstructed image; R'n: Final reconstructed image; T^-1: Inverse transform; Q^-1: Inverse quantization; E^-1: Entropy decoding; RFM: Reference frame memory; P: Prediction (either inter or intra); F: Filtering.
  • In some embodiments the apparatus 50 (Figures 1 and 2) comprises only an encoder or a decoder; in some other embodiments the apparatus 50 comprises both.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus 50 comprises a camera 42 capable of recording or detecting individual frames which are then passed to the codec 54 or controller for processing.
  • the apparatus may receive the video image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may receive the images for processing either wirelessly or by a wired connection.
  • FIG. 3 shows a system configuration comprising a plurality of apparatuses, networks and network elements according to an embodiment.
  • the system 10 comprises multiple communication devices which can communicate through one or more networks.
  • the system 10 may comprise any combination of wired or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM, UMTS, CDMA network, etc.), a wireless local area network (WLAN), such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the internet.
  • the system 10 may include both wired and wireless communication devices or apparatus 50 suitable for implementing present embodiments.
  • The system shown in Figure 3 comprises a mobile telephone network 11 and a representation of the internet 28.
  • Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
  • the example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a digital camera 12.
  • the apparatus 50 may be stationary or mobile when carried by an individual who is moving.
  • the apparatus 50 may also be located in a mode of transport.
  • Some of the apparatuses may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24.
  • the base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28.
  • the system may include additional communication devices and communication devices of various types.
  • the communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telephone system (UMTS), time division multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11 and any similar wireless communication technology.
  • a communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections or any suitable connection.
  • the present embodiments relate to coding of pseudo-cylindrically projected spherical images.
  • VR Virtual Reality
  • FOV wide field-of-view
  • The most common devices for displaying VR images/videos are Head Mounted Displays (HMDs).
  • An HMD device may be considered to consist of a binocular display in front of the user's eyes.
  • A motion sensor is embedded in the device in order to provide a suitable field-of-view based on the user's head motion. In order to cover the whole scene, 360-degree field-of-view content can be used.
  • 360-degree panoramic content (i.e., images and video) covers horizontally the full 360-degree field-of-view around the capturing position, while the vertical field-of-view may vary and can be e.g. 180 degrees.
  • A panoramic image covering a 360-degree field-of-view horizontally and a 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection.
  • In this case, the horizontal coordinate may be considered equivalent to longitude and the vertical coordinate equivalent to latitude, with no transformation or scaling applied.
  • Panoramic content with a 360-degree horizontal field-of-view but with less than a 180-degree vertical field-of-view may be considered a special case of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane.
  • In some cases a panoramic image may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of the equirectangular projection format.
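As a small sketch of the equirectangular mapping described above, under assumed (illustrative) coordinate conventions: longitude in [-pi, pi], latitude in [-pi/2, pi/2], and a full panorama whose width is twice its height.

```python
import math

def equirect_to_pixel(lon, lat, width, height):
    """Map spherical coordinates to equirectangular image coordinates.

    The horizontal pixel coordinate is proportional to longitude and the
    vertical one to latitude, with no other transformation or scaling,
    matching the description above.
    """
    x = (lon + math.pi) / (2.0 * math.pi) * width
    y = (math.pi / 2.0 - lat) / math.pi * height
    return x, y

# Example: the center of a 4096 x 2048 panorama corresponds to lon = 0, lat = 0.
print(equirect_to_pixel(0.0, 0.0, 4096, 2048))  # (2048.0, 1024.0)
```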
  • 360-degree panoramic video content can be acquired by various means.
  • the pictures of more than one camera sensor can be stitched to a single 360-degree panoramic image.
  • a 360-degree panoramic video content can be acquired by a single image sensor with an optical arrangement.
  • 360-degree panoramic images can be captured by various means.
  • Common ways to produce 360-degree field-of-view images include stitching images from a multi-camera setup, using catadioptric devices, and stitching multiple fisheye images.
  • A catadioptric device consists of a conventional lens and a mirror.
  • the mirror can have different convex shapes, e.g. parabolic, planar or spherical.
  • The lens captures the light rays reflected by the mirror, which provides an image with a very wide field-of-view.
  • Another method for 360-degree content creation is to use fisheye lenses. Due to the very limited field-of-view of conventional cameras, a practical way to create 360-degree images is by means of cameras with a wide field-of-view. Fisheye lenses cover a very wide field-of-view (usually more than 180 degrees). These lenses capture light by refraction; light rays with larger angles of incidence (at the image edges) are bent more strongly in the image.
  • However, due to their capturing structure, fisheye lenses introduce some distortions to the acquired images. Straight lines near the image boundaries become curved, and this artifact makes the content less suitable for use in display devices such as Head Mounted Displays (HMDs).
  • A family of pseudo-cylindrical projections attempts to minimize the distortion of the polar regions of the cylindrical projections, such as the equirectangular projection, by bending the meridians toward the center of the map as a function of longitude while maintaining the cylindrical characteristic of parallel parallels.
  • Pseudo-cylindrical projections result in non-rectangular contiguous 2D images representing the projected sphere.
  • Pseudo-cylindrical projections also exist in interrupted forms, which are made by joining several regions with appropriate central meridians, false easting and clipping boundaries.
  • Pseudo-cylindrical projections may be categorized based on the shape of the meridians into sinusoidal, elliptical, parabolic, hyperbolic, rectilinear and miscellaneous pseudo-cylindrical projections. An additional characterization is based on whether the meridians come to a point at the pole or are terminated along a straight line (in which case the projection represents less than 180 degrees vertically). Certain pseudo-cylindrical projections result in an ellipse with a 2:1 aspect ratio, but generally pseudo-cylindrical projections may result in pictures of another non-rectangular shape. The benefits of pseudo-cylindrical projections over cylindrical projections include that they preserve the image content locally and avoid over-stretching of the polar areas. Moreover, images are represented by fewer pixels compared to respective cylindrically projected images (e.g. equirectangular panorama images), because the polar areas are not stretched.
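As one concrete member of the family just listed, chosen here purely for illustration (the embodiments are not limited to it), the sinusoidal pseudo-cylindrical projection keeps parallels as straight horizontal lines while bending the meridians toward the center, so the projected picture is non-rectangular and comes to a point at the poles.

```python
import math

def sinusoidal_projection(lon, lat):
    """Sinusoidal projection: one example of a pseudo-cylindrical projection."""
    x = lon * math.cos(lat)  # horizontal extent shrinks toward the poles
    y = lat
    return x, y

def effective_row_extent(lat, width):
    """Illustrative effective picture area of a sinusoidal panorama on a grid
    of the given width: the number of valid samples on the image row for
    latitude lat is proportional to cos(lat), centered horizontally."""
    row_width = int(round(width * math.cos(lat)))
    left = (width - row_width) // 2
    return left, left + row_width
```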
  • the present embodiments are designed to operate with pseudo-cylindrically projected spherical images.
  • the present embodiments can be applied also with other projections having the following characteristics: i) projected images are non-rectangular; and ii) a first boundary pixel on a first boundary of a projected image is, in the spherical domain, adjacent to a second boundary pixel at the opposite boundary of the projected image.
  • An example of another projection that can be used with the present embodiments is a stereographic projection of a sphere.
  • A physical model of the stereographic projection is to imagine a transparent sphere sitting on a plane. If the point at which the sphere touches the plane is called the south pole, then a light source can be placed at the north pole. Each ray from the light passes through a point on the sphere and then strikes the plane; this is the stereographic projection of that point on the sphere.
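A minimal sketch of the physical model just described, under the assumption that the sphere of the given radius rests on the plane z = 0 with its south pole at the origin and the light source at the north pole (0, 0, 2 * radius); the function name is illustrative.

```python
def stereographic_projection(x, y, z, radius=1.0):
    """Project a point (x, y, z) on the sphere onto the plane z = 0 by
    extending the ray from the north pole through the point until it hits
    the plane. Undefined at the north pole itself (z == 2 * radius)."""
    t = 2.0 * radius / (2.0 * radius - z)
    return t * x, t * y
```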
  • the present embodiments can also be applied to image(s) resulting from a combination of projections where the combination has the following characteristics: i) projected images are non-rectangular; and ii) a first boundary pixel on a first boundary of a projected image is, in the spherical domain, adjacent to a second boundary pixel at the opposite boundary of the projected image.
  • a pseudo-cylindrical spherical projection can be applied to the top and bottom parts of the picture and an equirectangular projection can be applied to the vertically middle part of the picture.
  • the present embodiments can be applied to picture(s) where the vertically top quarter is pseudo-cylindrically projected, the vertically bottom quarter is also pseudo-cylindrically projected, and the vertically middle part comprises an equirectangular panorama.
  • Video coding of related art is described next.
  • the H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC).
  • the H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC).
  • The H.265/HEVC standard, also known as High Efficiency Video Coding (HEVC), was developed by the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and MPEG.
  • The standard was published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC).
  • Version 2 of H.265/HEVC included scalable, multiview, and fidelity range extensions, which may be abbreviated SHVC, MV-HEVC, and REXT, respectively.
  • Version 2 of H.265/HEVC was published as ITU-T Recommendation H.265 (10/2014) and as Edition 2 of ISO/IEC 23008-2. There are currently ongoing standardization projects to develop further extensions to H.265/HEVC, including three-dimensional and screen content coding extensions, which may be abbreviated 3D-HEVC and SCC, respectively.
  • SHVC, MV-HEVC, and 3D-HEVC use a common basis specification, specified in Annex F of the version 2 of the HEVC standard.
  • This common basis comprises for example high-level syntax and semantics, e.g. specifying some of the characteristics of the layers of the bitstream, such as inter-layer dependencies, as well as decoding processes, such as reference picture list construction including inter-layer reference pictures and picture order count derivation for multi-layer bitstreams.
  • Annex F may also be used in potential subsequent multi-layer extensions of HEVC.
  • Although a video encoder, a video decoder, encoding methods, decoding methods, bitstream structures, and/or embodiments may be described in the following with reference to specific extensions, such as SHVC and/or MV-HEVC, they are generally applicable to any multi-layer extensions of HEVC, and even more generally to any multi-layer video coding scheme.
  • Hybrid video codecs, including H.264/AVC and HEVC, encode video information in two phases.
  • predictive coding is applied for example as so-called sample prediction and/or so-called syntax prediction.
  • pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:
  • Motion compensation mechanisms (which may also be referred to as temporal prediction or motion-compensated temporal prediction or motion-compensated prediction or MCP or inter prediction), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded.
  • Intra prediction, where pixel or sample values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship. More generally, intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
  • motion information is indicated by motion vectors associated with each motion compensated image block.
  • Each of these motion vectors represents the displacement between the image block in the picture to be coded (on the encoder side) or decoded (on the decoder side) and the prediction source block in one of the previously coded or decoded pictures.
  • H.264/AVC and HEVC as many other video compression standards, divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.
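For illustration, a minimal sketch of this block-based inter prediction, assuming an integer-sample motion vector and ignoring picture-boundary handling; the function and variable names are illustrative only.

```python
import numpy as np

def motion_compensated_prediction(reference, block_x, block_y, block_size, mv_x, mv_y):
    """The prediction for the block at (block_x, block_y) is the block of the
    same size in the reference picture, displaced by the motion vector
    (mv_x, mv_y) relative to the block being coded."""
    src_x = block_x + mv_x
    src_y = block_y + mv_y
    return reference[src_y:src_y + block_size, src_x:src_x + block_size].copy()

def prediction_error(current_block, predicted_block):
    # The difference that is subsequently transformed, quantized and entropy
    # coded in the second coding phase described later.
    return current_block.astype(np.int32) - predicted_block.astype(np.int32)
```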
  • One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients.
  • Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters.
  • a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded.
  • Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
  • syntax prediction which may also be referred to as parameter prediction
  • syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier.
  • Examples of syntax prediction are provided below:
  • Motion vectors, e.g. for inter and/or inter-view prediction, may be coded differentially with respect to a block-specific predicted motion vector.
  • the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
  • Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
  • the reference index of previously coded/decoded picture can be predicted.
  • the reference index may be predicted from adjacent blocks and/or co-located blocks in temporal reference picture. Differential coding of motion vectors may be disabled across slice boundaries.
  • the block partitioning e.g. from CTU to CUs and down to PUs, may be predicted.
  • filtering parameters e.g. for sample adaptive offset may be predicted.
  • Inter prediction may sometimes be considered to only include motion-compensated temporal prediction, while it may sometimes be considered to include all types of prediction where a reconstructed/decoded block of samples is used as prediction source, therefore including conventional inter-view prediction for example.
  • Inter prediction may be considered to comprise only sample prediction but it may alternatively be considered to comprise both sample and syntax prediction.
  • Using syntax and sample prediction, a predicted block of pixels or samples may be obtained.
  • Prediction approaches using image information within the same image can also be called intra prediction methods.
  • the second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples. This may be accomplished by transforming the difference in pixel or sample values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized and entropy coded. In some coding schemes, an encoder can indicate, e.g. on transform unit basis, to bypass the transform and code a prediction error block in the sample domain.
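A minimal sketch of this second phase under simplifying assumptions: a plain orthonormal 2D DCT-II and uniform scalar quantization with a single step size, with entropy coding omitted. This is an illustration of the principle, not the exact transform or quantizer of any particular standard.

```python
import numpy as np

def dct2_matrix(n):
    # Orthonormal DCT-II basis matrix, as commonly used in block transforms.
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def transform_and_quantize(residual_block, qstep):
    """Forward 2D DCT of the prediction-error block, then uniform quantization."""
    c = dct2_matrix(residual_block.shape[0])
    coeffs = c @ residual_block @ c.T
    return np.round(coeffs / qstep).astype(int)

def dequantize_and_inverse_transform(levels, qstep):
    """Decoder side: scale the levels back and apply the inverse 2D DCT."""
    c = dct2_matrix(levels.shape[0])
    coeffs = levels * qstep
    return c.T @ coeffs @ c
```

A larger quantization step size (qstep) discards more of the difference signal, which is how the encoder trades visual quality against bitrate, as discussed below.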
  • the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
  • the decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
  • the decoder After applying pixel or sample prediction and error decoding processes the decoder combines the prediction and the prediction error signals (the pixel or sample values) to form the output video frame.
  • the decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming pictures in the video sequence.
  • the filtering may for example include one or more of the following: deblocking, sample adaptive offset (SAO), and/or adaptive loop filtering (ALF).
  • Block-based coding may create visible discontinuities at block boundaries of reconstructed or decoded pictures. Filtering on a boundary of a grid (e.g. a grid of 4x4 luma samples) is determined by an encoder and/or a decoder to be applied when a pre-defined (e.g. in a coding standard) and/or signaled set of conditions is fulfilled, such as the following:
  • the boundary is a block boundary, such as a prediction unit boundary or a transform unit boundary e.g. as specified for HEVC;
  • the boundary strength (see below) is significant or relevant, e.g. greater than zero; and the variation of sample values on both sides of the boundary is below a specified threshold, wherein the threshold value may depend on e.g. a quantization parameter used in transform coding.
  • the boundary strength to be used in deblocking loop filtering can be determined based on several conditions and rules, such as one or more of the following:
  • when at least one of the blocks adjacent to the boundary is intra-coded, the boundary strength can be set to be significant, such as 2;
  • when the absolute differences between motion vectors of the two blocks adjacent to the boundary are greater than or equal to 1 in units of integer luma samples, the boundary strength can be set to be relevant, such as 1;
  • when certain other pre-defined conditions hold, for example when the blocks adjacent to the boundary contain non-zero transform coefficients or use different reference pictures, the boundary strength can be set to be relevant, such as 1;
  • otherwise, the boundary strength can be set to be insignificant (or irrelevant), such as 0.
  • the deblocking loop filter may include multiple filtering modes or strengths, which may be adaptively selected based on the features of the blocks adjacent to the boundary, such as the quantization parameter value, and/or signaling included by the encoder in the bitstream.
  • the deblocking loop filter may comprise a normal filtering mode and a strong filtering mode, which may differ in terms of the number of filter taps (i.e. number of samples being filtered on both sides of the boundary) and/or the filter tap values. For example, filtering of two samples along both sides of the boundary may be performed with a filter having the impulse response of (3 7 9 -3)/16, when omitting the potential impact of a clipping operation.
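As a sketch, one way to apply the (3 7 9 -3)/16 impulse response mentioned above is to the samples nearest to a block boundary. Clipping and the boundary strength/mode decisions are omitted, so this is illustrative rather than an exact filter specification of any standard.

```python
def deblock_boundary(p1, p0, q0, q1):
    """Filter the samples p0 and q0 nearest to a vertical block boundary,
    given their neighbors p1 and q1, with the (3 7 9 -3)/16 impulse
    response (applied symmetrically to both sides of the boundary)."""
    new_p0 = (3 * p1 + 7 * p0 + 9 * q0 - 3 * q1) / 16.0
    new_q0 = (3 * q1 + 7 * q0 + 9 * p0 - 3 * p1) / 16.0
    return new_p0, new_q0
```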
  • An example of SAO is given next with reference to HEVC; however, SAO can be similarly applied to other coding schemes too.
  • In SAO, a picture is divided into regions where a separate SAO decision is made for each region.
  • In HEVC, the basic unit for adapting SAO parameters is the CTU (therefore an SAO region is the block covered by the corresponding CTU).
  • motion vectors may be of quarter-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter.
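A sketch of such fractional-sample interpolation with a FIR filter, here using the 6-tap half-sample filter (1, -5, 20, 20, -5, 1)/32 of H.264/AVC as the example; boundary handling and integer rounding are omitted, and the function names are illustrative.

```python
import numpy as np

HALF_PEL_FILTER = np.array([1, -5, 20, 20, -5, 1]) / 32.0

def interpolate_half_sample(row, x):
    """Value halfway between integer positions x and x + 1, as a weighted sum
    of the six surrounding integer samples on the row."""
    taps = row[x - 2:x + 4]
    return float(np.dot(taps, HALF_PEL_FILTER))

def interpolate_quarter_sample(row, x):
    # A quarter-sample value may then be obtained by averaging the nearest
    # integer-sample and half-sample values, as in H.264/AVC.
    return 0.5 * (row[x] + interpolate_half_sample(row, x))
```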
  • Block partitioning for inter prediction: Many coding standards, including H.264/AVC and HEVC, allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.
  • a reference picture index to a reference picture list may be used to indicate which one of the multiple reference pictures is used for inter prediction for a particular block.
  • a reference picture index may be coded by an encoder into the bitstream in selected inter coding modes or it may be derived (by an encoder and a decoder) for example using neighboring blocks in other inter coding modes.
  • Motion vector prediction: In order to represent motion vectors efficiently in bitstreams, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and to signal the chosen candidate as the motion vector predictor.
  • the reference index of previously coded/decoded picture can be predicted. The reference index may be predicted, e.g. from adjacent blocks and/or co-located blocks in temporal reference picture. Differential coding of motion vectors may be disabled across slice boundaries.
  • Multi-hypothesis motion-compensated prediction: H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices.
  • Individual blocks in B slices may be bi-predicted, uni-predicted or intra- predicted, and individual blocks in P slices may be uni-predicted or intra-predicted.
  • the reference pictures for a bi-predictive picture may not be limited to be the subsequent picture and the previous picture in output order, but rather any reference picture may be used.
  • In many coding standards, such as H.264/AVC and HEVC, one reference picture list, referred to as reference picture list 0, is constructed for P slices, and two reference picture lists, list 0 and list 1, are constructed for B slices.
  • For B slices, prediction in the forward direction may refer to prediction from a reference picture in reference picture list 0, and prediction in the backward direction may refer to prediction from a reference picture in reference picture list 1, even though the reference pictures for prediction may not have any decoding or output order relation to each other or to the current picture.
  • Weighted prediction: Many coding standards use a prediction weight of 1 for prediction blocks of an inter (P) picture and 0.5 for each prediction block of a B picture (resulting in averaging). H.264/AVC allows weighted prediction for both P and B slices. In implicit weighted prediction, the weights are proportional to picture order counts (POC), while in explicit weighted prediction, prediction weights are explicitly indicated.
  • the inter prediction process may involve referring to sample locations outside picture boundaries at least for (but not necessarily limited to) the following reasons:
  • Motion vectors may point to prediction blocks outside picture boundaries; Motion vectors may point to a non-integer sample location for which the sample value is interpolated using filtering that takes input samples from locations that are outside picture boundaries.
  • a motion vector or a piece of motion information may be considered to comprise a horizontal motion vector component and a vertical motion vector component. Sometimes, a motion vector or a piece of motion information may be considered to comprise also information or identification which reference picture is used.
  • a motion field associated with a picture may be considered to comprise a set of motion information produced for every coded block of the picture. A motion field may be accessible by the coordinates of a block, for example.
  • a motion field may be used for example in TMVP of HEVC or any other motion prediction mechanism where a source or a reference for prediction other than the current (de)coded picture is used.
  • Different spatial granularity or units may be applied to represent and/or store a motion field.
  • a regular grid of spatial units may be used.
  • a picture may be divided into rectangular blocks of certain size (with the possible exception of blocks at the edges of the picture, such as on the right edge and the bottom edge).
  • the size of the spatial unit may be equal to the smallest size for which a distinct motion can be indicated by the encoder in the bitstream, such as a 4x4 block in luma sample units.
  • a so-called compressed motion field may be used, where the spatial unit may be equal to a pre-defined or indicated size, such as a 16x16 block in luma sample units, which size may be greater than the smallest size for indicating distinct motion.
  • an HEVC encoder and/or decoder may be implemented in a manner that a motion data storage reduction (MDSR) or motion field compression is performed for each decoded motion field (prior to using the motion field for any prediction between pictures).
  • MDSR may reduce the granularity of motion data to 16x16 blocks in luma sample units by keeping the motion applicable to the top-left sample of the 16x16 block in the compressed motion field.
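A minimal sketch of such motion field compression, assuming motion is stored per 4x4 luma unit and keyed by the unit's top-left luma coordinates; the dictionary representation and function name are purely illustrative.

```python
def compress_motion_field(motion_field, compressed_unit=16):
    """Keep, for each compressed_unit x compressed_unit block (in luma samples),
    only the motion applicable to its top-left 4x4 unit, as described above.

    motion_field: dict mapping (x, y) of a 4x4 unit to its motion information.
    """
    compressed = {}
    for (x, y), motion in motion_field.items():
        block_x = (x // compressed_unit) * compressed_unit
        block_y = (y // compressed_unit) * compressed_unit
        if x == block_x and y == block_y:  # top-left 4x4 unit of the 16x16 block
            compressed[(block_x, block_y)] = motion
    return compressed
```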
  • the encoder may encode indication(s) related to the spatial unit of the compressed motion field as one or more syntax elements and/or syntax element values for example in a sequence-level syntax structure, such as a video parameter set or a sequence parameter set.
  • a motion field may be represented and/or stored according to the block partitioning of the motion prediction (e.g. according to prediction units of the HEVC standard).
  • a combination of a regular grid and block partitioning may be applied so that motion associated with partitions greater than a pre-defined or indicated spatial unit size is represented and/or stored associated with those partitions, whereas motion associated with partitions smaller than or unaligned with a pre-defined or indicated spatial unit size or grid is represented and/or stored for the pre-defined or indicated units.
  • Video encoders may utilize Lagrangian cost functions to find rate-distortion (RD) optimal coding modes, e.g. the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: C = D + λR, where
  • C is the Lagrangian cost to be minimized,
  • D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and
  • R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
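A minimal sketch of mode selection with this Lagrangian cost; the generation of candidate modes and the measurement of their distortion and rate are assumed to be provided by the caller.

```python
def select_best_mode(candidates, lagrange_multiplier):
    """Pick the coding mode minimizing C = D + lambda * R.

    candidates: iterable of (mode, distortion, rate_in_bits).
    """
    best_mode, best_cost = None, float("inf")
    for mode, distortion, rate in candidates:
        cost = distortion + lagrange_multiplier * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost
```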
  • Some key definitions, bitstream and coding structures, and concepts of H.264/AVC and HEVC are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented. Some of the key definitions, bitstream and coding structures, and concepts of H.264/AVC are the same as in HEVC; hence, they are described below jointly. The aspects of the invention are not limited to H.264/AVC or HEVC, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized.
  • bitstream syntax and semantics as well as the decoding process for error-free bitstreams are specified in H.264/AVC and HEVC.
  • the encoding process is not specified, but encoders must generate conforming bitstreams.
  • Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD).
  • HRD Hypothetical Reference Decoder
  • the standards contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding is optional and no decoding process has been specified for erroneous bitstreams.
  • a syntax element may be defined as an element of data represented in the bitstream.
  • a syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
  • a phrase “by external means” or “through external means” may be used.
  • an entity such as a syntax structure or a value of a variable used in the decoding process, may be provided "by external means" to the decoding process.
  • the phrase “by external means” may indicate that the entity is not included in the bitstream created by the encoder, but rather conveyed externally from the bitstream for example using a control protocol. It may alternatively or additionally mean that the entity is not created by the encoder, but may be created for example in the player or decoding control logic or alike that is using the decoder.
  • the decoder may have an interface for inputting the external means, such as variable values.
  • the elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture.
  • a picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.
  • the source and decoded pictures are each comprised of one or more sample arrays, such as one of the following sets of sample arrays:
  • Luma and two chroma (YCbCr or YCgCo).
  • Green, Blue and Red (GBR, also known as RGB).
  • YZX (also known as XYZ).
  • these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr; regardless of the actual color representation method in use.
  • the actual color representation method in use can be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC.
  • VUI Video Usability Information
  • a component may be defined as an array or a single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that composes a picture in monochrome format.
  • a picture may either be a frame or a field.
  • a frame comprises a matrix of luma samples and possibly the corresponding chroma samples.
  • a field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced.
  • Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or chroma sample arrays may be subsampled when compared to luma sample arrays.
  • Chroma formats may be summarized as follows:
  • In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
  • In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
  • In 4:4:4 sampling, each of the two chroma arrays has the same height and width as the luma array.
  • In H.264/AVC and HEVC it is possible to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream.
  • each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
  • a partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
  • a coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning.
  • a coding tree block may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning.
  • a coding tree unit may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a coding unit may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • video pictures are divided into coding units (CU) covering the area of the picture.
  • a CU consists of one or more prediction units (PU) defining the prediction process for the samples within the CU and one or more transform units (TU) defining the prediction error coding process for the samples in the said CU.
  • PU prediction units
  • TU transform units
  • a CU consists of a square block of samples with a size selectable from a predefined set of possible CU sizes.
  • a CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
  • An LCU can be further split into a combination of smaller CUs, e.g. by recursively splitting the LCU and the resulting CUs.
  • Each resulting CU typically has at least one PU and at least one TU associated with it.
  • Each PU and TU can be further split into smaller PUs and TUs in order to increase granularity of the prediction and prediction error coding processes, respectively.
  • Each PU has prediction information associated with it defining what kind of a prediction is to be applied for the pixels within that PU (e.g. motion vector information for inter predicted PUs and intra prediction directionality information for intra predicted PUs).
  • Each TU can be associated with information describing the prediction error decoding process for the samples within the said TU (including e.g. DCT coefficient information). It is typically signalled at CU level whether prediction error coding is applied or not for each CU. In the case there is no prediction error residual associated with the CU, it can be considered there are no TUs for the said CU.
  • the division of the image into CUs, and division of CUs into PUs and TUs is typically signalled in the bitstream allowing the decoder to reproduce the intended structure of these units.
  • a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs.
  • the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum.
  • a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit.
  • a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning.
  • an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment
  • a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order.
  • a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment
  • a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment.
  • the CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
  • Video coding standards and specifications may allow encoders to divide a coded picture into coded slices or alike. In H.264/AVC and HEVC, in-picture prediction is typically disabled across slice boundaries. Thus, slices can be regarded as a way to split a coded picture into independently decodable pieces, and slices are therefore often regarded as elementary units for transmission. In many cases, encoders may indicate in the bitstream which types of in-picture prediction are turned off across slice boundaries, and the decoder operation takes this information into account for example when concluding which prediction sources are available. For example, samples from a neighboring macroblock or CU may be regarded as unavailable for intra prediction, if the neighboring macroblock or CU resides in a different slice.
  • NAL Network Abstraction Layer
  • a NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with startcode emulation prevention bytes.
  • a raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit.
  • An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
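  • As a non-normative illustration of how an RBSP is encapsulated with start code emulation prevention bytes, a Python sketch following the well-known rule that a 0x03 byte is inserted whenever two consecutive zero bytes would otherwise be followed by a byte value of 0x03 or less:

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert emulation prevention bytes so that the NAL unit payload cannot
    contain a start code prefix (0x000001) or an extra emulation pattern."""
    out = bytearray()
    zero_run = 0
    for b in rbsp:
        if zero_run >= 2 and b <= 0x03:
            out.append(0x03)          # emulation prevention byte
            zero_run = 0
        out.append(b)
        zero_run = zero_run + 1 if b == 0x00 else 0
    return bytes(out)

assert add_emulation_prevention(b"\x00\x00\x01") == b"\x00\x00\x03\x01"
```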
  • NAL units consist of a header and payload.
  • a two-byte NAL unit header is used for all specified NAL unit types.
  • the NAL unit header contains one reserved bit, a six-bit NAL unit type indication, a three-bit nuh_temporal_id_plus1 indication for temporal level (may be required to be greater than or equal to 1) and a six-bit nuh_layer_id syntax element.
  • nuh_temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes.
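  • For illustration, a Python sketch of parsing the two-byte HEVC NAL unit header fields mentioned above (1-bit reserved bit, 6-bit NAL unit type, 6-bit nuh_layer_id, 3-bit nuh_temporal_id_plus1):

```python
def parse_nal_unit_header(header: bytes):
    """Return (nal_unit_type, nuh_layer_id, TemporalId) from a 2-byte header."""
    v = (header[0] << 8) | header[1]
    nal_unit_type = (v >> 9) & 0x3F
    nuh_layer_id = (v >> 3) & 0x3F
    nuh_temporal_id_plus1 = v & 0x07   # must be non-zero
    return nal_unit_type, nuh_layer_id, nuh_temporal_id_plus1 - 1

# Example: 0x4001 -> nal_unit_type 32, nuh_layer_id 0, TemporalId 0
print(parse_nal_unit_header(b"\x40\x01"))
```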
  • the bitstream created by excluding all VCL NAL units having a TemporalId greater than or equal to a selected value and including all other VCL NAL units remains conforming. Consequently, a picture having TemporalId equal to TID does not use any picture having a TemporalId greater than TID as inter prediction reference.
  • a sub-layer or a temporal sub-layer may be defined to be a temporal scalable layer of a temporal scalable bitstream, consisting of VCL NAL units with a particular value of the TemporalId variable and the associated non-VCL NAL units. nuh_layer_id can be understood as a scalability layer identifier.
  • NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units.
  • VCL NAL units contain syntax elements representing one or more coded macroblocks, each of which corresponds to a block of samples in the uncompressed picture.
  • In HEVC, VCL NAL units contain syntax elements representing one or more CUs.
  • a non-VCL NAL unit may be for example one of the following types: a sequence parameter set, a picture parameter set, a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit.
  • SEI Supplemental Enhancement Information
  • Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the reconstruction of decoded sample values.
  • Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set.
  • the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation.
  • VUI video usability information
  • a sequence parameter set RBSP includes parameters that can be referred to by one or more picture parameter set RBSPs or one or more SEI NAL units containing a buffering period SEI message.
  • a picture parameter set contains such parameters that are likely to be unchanged in several coded pictures.
  • a picture parameter set RBSP may include parameters that can be referred to by the coded slice NAL units of one or more coded pictures.
  • a video parameter set may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences as determined by the content of a syntax element found in the SPS referred to by a syntax element found in the PPS referred to by a syntax element found in each slice segment header.
  • a video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
  • VPS resides one level above SPS in the parameter set hierarchy and in the context of scalability and/or 3D video.
  • VPS may include parameters that are common for all slices across all (scalability or view) layers in the entire coded video sequence.
  • SPS includes the parameters that are common for all slices in a particular (scalability or view) layer in the entire coded video sequence, and may be shared by multiple (scalability or view) layers.
  • PPS includes the parameters that are common for all slices in a particular layer representation (the representation of one scalability or view layer in one access unit) and are likely to be shared by all slices in multiple layer representations.
  • H.264/AVC and HEVC syntax allows many instances of parameter sets, and each instance is identified with a unique identifier. In order to limit the memory usage needed for parameter sets, the value range for parameter set identifiers has been limited.
  • each slice header includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices.
  • parameter sets can be included as a parameter in the session description for Real-time Transport Protocol (RTP) sessions. If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
  • RTP Real-time Transport Protocol
  • Out-of-band transmission, signaling or storage can additionally or alternatively be used for other purposes than tolerance against transmission errors, such as ease of access or session negotiation.
  • a sample entry of a track in a file conforming to the ISOBMFF may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file.
  • the phrase along the bitstream (e.g. indicating along the bitstream) may be used in claims and described embodiments to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
  • intra prediction modes There may be different types of intra prediction modes available in a coding scheme, out of which an encoder can select and indicate the used one, e.g. on block or coding unit basis.
  • a decoder may decode the indicated intra prediction mode and reconstruct the prediction block accordingly.
  • several angular intra prediction modes, each for different angular direction, may be available.
  • Angular intra prediction may be considered to extrapolate the border samples of adjacent blocks along a linear prediction direction.
  • a planar prediction mode may be available.
  • Planar prediction may be considered to essentially form a prediction block, in which each sample of a prediction block may be specified to be an average of the vertically aligned sample in the adjacent sample column on the left of the current block and the horizontally aligned sample in the adjacent sample line above the current block. Additionally or alternatively, a DC prediction mode may be available, in which the prediction block is essentially an average sample value of a neighboring block or blocks.
  • H.265/HEVC includes two motion vector prediction schemes, namely the advanced motion vector prediction (AMVP) and the merge mode.
  • AMVP advanced motion vector prediction
  • In the merge mode, a list of motion vector candidates is derived for a PU.
  • The candidates comprise spatial candidates and temporal candidates, where temporal candidates may also be referred to as TMVP candidates.
  • One of the candidates in the merge list and/or the candidate list for AMVP or any similar motion vector candidate list may be a TMVP candidate or alike, which may be derived from the collocated block within an indicated or inferred reference picture, such as the reference picture indicated for example in the slice header.
  • the reference picture list to be used for obtaining a collocated partition is chosen according to the collocated_from_l0_flag syntax element in the slice header. When the flag is equal to 1, it specifies that the picture that contains the collocated partition is derived from list 0, otherwise the picture is derived from list 1. When collocated_from_l0_flag is not present, it is inferred to be equal to 1.
  • collocated_ref_idx in the slice header specifies the reference index of the picture that contains the collocated partition.
  • When the current slice is a P slice, collocated_ref_idx refers to a picture in list 0.
  • When the current slice is a B slice, collocated_ref_idx refers to a picture in list 0 if collocated_from_l0_flag is 1, otherwise it refers to a picture in list 1.
  • collocated_ref_idx always refers to a valid list entry, and the resulting picture is the same for all slices of a coded picture. When collocated_ref_idx is not present, it is inferred to be equal to 0.
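  • As an illustrative (non-normative) Python sketch of how these syntax elements select the collocated picture; the dictionary-style slice header and list inputs are placeholders:

```python
def collocated_picture(slice_header, ref_pic_list_0, ref_pic_list_1):
    """Pick the picture containing the collocated partition for TMVP."""
    from_l0 = slice_header.get("collocated_from_l0_flag", 1)  # inferred 1 if absent
    ref_idx = slice_header.get("collocated_ref_idx", 0)       # inferred 0 if absent
    return ref_pic_list_0[ref_idx] if from_l0 else ref_pic_list_1[ref_idx]

print(collocated_picture({}, ["pocA", "pocB"], ["pocC"]))  # -> 'pocA'
```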
  • the so-called target reference index for temporal motion vector prediction in the merge list is set as 0 when the motion coding mode is the merge mode.
  • the target reference index values are explicitly indicated (e.g. per each PU).
  • PMV predicted motion vector
  • the motion vector value of the temporal motion vector prediction may be derived as follows: The motion vector PMV at the block that is collocated with the bottom-right neighbor of the current prediction unit is obtained.
  • the picture where the collocated block resides may be e.g. determined according to the signalled reference index in the slice header as described above. If the PMV at bottom-right neighbor is not available, the motion vector PMV at the location of the current PU of the collocated picture is obtained.
  • the determined available motion vector PMV at the co-located block is scaled with respect to the ratio of a first picture order count difference and a second picture order count difference.
  • the first picture order count (POC) difference is derived between the picture containing the co-located block and the reference picture of the motion vector of the co-located block.
  • the second picture order count difference is derived between the current picture and the target reference picture. If one but not both of the target reference picture and the reference picture of the motion vector of the collocated block is a long-term reference picture (while the other is a short-term reference picture), the TMVP candidate may be considered unavailable. If both of the target reference picture and the reference picture of the motion vector of the collocated block are long-term reference pictures, no POC- based motion vector scaling may be applied.
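  • A simplified floating-point Python sketch of the POC-based motion vector scaling described above; the actual HEVC derivation uses a clipped fixed-point computation, which is omitted here:

```python
def scale_tmvp(mv, poc_current, poc_target_ref, poc_colocated, poc_colocated_ref):
    """Scale the collocated motion vector by the ratio of the two POC differences."""
    tb = poc_current - poc_target_ref         # second POC difference
    td = poc_colocated - poc_colocated_ref    # first POC difference
    scale = tb / td
    return (round(mv[0] * scale), round(mv[1] * scale))

# The collocated MV spans 4 pictures while the current prediction spans 2,
# so the candidate is halved.
print(scale_tmvp((8, -4), 6, 4, 8, 4))   # -> (4, -2)
```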
  • motion vectors are allowed to point to an area outside the picture boundaries for obtaining the prediction block and fractional sample interpolation may use sample locations outside picture boundaries.
  • the respective samples of the picture boundaries are effectively copied.
  • the mechanism to support sample locations outside picture boundaries in the inter prediction process may be implemented in multiple ways. One way is to allocate a sample array that is larger than the decoded picture size, i.e. has margins on top of, below, on the right side, and on the left side of the image.
  • the location of a sample used for prediction may be saturated so that the location does not exceed the picture boundaries (with margins, if such are used).
  • samples outside picture boundaries can be used as reference due to motion vectors pointing outside the picture boundaries and/or due to fractional sample interpolation using sample values outside the picture boundaries, as described earlier. Thanks to the fact that the entire 360 degrees of field-of-view is represented, the sample values from the opposite side of the picture can be used instead of the conventional approach of using the boundary sample when a sample horizontally outside the picture boundary is needed in a prediction process.
  • the margins can be set e.g. to cover the largest prediction block size, and the margins may be used for a prediction block that refers to both samples inside decoded picture boundaries and samples outside decoded picture boundaries. Sample location wrapping is used for prediction units that are completely outside decoded picture boundaries. This combination method may enable faster memory access than the approach of only using wrapping of the sample location.
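  • For illustration only, a Python sketch of deriving a reference sample location for a 360-degree panorama: horizontally the location wraps around to the opposite side of the picture, while vertically it is saturated (clamped) to the picture boundary as in the conventional approach:

```python
def reference_sample_location(x, y, pic_width, pic_height):
    """Map a (possibly out-of-picture) location to a valid reference location."""
    x_wrapped = x % pic_width                     # 360-degree horizontal wrap-around
    y_clamped = min(max(y, 0), pic_height - 1)    # conventional vertical saturation
    return x_wrapped, y_clamped

print(reference_sample_location(-3, 10, 1280, 640))    # -> (1277, 10)
print(reference_sample_location(1285, -2, 1280, 640))  # -> (5, 0)
```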
  • a motion vector can refer to a non-integer sample position in a reference picture.
  • the sample values at a non-integer sample position can be obtained through a fractional sample interpolation process.
  • a different process may be used for the luma sample array than for the chroma sample arrays.
  • a fractional sample interpolation process for luma according to an example may operate as described next. The presented process is from HEVC and it needs to be understood that it is provided for exemplary purposes and that a similar process can be realized e.g. by changing the number of filter taps.
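  • A Python sketch of horizontal half-sample luma interpolation, assuming the 8-tap half-sample filter coefficients of HEVC; intermediate precision, vertical filtering and the quarter-sample filters are omitted for brevity:

```python
HALF_PEL_FILTER = (-1, 4, -11, 40, 40, -11, 4, -1)   # coefficients sum to 64

def half_pel_sample(row, x):
    """Half-sample value between row[x] and row[x + 1] on one sample row."""
    acc = 0
    for k, coeff in enumerate(HALF_PEL_FILTER):
        pos = min(max(x - 3 + k, 0), len(row) - 1)   # repeat boundary samples
        acc += coeff * row[pos]
    return (acc + 32) >> 6                           # normalize by 64 with rounding

print(half_pel_sample([100] * 16, 7))   # flat signal -> 100
```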
  • Scalable video coding may refer to coding structure where one bitstream can contain multiple representations of the content, for example, at different bitrates, resolutions or frame rates.
  • the receiver can extract the desired representation depending on its characteristics (e.g. resolution that matches best the display device).
  • a server or a network element can extract the portions of the bitstream to be transmitted to the receiver depending on e.g. the network characteristics or processing capabilities of the receiver.
  • a meaningful decoded representation can be produced by decoding only certain parts of a scalable bit stream.
  • a scalable bitstream typically consists of a "base layer" providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers.
  • the coded representation of that layer typically depends on the lower layers.
  • the motion and mode information of the enhancement layer can be predicted from lower layers.
  • the pixel data of the lower layers can be used to create prediction for the enhancement layer.
  • a video signal can be encoded into a base layer and one or more enhancement layers.
  • An enhancement layer may enhance, for example, the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof.
  • Each layer together with all its dependent layers is one representation of the video signal, for example, at a certain spatial resolution, temporal resolution and quality level.
  • a scalable layer together with all of its dependent layers may be referred to as a "scalable layer representation".
  • the portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at certain fidelity.
  • Scalability modes or scalability dimensions may include but are not limited to the following:
  • Quality scalability Base layer pictures are coded at a lower quality than enhancement layer pictures, which may be achieved for example using a greater quantization parameter value (i.e., a greater quantization step size for transform coefficient quantization) in the base layer than in the enhancement layer.
  • Spatial scalability Base layer pictures are coded at a lower resolution (i.e. have fewer samples) than enhancement layer pictures. Spatial scalability and quality scalability, particularly its coarse-grain scalability type, may sometimes be considered the same type of scalability.
  • Bit-depth scalability Base layer pictures are coded at lower bit-depth (e.g. 8 bits) than enhancement layer pictures (e.g. 10 or 12 bits).
  • Dynamic range scalability Scalable layers represent a different dynamic range and/or images obtained using a different tone mapping function and/or a different optical transfer function.
  • Chroma format scalability Base layer pictures provide lower spatial resolution in chroma sample arrays (e.g. coded in 4:2:0 chroma format) than enhancement layer pictures (e.g. 4:4:4 format).
  • Color gamut scalability enhancement layer pictures have a richer/broader color representation range than that of the base layer pictures - for example the enhancement layer may have UHDTV (ITU-R BT.2020) color gamut and the base layer may have the ITU-R BT.709 color gamut.
  • View scalability which may also be referred to as multiview coding.
  • the base layer represents a first view
  • an enhancement layer represents a second view.
  • Depth scalability which may also be referred to as depth-enhanced coding.
  • a layer or some layers of a bitstream may represent texture view(s), while other layer or layers may represent depth view(s).
  • Region-of-interest scalability (as described below).
  • Interlaced-to-progressive scalability also known as field-to-frame scalability: coded interlaced source content material of the base layer is enhanced with an enhancement layer to represent progressive source content.
  • Hybrid codec scalability also known as coding standard scalability
  • base layer pictures are coded according to a different coding standard or format than enhancement layer pictures.
  • the base layer may be coded with H.264/AVC and an enhancement layer may be coded with an HEVC multi-layer extension.
  • An external base layer picture may be defined as a decoded picture that is provided by external means for the enhancement-layer decoding process and that is treated like a decoded base-layer picture for the enhancement layer decoding process.
  • SHVC and MV-HEVC allow the use of external base layer pictures. It should be understood that many of the scalability types may be combined and applied together. For example color gamut scalability and bit-depth scalability may be combined.
  • the term layer may be used in context of any type of scalability, including view scalability and depth enhancements.
  • An enhancement layer may refer to any type of an enhancement, such as SNR, spatial, multiview, depth, bit-depth, chroma format, and/or color gamut enhancement.
  • a base layer may refer to any type of a base video sequence, such as a base view, a base layer for SNR/spatial scalability, or a texture base view for depth-enhanced video coding.
  • a view may be defined as a sequence of pictures representing one camera or viewpoint.
  • the pictures representing a view may also be called view components.
  • a view component may be defined as a coded representation of a view in a single access unit.
  • multiview video coding more than one view is coded in a bitstream.
  • inter-view prediction may be utilized in multiview video coding to take advantage of inter-view correlation and improve compression efficiency.
  • One way to realize inter-view prediction is to include one or more decoded pictures of one or more other views in the reference picture list(s) of a picture being coded or decoded residing within a first view.
  • View scalability may refer to such multiview video coding or multiview video bitstreams, which enable removal or omission of one or more coded views, while the resulting bitstream remains conforming and represents video with a smaller number of views than originally.
  • Region of Interest (ROI) coding may be defined to refer to coding a particular region within a video at a higher fidelity.
  • ROI scalability may be defined as a type of scalability wherein an enhancement layer enhances only part of a source picture for inter-layer prediction e.g. spatially, quality- wise, in bit-depth, and/or along other scalability dimensions.
  • ROI scalability may be used together with other types of scalabilities, it may be considered to form a different categorization of scalability types.
  • an enhancement layer can be transmitted to enhance the quality and/or a resolution of a region in the base layer.
  • a decoder receiving both enhancement and base layer bitstream might decode both layers and overlay the decoded pictures on top of each other and display the final picture.
  • Scalability may be enabled in two basic ways. Either by introducing new coding modes for performing prediction of pixel values or syntax from lower layers of the scalable representation or by placing the lower layer pictures to a reference picture buffer (e.g. the decoded picture buffer, DPB) of the higher layer.
  • the first approach may be more flexible and thus may provide better coding efficiency in most cases.
  • the second approach may be implemented efficiently with minimal changes to single layer codecs while still achieving majority of the coding efficiency gains available.
  • the second approach may be called for example reference frame based scalability or high-level-syntax-only scalable video coding.
  • a reference frame based scalability codec may be implemented by utilizing the same hardware or software implementation for all the layers, just taking care of the DPB management by external means.
  • a scalable video encoder for quality scalability (also known as Signal-to-Noise or SNR scalability) and/or spatial scalability may be implemented as follows.
  • a conventional non-scalable video encoder and decoder may be used for a base layer.
  • the reconstructed/decoded pictures of the base layer are included in the reference picture buffer and/or reference picture lists for an enhancement layer.
  • the reconstructed/decoded base-layer picture may be upsampled prior to its insertion into the reference picture lists for an enhancement-layer picture.
  • the base layer decoded pictures may be inserted into a reference picture list(s) for coding/decoding of an enhancement layer picture similarly to the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as an inter prediction reference and indicate its use with a reference picture index in the coded bitstream.
  • the decoder decodes from the bitstream, for example from a reference picture index, that a base-layer picture is used as an inter prediction reference for the enhancement layer.
  • a base-layer picture is used as an inter prediction reference for the enhancement layer.
  • a second enhancement layer may depend on a first enhancement layer in encoding and/or decoding processes, and the first enhancement layer may therefore be regarded as the base layer for the encoding and/or decoding of the second enhancement layer.
  • bit-depth of the samples of the reference-layer picture may be converted to the bit-depth of the enhancement layer and/or the sample values may undergo a mapping from the color space of the reference layer to the color space of the enhancement layer.
  • a scalable video coding and/or decoding scheme may use multi-loop coding and/or decoding, which may be characterized as follows.
  • a base layer picture may be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as a reference for inter-layer (or inter-view or inter-component) prediction.
  • the reconstructed/decoded base layer picture may be stored in the DPB.
  • An enhancement layer picture may likewise be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as reference for inter-layer (or inter-view or inter-component) prediction for higher enhancement layers, if any.
  • syntax element values of the base/reference layer or variables derived from the syntax element values of the base/reference layer may be used in the inter-layer/inter-component/inter-view prediction. Scalable and multiview extensions to the first version of the High Efficiency Video Coding (HEVC) standard were finalized in 2014.
  • HEVC High Efficiency Video Coding
  • the scalable video coding extension provides a mechanism for offering spatial, bit-depth, color gamut, and quality scalability while exploiting the inter-layer redundancy.
  • the multiview extension MV-HEVC
  • MV-HEVC enables coding of multiview video data suitable e.g. for stereoscopic displays.
  • the input multiview video sequences for encoding are typically captured by a number of cameras arranged in a row.
  • the camera projection centers are typically collinear and equally distant from each neighbor and cameras typically point to the same direction.
  • SHVC and MV-HEVC share the same high-level syntax and most parts of their decoding process are also identical, which makes it appealing to support both SHVC and MV-HEVC with the same codec implementation.
  • SHVC and MV- HEVC were included in HEVC version 2.
  • inter-view reference pictures can be included in the reference picture list(s) of the current picture being coded or decoded.
  • SHVC uses multi-loop decoding operation.
  • SHVC may be considered to use a reference index based approach, i.e. an inter-layer reference picture can be included in one or more reference picture lists of the current picture being coded or decoded (as described above).
  • the concepts and coding tools of HEVC base layer may be used in SHVC, MV-HEVC, and/or alike.
  • the additional inter-layer prediction tools, which employ already coded data (including reconstructed picture samples and motion parameters, a.k.a. motion information) in a reference layer for efficiently coding an enhancement layer, may be integrated into SHVC, MV-HEVC, and/or alike codecs.
  • prediction methods applied for video and/or image coding and/or decoding may be categorized into sample prediction and syntax prediction.
  • a complementary way of categorizing different types of prediction is to consider across which domains or scalability types the prediction crosses. This categorization may lead into one or more of the following types of prediction, which may also sometimes be referred to as prediction directions:
  • Temporal prediction e.g. of sample values or motion vectors from an earlier picture usually of the same scalability layer, view and component type (texture or depth).
  • Inter-view prediction (which may be also referred to as cross-view prediction) referring to prediction taking place between view components usually of the same time instant or access unit and the same component type.
  • Inter-layer prediction referring to prediction taking place between layers usually of the same time instant, of the same component type, and of the same view.
  • Inter-component prediction may be defined to comprise prediction of syntax element values, sample values, variable values used in the decoding process, or anything alike from a component picture of one type to a component picture of another type.
  • inter-component prediction may comprise prediction of a texture view component from a depth view component, or vice versa.
  • inter-component prediction takes place from the luma component (or sample array) to the chroma components (or sample arrays).
  • Inter-layer prediction may be defined as prediction in a manner that is dependent on data elements (e.g., sample values or motion vectors) of reference pictures from a different layer than the layer of the current picture (being encoded or decoded).
  • the available types of inter-layer prediction may for example depend on the coding profile according to which the bitstream or a particular layer within the bitstream is being encoded or, when decoding, the coding profile that the bitstream or a particular layer within the bitstream is indicated to conform to.
  • the available types of inter-layer prediction may depend on the types of scalability or the type of a scalable codec or video coding standard amendment (e.g. SHVC, MV-HEVC, or 3D-HEVC) being used.
  • the types of inter-layer prediction may comprise, but are not limited to, one or more of the following: inter-layer sample prediction, inter-layer motion prediction, inter-layer residual prediction.
  • In inter-layer sample prediction, at least a subset of the reconstructed sample values of a source picture for inter-layer prediction are used as a reference for predicting sample values of the current picture.
  • In inter-layer motion prediction, at least a subset of the motion vectors of a source picture for inter-layer prediction are used as a reference for predicting motion vectors of the current picture.
  • predicting information on which reference pictures are associated with the motion vectors is also included in inter-layer motion prediction.
  • the reference indices of reference pictures for the motion vectors may be inter-layer predicted and/or the picture order count or any other identification of a reference picture may be inter-layer predicted.
  • inter-layer motion prediction may also comprise prediction of block coding mode, header information, block partitioning, and/or other similar parameters.
  • coding parameter prediction, such as inter-layer prediction of block partitioning, may be regarded as another type of inter-layer prediction.
  • In inter-layer residual prediction, the prediction error or residual of selected blocks of a source picture for inter-layer prediction is used for predicting the current picture.
  • Inter-view prediction may be considered to be equivalent or similar to inter-layer prediction but apply between views rather than other scalability types or dimensions.
  • inter-view prediction may refer only to inter-view sample prediction, which is similar to motion-compensated temporal prediction but applies between views.
  • inter-view prediction may be considered to comprise all types of prediction that can take place between views, such as both inter-view sample prediction and inter-view motion prediction.
  • cross-component inter-layer prediction may be applied, in which a picture of a first type, such as a depth picture, may affect the inter-layer prediction of a picture of a second type, such as a conventional texture picture.
  • disparity-compensated inter-layer sample value and/or motion prediction may be applied, where the disparity may be at least partially derived from a depth picture.
  • view synthesis prediction may be used when a prediction block is constructed at least partly on the basis of associated depth or disparity information.
  • a direct reference layer may be defined as a layer that may be used for inter-layer prediction of another layer for which the layer is the direct reference layer.
  • a direct predicted layer may be defined as a layer for which another layer is a direct reference layer.
  • An indirect reference layer may be defined as a layer that is not a direct reference layer of a second layer but is a direct reference layer of a third layer that is a direct reference layer or indirect reference layer of a direct reference layer of the second layer for which the layer is the indirect reference layer.
  • An indirect predicted layer may be defined as a layer for which another layer is an indirect reference layer.
  • An independent layer may be defined as a layer that does not have direct reference layers. In other words, an independent layer is not predicted using inter-layer prediction.
  • a non-base layer may be defined as any other layer than the base layer, and the base layer may be defined as the lowest layer in the bitstream.
  • An independent non-base layer may be defined as a layer that is both an independent layer and a non-base layer.
  • a source picture for inter-layer prediction may be defined as a decoded picture that either is, or is used in deriving, an inter-layer reference picture that may be used as a reference picture for prediction of the current picture. In multi-layer HEVC extensions, an inter-layer reference picture is included in an inter-layer reference picture set of the current picture.
  • An inter-layer reference picture may be defined as a reference picture that may be used for inter-layer prediction of the current picture.
  • the inter-layer reference pictures may be treated as long term reference pictures.
  • a reference-layer picture may be defined as a picture in a direct reference layer of a particular layer or a particular picture, such as the current layer or the current picture (being encoded or decoded).
  • a reference-layer picture may but need not be used as a source picture for inter-layer prediction.
  • the terms reference-layer picture and source picture for inter-layer prediction may be used interchangeably.
  • a source picture for inter-layer prediction may be required to be in the same access unit as the current picture.
  • the source picture for inter-layer prediction and the respective inter-layer reference picture may be identical.
  • inter-layer processing is applied to derive an inter-layer reference picture from the source picture for inter-layer prediction. Examples of such inter-layer processing are described in the next paragraphs.
  • Inter-layer sample prediction may comprise resampling of the sample array(s) of the source picture for inter-layer prediction.
  • the encoder and/or the decoder may derive a horizontal scale factor (e.g. stored in variable ScaleFactorHor) and a vertical scale factor (e.g. stored in variable ScaleFactorVer) for a pair of an enhancement layer and its reference layer for example based on the reference layer location offsets for the pair. If either or both scale factors are not equal to 1, the source picture for inter-layer prediction may be resampled to generate an inter-layer reference picture for predicting the enhancement layer picture.
  • the process and/or the filter used for resampling may be pre-defined for example in a coding standard and/or indicated by the encoder in the bitstream (e.g. as an index among pre-defined resampling processes or filters) and/or decoded by the decoder from the bitstream.
  • a different resampling process may be indicated by the encoder and/or decoded by the decoder and/or inferred by the encoder and/or the decoder depending on the values of the scale factor. For example, when both scale factors are less than 1, a pre-defined downsampling process may be inferred; and when both scale factors are greater than 1, a pre-defined upsampling process may be inferred.
  • a different resampling process may be indicated by the encoder and/or decoded by the decoder and/or inferred by the encoder and/or the decoder depending on which sample array is processed. For example, a first resampling process may be inferred to be used for luma sample arrays and a second resampling process may be inferred to be used for chroma sample arrays.
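  • A hypothetical Python sketch of how the resampling process could be inferred from the scale factors and the processed sample array, following the description above; the actual processes or filters would be pre-defined in a coding standard or signalled in the bitstream:

```python
def infer_resampling(scale_factor_hor, scale_factor_ver, is_luma=True):
    """Return an identifier of the resampling process to apply, or None."""
    if scale_factor_hor == 1 and scale_factor_ver == 1:
        return None                       # no resampling needed
    if scale_factor_hor < 1 and scale_factor_ver < 1:
        process = "downsampling"
    elif scale_factor_hor > 1 and scale_factor_ver > 1:
        process = "upsampling"
    else:
        process = "resampling"            # mixed horizontal/vertical case
    return process + ("_luma" if is_luma else "_chroma")

print(infer_resampling(2.0, 2.0))            # -> upsampling_luma
print(infer_resampling(0.5, 0.5, False))     # -> downsampling_chroma
```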
  • Resampling may be performed for example picture-wise (for the entire source picture for inter-layer prediction or reference region to be resampled), slice-wise (e.g. for a reference region corresponding to an enhancement layer slice) or block-wise (e.g. for a reference region corresponding to an enhancement layer coding tree unit).
  • the resampling of a determined region (e.g. a picture, slice, or coding tree unit in an enhancement layer picture) of a source picture for inter-layer prediction may for example be performed by looping over all sample positions of the determined region and performing a sample-wise resampling process for each sample position.
  • the filtering of a certain sample location may use variable values of the previous sample location.
  • SHVC and MV-HEVC enable inter-layer sample prediction and inter-layer motion prediction.
  • the inter-layer reference (ILR) picture is used to obtain the sample values of a prediction block.
  • the source picture for inter-layer prediction acts, without modifications, as an ILR picture.
  • inter-layer processing, such as resampling, is applied to the source picture for inter-layer prediction to obtain an ILR picture.
  • the source picture for inter-layer prediction may be cropped, upsampled and/or padded to obtain an ILR picture.
  • the relative position of the upsampled source picture for inter-layer prediction to the enhancement layer picture is indicated through so-called reference layer location offsets.
  • This feature enables region-of-interest (ROI) scalability, in which only a subset of the picture area of the base layer is enhanced in an enhancement layer picture.
  • ROI region-of-interest
  • SHVC enables the use of weighted prediction or a color-mapping process based on a 3D lookup table (LUT) for (but not limited to) color gamut scalability.
  • the 3D LUT approach may be described as follows.
  • the sample value range of each color component may be first split into two ranges, forming up to 2x2x2 octants, and then the luma ranges can be further split up to four parts, resulting in up to 8x2x2 octants.
  • Within each octant, a cross color component linear model is applied to perform color mapping.
  • four vertices are encoded into and/or decoded from the bitstream to represent a linear model within the octant.
  • the color-mapping table is encoded into and/or decoded from the bitstream separately for each color component.
  • Color mapping may be considered to involve three steps: First, the octant to which a given reference- layer sample triplet (Y, Cb, Cr) belongs is determined. Second, the sample locations of luma and chroma may be aligned through applying a color component adjustment process. Third, the linear mapping specified for the determined octant is applied.
  • the mapping may have cross-component nature, i.e. an input value of one color component may affect the mapped value of another color component. Additionally, if inter-layer resampling is also required, the input to the resampling process is the picture that has been color-mapped.
  • the color-mapping may (but need not) map samples of a first bit-depth to samples of another bit-depth.
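  • A rough, non-normative Python sketch of the three color-mapping steps (octant determination followed by a cross-component linear model); the LUT layout and coefficient form are assumptions for the example and do not reproduce the SHVC syntax:

```python
def map_colour(y, cb, cr, lut):
    """lut: hypothetical dict with luma split points, one split per chroma
    component, and per-octant (a, b, c, d) models for each output component."""
    oy = sum(1 for t in lut["y_splits"] if y >= t)       # luma part (0..3)
    ocb = 0 if cb < lut["cb_split"] else 1
    ocr = 0 if cr < lut["cr_split"] else 1
    models = lut["octants"][(oy, ocb, ocr)]
    return tuple(a * y + b * cb + c * cr + d for a, b, c, d in models)

lut = {"y_splits": [64, 128, 192], "cb_split": 128, "cr_split": 128,
       "octants": {(i, j, k): [(1.02, 0, 0, 2), (0, 1.0, 0, 0), (0, 0, 1.0, 0)]
                   for i in range(4) for j in range(2) for k in range(2)}}
print(map_colour(100, 120, 130, lut))   # -> (104.0, 120.0, 130.0)
```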
  • Inter-layer motion prediction may be realized as follows.
  • a temporal motion vector prediction process such as TMVP of H.265/HEVC, may be used to exploit the redundancy of motion data between different layers. This may be done as follows: when the source picture for inter-layer prediction is upsampled, the motion data of the source picture for inter-layer prediction is also mapped to the resolution of an enhancement layer in a process that may be referred to as motion field mapping (MFM). If the enhancement layer picture utilizes motion vector prediction from the base layer picture e.g. with a temporal motion vector prediction mechanism such as TMVP of H.265/HEVC, the corresponding motion vector predictor is originated from the mapped reference-layer motion field.
  • MFM motion field mapping
  • inter-layer motion prediction may be performed by setting the inter-layer reference picture as the collocated reference picture for TMVP derivation.
  • the mapped motion field is the source of TMVP candidates in the motion vector prediction process.
  • In MFM, the prediction dependency in source pictures for inter-layer prediction is duplicated to generate the reference picture list(s) for ILR pictures, while the motion vectors (MV) are re-scaled according to the spatial resolution ratio between the ILR picture and the base-layer picture.
  • MFM is not applied in MV-HEVC for the reference-view picture to be referenced during the inter-layer motion prediction process.
  • reference layer location offsets may be included in the PPS by the encoder and decoded from the PPS by the decoder. Reference layer location offsets may be used for but are not limited to achieving ROI scalability. Reference layer location offsets may comprise one or more of scaled reference layer offsets, reference region offsets, and resampling phase sets.
  • Scaled reference layer offsets may be considered to specify the horizontal and vertical offsets between the sample in the current picture that is collocated with the top-left luma sample of the reference region in a decoded picture in a reference layer and the horizontal and vertical offsets between the sample in the current picture that is collocated with the bottom-right luma sample of the reference region in a decoded picture in a reference layer. Another way is to consider scaled reference layer offsets to specify the positions of the corner samples of the upsampled reference region relative to the respective corner samples of the enhancement layer picture.
  • the scaled reference layer offset values may be signed.
  • Reference region offsets may be considered to specify the horizontal and vertical offsets between the top-left luma sample of the reference region in the decoded picture in a reference layer and the top-left luma sample of the same decoded picture as well as the horizontal and vertical offsets between the bottom-right luma sample of the reference region in the decoded picture in a reference layer and the bottom-right luma sample of the same decoded picture.
  • the reference region offset values may be signed.
  • a resampling phase set may be considered to specify the phase offsets used in resampling process of a source picture for inter-layer prediction. Different phase offsets may be provided for luma and chroma components.
  • Context-based Adaptive Binary Arithmetic Coding (CABAC), a type of entropy coder, is a lossless compression tool to code syntax elements (SEs).
  • SEs are the information that describes how a video has been encoded and how it should be decoded. SEs are typically defined for all the prediction methods (e.g. CU/PU/TU partition, prediction type, intra prediction mode, motion vectors, etc.) and prediction error (residual) coding information (e.g. residual skip/split, transform skip/split, coefficient last x, coefficient last y, significant coefficient, etc.).
  • Context modelling: The probability of each bin is estimated based on its expected properties and previously coded bins using the same context. Bins with the same behavior and distribution can share the same context. Context is usually defined based on the syntax element, bin position in the syntax element, luma/chroma, block size, prediction mode, and/or neighboring information. There are about 200 contexts defined in the HEVC standard. During arithmetic coding, each context has a probability state variable and determines the probability of the bin that is coded with that context. There are about 128 possible probability states defined in the probability state table of the HEVC standard;
  • Bins are coded by arithmetic coding based on the corresponding estimated probabilities. In special cases, bins may be coded with equal probability of 50% (also known as "bypass" coding);
  • Probability update: Based on the current probability state variable of the context and the value of the coded bin, the probability state variable of the context is updated.
  • a pre-defined update table has been defined in the HEVC standard.
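  • Purely as an illustration of the probability update step, a simplified adaptive estimate in Python; HEVC itself uses a pre-defined table of roughly 128 probability states rather than this explicit formula:

```python
def update_probability(p_lps, coded_bin, mps, alpha=0.05):
    """Update the probability of the least probable symbol (LPS) after one bin."""
    if coded_bin == mps:
        p_lps *= (1.0 - alpha)                 # MPS coded: LPS becomes less likely
    else:
        p_lps += alpha * (1.0 - p_lps)         # LPS coded: LPS becomes more likely
        if p_lps > 0.5:                        # swap roles if LPS exceeds 50%
            p_lps, mps = 1.0 - p_lps, 1 - mps
    return p_lps, mps

print(update_probability(0.3, coded_bin=1, mps=0))   # LPS observed -> LPS more likely
```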
  • IBC intra block copy
  • pseudo-cylindrically projected spherical images are represented by fewer pixels compared to respective cylindrically projected images (e.g. equirectangular panorama images) due to the fact that polar areas are not stretched. Due to fewer pixels, they may also compress better and are hence good candidates for panoramic image projection formats.
  • the boundary of the effective picture area of pseudo-cylindrically projected spherical images is not rectangular and does not match with boundaries of a block grid used in the image/video encoding and decoding process. Blocks including the boundary of the effective picture area contain a sharp edge. Sharp edges are not favorable for image/video coding for example due to the following reasons:
  • Intra prediction signal is typically not able to reproduce the sharp edge, causing the prediction error signal to be substantial and comprise a sharp edge too.
  • the high-frequency components cause an increase in the bit rate.
  • Many coding schemes have been tuned with the expectation that high-frequency components are less likely and/or with a smaller magnitude than the low-frequency components.
  • the prediction error coding may use the last non-zero coefficient in zig-zag order (i.e. from low to high frequencies).
  • the quantization of high-frequency components causes visible artefacts, such as ringing, for the entire decoded block (particularly in the proximity of the sharp edge).
  • a 360-degree panorama picture is received as an input.
  • the panorama picture has a non-rectangular effective picture area.
  • the picture area outside the effective picture area is absent or comprises ignorable sample values.
  • the picture area outside the effective picture area may e.g. comprise samples with black color (black-level luma and zero-level chroma).
  • the 360-degree panorama picture is placed on a (rectangular) block grid.
  • the 360-degree panorama picture may e.g.
  • Figure 6 illustrates a pseudo-cylindrical image with an effective picture area 610 indicated by a solid line.
  • the rectangular block grid 600 is depicted with a dashed line.
  • the input image is processed to become a reshaped picture that is non-rectangular and at least partially block-aligned.
  • one or more boundary blocks (e.g. blocks 601, 602, 603, 604) that contain a boundary of the effective picture area 610 are processed.
  • Each of these boundary blocks (e.g. blocks 601, 602, 603, 604) may be processed in one of the following three examples 1-3, or a combination thereof.
  • the alternatives may be combined for example by moving some of the samples of a boundary block (similarly to example 3 below) and removing the remaining samples of the boundary block (similarly to example 1 below). A different processing may be selected on boundary block basis.
  • Example 1: A boundary block may be removed. This can be effective particularly when there are only few samples within the boundary block that are also within the effective picture area. These samples may not be crucial for the displaying of the panorama image. It may be possible to interpolate or estimate the sample values of the removed samples in the decoding end.
  • Figure 7 shows an example of a selection of a boundary block 710 that may suit removal of the boundary block. As can be seen, this selected boundary block 710 comprises only relatively few samples within the effective picture area.
  • Figure 8 presents the reshaped picture resulting from the removal of the boundary block 710 shown in Figure 7.
  • Example 2: Copying sample values from an opposite-side boundary region of the effective picture area.
  • Boundary blocks 910, 920 may be filled by copying pixel values from boundary blocks 930, 940, 950 on the other side of the image (Figure 9). This can be effective particularly when there is a significant amount of samples within the boundary block that are also within the effective picture area, but only few samples within the boundary block that are outside the effective picture area.
  • boundary blocks 1010, 1020 may be filled by moving pixel values from boundary blocks 1030, 1040, 1050 other side of the image. In this case, most of the pixels of the blocks 1030, 1040, 1050 in the other side may be moved to the boundary blocks 1010, 1020.
  • content of a boundary block 1130 may be moved to the other side to generate an empty boundary block 1130. This can be effective for example, when there are few but important samples in the boundary block, and there is enough space in the boundary block of other side.
  • content of the block boundary 1130 at the right side is moved to the left boundary blocks 1109, 1110.
  • Example 2 or Example 3 may take place for example on sample row basis.
  • the sample row based copying or moving may be done as follows. In each sample row, the sample to be filled in closest to the left boundary of the effective picture area is copied or moved from the sample location closest to the boundary on the opposite side within the effective picture area, the sample to be filled in second closest to the left boundary of the effective picture area is copied or moved from the sample location second closest to the boundary on the opposite side within the effective picture area, and so on.
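A sketch of the sample-row based copying or moving described above, assuming that the leftmost and rightmost columns of the effective picture area on the current sample row are known; with move=True the source samples on the opposite side are cleared, as in Example 3, while move=False corresponds to copying as in Example 2. The names and the empty value are illustrative.

```python
def fill_left_from_right(row, left, right, count, move=False, empty=0):
    """row: one sample row; left/right: first/last columns of the effective picture
    area on this row; count: number of samples to fill to the left of the area."""
    for k in range(count):
        dst = left - 1 - k      # k-th sample to be filled, closest to the left boundary first
        src = right - k         # k-th source sample, closest to the opposite boundary first
        row[dst] = row[src]
        if move:
            row[src] = empty    # moving (Example 3) empties the opposite-side boundary region
```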
  • the reshaped picture (i.e., a picture that is reshaped according to any one or more of the previous examples 1 - 3) is then processed to obtain a rectangular picture.
  • the sample values for the areas outside the reshaped picture may be obtained e.g. by either or both of: i) any signal extrapolation method, such as copying the boundary sample of the picture sample-row-wise to the adjacent area outside the reshaped picture; ii) setting the sample values to a pre-defined value.
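A sketch of this second processing for one sample row, under illustrative assumptions about the row representation: the area outside the reshaped picture is filled either by sample-row-wise extrapolation of the boundary sample (option i) or by a pre-defined constant such as black-level luma (option ii). The function name and default value are assumptions.

```python
def pad_row_to_rectangle(row, left, right, mode="extrapolate", value=16):
    """row: one sample row of the rectangular output picture; left/right are the
    first/last columns occupied by the reshaped picture on this row."""
    for x in range(0, left):                  # area to the left of the reshaped picture
        row[x] = row[left] if mode == "extrapolate" else value
    for x in range(right + 1, len(row)):      # area to the right of the reshaped picture
        row[x] = row[right] if mode == "extrapolate" else value
```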
  • the reconstruction process of the original pseudo-cylindrical image shape is executed as follows.
  • the decoder is configured to decode a rectangular picture from a bitstream into a decoded rectangular picture.
  • the decoder is configured to determine a first effective picture area representing 360-degree panorama picture, wherein the first effective picture area is non-rectangular.
  • the first effective picture area represents the shape of the non-rectangular 360-degree panorama picture that was given as input to encoding.
  • the decoder is configured to determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block aligned.
  • the second effective picture area within the decoded rectangular picture contains a representation of the pixels of the non-rectangular 360-degree panorama picture that was given as input to encoding.
  • the first effective picture area and/or the second effective picture area may be indicated in the bitstream and/or predefined e.g. in a coding standard. For example, one or more of the following may be used: i) it may be indicated that the coded picture represents a particular pseudo-cylindrical projection of a spherical picture. It may be pre-defined that a particular first processing done according to a pre-defined algorithm was performed in the encoding, hence determining the second effective picture area; ii) the first effective picture area may be provided through a mathematical function, whose coefficient values may be provided in the bitstream; iii) the first effective picture area and/or the second effective picture area may be provided using a mask image.
  • the decoder is configured to identify a boundary block in which the first effective picture area and the second effective picture area do not match. Some samples may not have appropriate values in the first effective picture area within the boundary block. For example, on the left-hand side of the picture, the boundary of the first effective picture area may reside to the right of the boundary of the second effective picture area. Sample values for such sample locations are obtained by using one or more of the following techniques (technique 1, technique 2, technique 3). When more than one of the following techniques is used, the order of applying the techniques may be pre-defined, e.g. in a coding standard, or may be concluded e.g. by decoding ordering indications from the bitstream.
  • the decoder is configured to determine, e.g. using knowledge of the first effective picture area, the second effective picture area, and/or the first processing used in encoding, that a sample has been removed in the first processing used in encoding.
  • the removed samples in the boundary block can be reconstructed from the neighboring samples and samples from the opposite side of the image (within the first or second effective picture area) through interpolation and/or extrapolation.
  • the decoder is able to detect the extra samples which are copied to the outside of the effective picture area (Figure 9).
  • the extra samples can be set e.g. to a determined value, such as a value representing black color, or can be effectively removed from the 360-degree panorama image.
  • the decoder is able to detect the extra samples which have been moved from the opposite side of the image (Figures 10 and 11) and also the empty area inside the effective picture area corresponding to the locations of the moved pixels.
  • the decoder can reconstruct the 360-degree panorama image by moving the samples back to the empty area on the opposite side of the picture.
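A decoder-side sketch of reversing the move of Example 3 for one sample row, with illustrative names and an assumed black-level value of 16: the samples that the encoder moved into the left-hand boundary block are moved back to the empty area on the opposite side, and the vacated locations, now outside the first effective picture area, are set to a determined value.

```python
def restore_row(row, left, right, count, black=16):
    """left/right: boundary columns of the first (original) effective picture area on
    this row; count: number of samples the encoder moved to the left-hand side."""
    for k in range(count):
        src = left - 1 - k      # where the encoder placed the k-th moved sample
        dst = right - k         # its original location on the opposite side
        row[dst] = row[src]
        row[src] = black        # this location is outside the first effective picture area
```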
  • the method comprises determining 1210 an effective picture area representing 360-degree panorama picture, the effective picture area being non- rectangular; obtaining 1220 a mapped 360-degree panorama picture covering the effective picture area; obtaining 1230 a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing 1240 comprising: identifying a boundary block containing a boundary of the effective picture area; removing zero or more samples of the boundary block and setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: 1) copying sample values from an opposite-side boundary region of the effective picture area; 2) moving samples from an opposite-side boundary region of the effective picture area;
  • the method comprises determining 1320 a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular; determining 1330 a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decoding 1335 a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtaining 1340 a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing 1350 comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using a first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; copying samples from an opposite-side boundary region of the second effective picture area to the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
  • a mapped 360-degree panorama picture is resampled to another spatial resolution prior to its processing according to any presented encoding method.
  • the resampling factor may be chosen e.g. to align the picture width with the block grid.
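A small sketch of one possible way to choose such a resampling factor; the rounding direction and the block size are illustrative assumptions. The width is scaled up to the next multiple of the block size.

```python
def aligning_resample_factor(width: int, block: int = 64) -> float:
    target_width = ((width + block - 1) // block) * block   # round up to the next block-grid multiple
    return target_width / width

# For example, a 1000-sample-wide panorama with 64-sample blocks gives
# aligning_resample_factor(1000) == 1.024, i.e. a resampled width of 1024 samples.
```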
  • any of the presented decoding methods can be applied for intra picture decoding in a video decoder.
  • the 360-degree panorama picture obtained by applying the presented decoding methods may be used as a reference picture for decoding other pictures, e.g. in an inter prediction process and/or an inter-layer prediction process.
  • any of the presented decoding methods is applied for intra picture encoding in a video encoder.
  • a reconstructed 360-degree panorama picture is obtained similarly to applying the presented decoding methods; however, bitstream decoding may be omitted.
  • the reconstructed 360-degree panorama picture may be used as a reference picture for encoding other pictures, e.g. in an inter prediction process and/or an inter-layer prediction process.
  • any of the presented decoding methods is applied as post-processing to video decoding, i.e., the decoded rectangular picture may be used as a reference picture for decoding other pictures, e.g., in an inter prediction process and/or an inter-layer prediction process.
  • any of the presented decoding methods is applied to image decoding within an image decoding process or as post-processing to image decoding.
  • any of the presented encoding methods is applied as pre-processing to video encoding, i.e., the reconstructed rectangular picture (obtained from the encoded rectangular picture) may be used as a reference picture for encoding other pictures, e.g. in an inter prediction process and/or an inter- layer prediction process.
  • any of the presented encoding methods is applied to image encoding within an image encoding process or as pre-processing to image encoding.
  • Non-effective picture areas of the reconstructed/decoded picture are filled in by copying samples from the opposite side of the effective picture area (e.g., on a sample-row basis, the first sample outside the effective picture area is copied from the first sample inside the effective picture area on the opposite side, the second sample outside the effective picture area is copied from the second sample inside the effective picture area on the opposite side, and so on). Consequently, continuity of the picture data is obtained across the boundary of the effective picture area. Such copying may be performed for example for a certain horizontal margin beside the effective picture area. The copying may be followed by padding or other signal extrapolation to fill in the sample locations that remained unpadded.
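A sketch of this reference-picture filling for one sample row, assuming the left and right boundary columns of the effective picture area are known and that a fixed horizontal margin is used; the wrap-around copying is followed by simple padding of any remaining sample locations. The function name and margin handling are illustrative.

```python
def fill_reference_row(row, left, right, margin):
    """row: one sample row of the reconstructed/decoded picture; left/right: boundary
    columns of the effective picture area; margin: horizontal margin to fill by
    wrap-around copying."""
    width = len(row)
    for k in range(margin):
        if left - 1 - k >= 0:
            row[left - 1 - k] = row[right - k]     # k-th sample outside, taken from the opposite side
        if right + 1 + k < width:
            row[right + 1 + k] = row[left + k]
    for x in range(0, max(left - margin, 0)):      # pad whatever the margin did not cover
        row[x] = row[max(left - margin, 0)]
    for x in range(min(right + margin, width - 1) + 1, width):
        row[x] = row[min(right + margin, width - 1)]
```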
  • the reconstructed/decoded picture with processed areas outside the effective picture area is then stored as a reference picture for inter prediction and/or inter-layer prediction.
  • the embodiment described in the previous paragraph can improve compression efficiency by making inter prediction and/or inter-layer prediction more precise due to the following reasons.
  • the non-effective picture area is filled by samples from the opposite side of the effective picture area, which provides continuity in the boundary areas of the reference picture.
  • Motion vectors used in inter prediction and/or inter-layer prediction can therefore point to a prediction block that is partly or fully outside the effective picture area, while the prediction block remains continuous.
  • Fractional-sample interpolation for prediction samples inside but close to the effective picture boundary may use samples outside the effective picture as input. Filling in the samples outside the effective picture area may therefore make the input signal for fractional sample interpolation more continuous and hence improve the preciseness of fractional sample values.
  • the process of filling in some or all areas outside the effective picture area prior to storing the reconstructed/decoded picture as a reference picture for inter prediction and/or inter-layer prediction may be performed before or after applying in-loop filtering, such as deblocking filtering or sample adaptive offset filtering, or may be applied in between certain filtering steps.
  • embodiments can be realized by wrapping over the sample locations used for intra prediction, inter prediction, and/or inter-layer prediction to be within the effective picture area, but on the opposite side. This may apply to both fractional sample interpolation and selecting samples at integer sample positions for intra prediction, inter prediction, and/or inter-layer prediction.
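A sketch of such wrapping of a horizontal sample location, assuming the per-row left and right boundary columns of the effective picture area are known; positions to the left of the area wrap over to its right side and vice versa. The helper name is illustrative.

```python
def wrap_x(x, left, right):
    """left/right: boundary columns of the effective picture area on this sample row."""
    width = right - left + 1             # effective width of the row
    return left + (x - left) % width     # locations outside wrap over to the opposite side

# For example, with left=10 and right=109: wrap_x(9, 10, 109) == 109 and wrap_x(110, 10, 109) == 10.
```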
  • the samples of the prediction error block that are outside the effective picture area are set to 0.
  • the modified prediction error block may then be summed with the respective prediction block to obtain a reconstructed or decoded block.
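A minimal sketch of this reconstruction step; the 2-D list representation and helper name are illustrative. Prediction error samples outside the effective picture area are forced to zero before the prediction error block is summed with the prediction block.

```python
def reconstruct_block(pred, err, inside):
    """pred, err: 2-D lists of equal size; inside[y][x] is True for samples
    within the effective picture area."""
    return [[pred[y][x] + (err[y][x] if inside[y][x] else 0)   # error outside the area is set to 0
             for x in range(len(pred[0]))]
            for y in range(len(pred))]
```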
  • the various embodiments may provide advantages. For example, the embodiments improve the compression efficiency of coding of pseudo-cylindrically projected spherical images. This means that the present embodiments reduce the bitrate while keeping the picture quality unchanged. In addition, the present embodiments reduce the visible artefacts of the boundary areas of the pseudo-cylindrically projected spherical images.
  • a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device such as a server, may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.

Abstract

A method and an apparatus for encoding and decoding 360-degree panoramic images and video. A non-rectangular effective picture area representing 360-degree panorama picture is determined; a mapped 360-degree panorama picture covering the effective picture area is obtained; a non-rectangular and at least partially block-aligned reshaped picture from the mapped 360-degree panorama picture is obtained using a first processing. The first processing comprises identifying a boundary block containing a boundary of the effective picture area; removing zero or more samples of the boundary block; setting sample values of samples in the boundary block and outside the effective picture area by copying sample values from an opposite-side boundary region of the effective picture area and/or moving samples from an opposite-side boundary region of the effective picture area; and using a second processing to the reshaped picture to obtain a rectangular picture and encoding the rectangular picture into a bitstream.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR CODING A 360-DEGREE PANORAMIC IMAGES AND VIDEO
TECHNICAL FIELD
The present embodiments relate to coding of 360-degree panoramic images and video.
BACKGROUND
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
360-degree panoramic images and video cover horizontally the full 360-degree field-of-view around the capturing position. 360-degree panoramic video content can be acquired e.g. by stitching pictures of more than one camera sensor to a single 360-degree panoramic image. Also, a single image sensor can be used with an optical arrangement to generate 360-degree panoramic image.
SUMMARY
Some embodiments provide a method and an apparatus for implementing the method for encoding and decoding 360-degree panoramic images and video.
Various aspects of examples of the invention are provided in the detailed description.
According to a first aspect, there is provided a method comprising determining an effective picture area representing 360-degree panorama picture, the effective picture area being non-rectangular; obtaining a mapped 360-degree panorama picture covering the effective picture area; obtaining a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following; removing zero or more samples of the boundary block and within the effective picture area; setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; using a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; encoding the rectangular picture into a bitstream.
According to an embodiment, the effective picture area is determined by a mapping applied to a source picture.
According to an embodiment, said source picture is an equirectangular 360-degree picture.
According to an embodiment, said source picture is a spherical picture.
According to an embodiment, said mapping is a pseudo-cylindrical mapping.
According to an embodiment, said mapping is specified as a mathematical function.
According to an embodiment, the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
According to an embodiment, the method further comprises encoding one or more indications into the bitstream, the one or more indications indicative of one or more of the following: the second processing; the first processing; the effective picture area; the encoded rectangular picture representing a 360-degree panorama picture.
According to an embodiment, said second processing comprises: identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture; setting sample values of samples adjacent to the second boundary block and outside block-aligned non-rectangular picture in one of the following ways: extrapolating boundary sample values of the second boundary block; and deriving sample values at least partially from the second boundary block.
According to an embodiment, said extrapolating matches with an intra prediction.
According to an embodiment, the intra prediction is a horizontal intra prediction.
According to a second aspect, there is provided a method comprising: determining a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular; determining a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decoding a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtaining a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using a first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
According to an embodiment, the determined value is a value representing black.
According to an embodiment, the first effective picture area is determined by a mapping applied to a source picture.
According to an embodiment, the source picture is an equirectangular 360-degree picture.
According to an embodiment, the source picture is a spherical picture.
According to an embodiment, the mapping is a pseudo-cylindrical mapping.
According to an embodiment, the said mapping is specified as a mathematical function.
According to an embodiment, the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
According to an embodiment, the method further comprises decoding one or more indications from the bitstream, the one or more indications indicative of one or more of the following: the first processing; the first effective picture area; the second effective picture area; the rectangular picture representing a 360- degree panorama picture.
According to a third aspect, there is provided an apparatus comprising at least one processor; at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine an effective picture area representing 360-degree panorama picture, the effective picture area being non-rectangular; obtain a mapped 360-degree panorama picture covering the effective picture area; obtain a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non- rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following: removing zero or more samples of the boundary block and within the effective picture area: setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; use a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; encode the rectangular picture into a bitstream.
According to an embodiment, the effective picture area is determined by a mapping applied to a source picture.
According to an embodiment, said source picture is an equirectangular 360-degree picture.
According to an embodiment, said source picture is a spherical picture.
According to an embodiment, said mapping is a pseudo-cylindrical mapping.
According to an embodiment, said mapping is specified as a mathematical function.
According to an embodiment, the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
According to an embodiment, the apparatus further comprises computer program code to cause the apparatus to encode one or more indications into the bitstream, the one or more indications indicative of one or more of the following: the second processing; the first processing; the effective picture area; the encoded rectangular picture representing a 360-degree panorama picture.
According to an embodiment, said second processing comprises: identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture; setting sample values of samples adjacent to the second boundary block and outside block-aligned non-rectangular picture in one of the following ways: extrapolating boundary sample values of the second boundary block; and deriving sample values at least partially from the second boundary block.
According to an embodiment, said extrapolating matches with an intra prediction.
According to an embodiment, the intra prediction is a horizontal intra prediction.
According to a fourth aspect, there is provided an apparatus comprising at least one processor; at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular; determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using a first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
According to an embodiment, the determined value is a value representing black color.
According to an embodiment, the first effective picture area is determined by a mapping applied to a source picture.
According to an embodiment, the source picture is an equirectangular 360-degree picture.
According to an embodiment, the source picture is a spherical picture.
According to an embodiment, the mapping is a pseudo-cylindrical mapping.
According to an embodiment, the said mapping is specified as a mathematical function.
According to an embodiment, the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
According to an embodiment, the apparatus further comprises computer program code to cause the apparatus to decode one or more indications from the bitstream, the one or more indications indicative of one or more of the following: the first processing; the first effective picture area; the second effective picture area; the rectangular picture representing a 360-degree panorama picture.
According to a fifth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: determine an effective picture area representing 360- degree panorama picture, the effective picture area being non-rectangular; obtain a mapped 360-degree panorama picture covering the effective picture area; obtain a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising: identifying a boundary block containing a boundary of the effective picture area; performing one or more of the following: removing zero or more samples of the boundary block and within the effective picture area: setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: copying sample values from an opposite-side boundary region of the effective picture area; moving samples from an opposite-side boundary region of the effective picture area; use a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; encode the rectangular picture into a bitstream.
According to a sixth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: determine a first effective picture area representing 360- degree panorama picture, the first effective picture area being non-rectangular; determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using a first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more complete understanding of example embodiments of the present invention, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
Figure 1 illustrates a block diagram of a video coding system according to an embodiment;
Figure 2 illustrates a layout of an apparatus according to an embodiment;
Figure 3 illustrates an arrangement for video coding comprising a plurality of apparatuses, networks and network elements according to an example embodiment;
Figure 4 illustrates a block diagram of a video encoder according to an embodiment;
Figure 5 illustrates a block diagram of a video decoder according to an embodiment;
Figure 6 illustrates an example of a pseudo-cylindrical spherical image on a rectangular block grid;
Figure 7 illustrates an example of selecting a boundary block for boundary block removal;
Figure 8 illustrates an example of a reshaped picture resulting from the removal of the boundary block of Figure 7;
Figure 9 illustrates an example of a reshaped picture resulting from the copying of the right side boundary area to fill the left-side boundary block;
Figure 10 illustrates an example of a reshaped picture resulting from the moving of the right side boundary area to fill in a left-side boundary block;
Figure 11 illustrates an example of a reshaped picture resulting from the moving of the right side boundary area so that it becomes block aligned;
Figure 12 is a flowchart illustrating an encoding method according to an embodiment; and
Figure 13 is a flowchart illustrating a decoding method according to another embodiment.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
The present application relates to 360-panoramic video content, the amount of which is rapidly increasing due to dedicated devices and software for capturing and/or creating 360-panoramic video content. An embodiment of an apparatus for capturing and/or creating 360-panoramic video content is illustrated in Figures 1 and 2. The apparatus 50 is an electronic device for example a mobile terminal or a user equipment of a wireless communication system or a camera device. The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32, for example, a liquid crystal display or any other display technology capable of displaying images and/or videos. The apparatus 50 may further comprise a keypad 34. According to another embodiment, any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device, which may be any of the following: an earpiece 38, a speaker or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (according to another embodiment, the device may be powered by any suitable mobile energy device, such as solar cell, fuel cell or clockwork generator). The apparatus may comprise a camera 42 capable of recording or capturing images and/or video, or may be connected to one. The camera 42 may be capable of capturing a 360-degree field-of-view horizontally and/or vertically for example by using a parabolic mirror arrangement with a conventional two-dimensional color image sensor or by using several wide field-of-view lenses and/or several color image sensors. The camera 42 or the camera to which the apparatus is connected may in essence comprise of several cameras. According to an embodiment, the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. According to an embodiment, the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired solution.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus. The controller 56 may be connected to memory 58 which, according to an embodiment, may store both data in the form of image and audio data and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to video codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in encoding and/or decoding carried out by the controller 56.
A video codec circuitry 54 may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that is able to uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at a lower bitrate). Figure 4 illustrates an example of a video encoder, where In: Image to be encoded; P'n: Predicted representation of an image block; Dn: Prediction error signal; D'n: Reconstructed prediction error signal; I'n: Preliminary reconstructed image; R'n: Final reconstructed image; T, T-1: Transform and inverse transform; Q, Q-1: Quantization and inverse quantization; E: Entropy encoding; RFM: Reference frame memory; Pinter: Inter prediction; Pintra: Intra prediction; MS: Mode selection; F: Filtering. Figure 5 illustrates a block diagram of a video decoder where P'n: Predicted representation of an image block; D'n: Reconstructed prediction error signal; I'n: Preliminary reconstructed image; R'n: Final reconstructed image; T-1: Inverse transform; Q-1: Inverse quantization; E-1: Entropy decoding; RFM: Reference frame memory; P: Prediction (either inter or intra); F: Filtering. In some embodiments, the apparatus 50 (Figures 1 and 2) comprises only an encoder or a decoder, and in some other embodiments the apparatus 50 comprises both.
Referring again to Figures 1 and 2. The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
According to an embodiment, the apparatus 50 comprises a camera 42 capable of recording or detecting individual frames which are then passed to the codec 54 or controller for processing. According to an embodiment, the apparatus may receive the video image data for processing from another device prior to transmission and/or storage. According to an embodiment, the apparatus 50 may receive the images for processing either wirelessly or by a wired connection.
Figure 3 shows a system configuration comprising a plurality of apparatuses, networks and network elements according to an embodiment. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM, UMTS, CDMA network, etc.), a wireless local area network (WLAN), such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the internet.
The system 10 may include both wired and wireless communication devices or apparatus 50 suitable for implementing present embodiments. For example, the system shown in Figure 3 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and a digital camera 12. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport. Some or further apparatuses may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types. The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telephone system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11 and any similar wireless communication technology. A communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections or any suitable connection.
As mentioned, the present embodiments relate to coding of pseudo-cylindrically projected spherical images. With the market availability of Virtual Reality (VR) devices, the need for wide field-of-view (FOV) content is increasing. The most common devices for displaying VR images/videos are Head Mounted Displays (HMD). An HMD device may be considered to consist of a binocular display in front of the user's eyes. A motion sensor is embedded in the device in order to provide a suitable field-of-view based on the user's head motion. In order to cover the whole scene, 360-degree field-of-view content can be used.
360-degree panoramic content (i.e., images and video) cover horizontally the full 360-degree field-of-view around the capturing position of an imaging device (e.g. a camera or an apparatus of Figure 1). The vertical field-of-view may vary and can be e.g. 180 degrees. Panoramic image covering 360-degree field-of-view horizontally and 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection. In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. In some cases panoramic content with 360-degree horizontal field-of-view but with less than 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two- dimensional image plane. In some cases a panoramic image may have less than 360-degree horizontal field- of-view and up to 180-degree vertical field-of-view, while otherwise has the characteristics of equirectangular projection format.
360-degree panoramic video content can be acquired by various means. For example, the pictures of more than one camera sensor can be stitched to a single 360-degree panoramic image. There are dedicated devices and camera rigs on the market for implementing this. Yet, as another non-limiting example, a 360-degree panoramic video content can be acquired by a single image sensor with an optical arrangement.
360-degree panoramic images can be captured by various means. The most common ways to obtain 360-degree field-of-view images are stitching together images from a multicamera setup, using catadioptric devices, and stitching multiple fisheye images.
One of the methods to create a 360-degree panoramic image is by using catadioptric devices. These devices consist of a conventional lens and a mirror. The mirror can have different convex shapes, e.g. parabolic, planar or spherical. The lens captures the light rays reflected via the mirror, which provides an image with a very high field-of-view.
Another method for 360-degree content creation is by using fisheye lenses. Due to the very limited field-of-view of conventional cameras, the proper way to create 360-degree images is by means of cameras which have a wide field-of-view. Fisheye lenses cover a very high field-of-view (usually more than 180 degrees). These lenses capture the light waves by using the refraction effect, since the light waves which have higher angles of incidence (at the image edges) are more curved in the image. One can create a 360-degree view by using projection and stitching techniques with only a few fisheye images. Fisheye lenses, due to their capturing structure, introduce some distortions to the acquired images. Straight lines near the image boundaries become curved, and this artifact makes the content not very suitable for use in display devices, e.g. HMDs (Head Mounted Displays). Usually some projections are used in order to correct the distortion and to create 360-degree images by stitching different views. A family of pseudo-cylindrical projections attempts to minimize the distortion of the polar regions of the cylindrical projections, such as the equirectangular projection, by bending the meridians toward the center of the map as a function of longitude while maintaining the cylindrical characteristic of parallel parallels. Pseudo-cylindrical projections result in non-rectangular contiguous 2D images representing the projected sphere. However, it is possible to present pseudo-cylindrical projections in interrupted forms that are made by joining several regions with appropriate central meridians and false easting and clipping boundaries. Pseudo-cylindrical projections may be categorized based upon the shape of the meridians into sinusoidal, elliptical, parabolic, hyperbolic, rectilinear and miscellaneous pseudo-cylindrical projections. An additional characterization is based upon whether the meridians come to a point at the pole or are terminated along a straight line (in which case the projection represents less than 180 degrees vertically). Certain pseudo-cylindrical projections result in an ellipse with a ratio of 2:1, but generally pseudo-cylindrical projections may result in pictures of another non-rectangular shape. The benefits of pseudo-cylindrical projections over cylindrical projections include that they preserve the image content locally and avoid over-stretching of the polar areas. Moreover, images are represented by fewer pixels compared to respective cylindrically projected images (e.g. equirectangular panorama images) due to the fact that the polar areas are not stretched.
The present embodiments are designed to operate with pseudo-cylindrically projected spherical images. However, the present embodiments can be applied also with other projections having the following characteristics: i) projected images are non-rectangular; and ii) a first boundary pixel on a first boundary of a projected image is, in the spherical domain, adjacent to a second boundary pixel at the opposite boundary of the projected image.
An example of another projection that can be used with the present embodiments is a stereographic projection of a sphere. A physical model of stereographic projections is to imagine a transparent sphere sitting on a plane. If one calls a point at which the sphere touches the plane at the south pole, then a light source can be placed at the north pole. Each ray from the light passes through a point on the sphere and then strikes the plane. This is the stereographic projection of the point on the sphere. The present embodiments can also be applied to image(s) resulting from a combination of projections where the combination has the following characteristics: i) projected images are non-rectangular; and ii) a first boundary pixel on a first boundary of a projected image is, in the spherical domain, adjacent to a second boundary pixel at the opposite boundary of the projected image. For example, a pseudo-cylindrical spherical projection can be applied to the top and bottom parts of the picture and an equirectangular projection can be applied to the vertically middle part of the picture. For example, the present embodiments can be applied to picture(s) where the vertically top quarter is pseudo-cylindrically projected, the vertically bottom quarter is also pseudo-cylindrically projected, and the vertically middle part comprises an equirectangular panorama. Video coding of related art is described next.
The H.264/AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC). There have been multiple versions of the H.264/AVC standard, integrating new extensions or features to the specification. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
Version 1 of the High Efficiency Video Coding (H.265/HEVC a.k.a. HEVC) standard was developed by the Joint Collaborative Team - Video Coding (JCT-VC) of VCEG and MPEG. The standard was published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). Version 2 of H.265/HEVC included scalable, multiview, and fidelity range extensions, which may be abbreviated SHVC, MV-HEVC, and REXT, respectively. Version 2 of H.265/HEVC was published as ITU-T Recommendation H.265 (10/2014) and as Edition 2 of ISO/IEC 23008-2. There are currently ongoing standardization projects to develop further extensions to H.265/HEVC, including three- dimensional and screen content coding extensions, which may be abbreviated 3D-HEVC and SCC, respectively.
SHVC, MV-HEVC, and 3D-HEVC use a common basis specification, specified in Annex F of the version 2 of the HEVC standard. This common basis comprises for example high-level syntax and semantics e.g. specifying some of the characteristics of the layers of the bitstream, such as inter-layer dependencies, as well as decoding processes, such as reference picture list construction including inter-layer reference pictures and picture order count derivation for multi-layer bitstream. Annex F may also be used in potential subsequent multi-layer extensions of HEVC. It is to be understood that even though a video encoder, a video decoder, encoding methods, decoding methods, bitstream structures, and/or embodiments may be described in the following with reference to specific extensions, such as SHVC and/or MV-HEVC, they are generally applicable to any multi-layer extensions of HEVC, and even more generally to any multilayer video coding scheme.
Many hybrid video codecs, including H.264/AVC and HEVC, encode video information in two phases. In the first phase, predictive coding is applied for example as so-called sample prediction and/or so-called syntax prediction.
In the sample prediction, pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:
Motion compensation mechanisms (which may also be referred to as temporal prediction or motion-compensated temporal prediction or motion-compensated prediction or MCP or inter prediction), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded.
Intra prediction, where pixel or sample values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship. More generally, intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
In many video codecs, including H.264/AVC and HEVC, motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or picture). H.264/AVC and HEVC, as many other video compression standards, divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
In the syntax prediction, which may also be referred to as parameter prediction, syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier. Non-limiting examples of syntax prediction are provided below:
In motion vector prediction, motion vectors e.g. for inter and/or inter- view prediction may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of previously coded/decoded picture can be predicted. The reference index may be predicted from adjacent blocks and/or co-located blocks in temporal reference picture. Differential coding of motion vectors may be disabled across slice boundaries. The block partitioning, e.g. from CTU to CUs and down to PUs, may be predicted.
In filter parameter prediction, the filtering parameters e.g. for sample adaptive offset may be predicted.
Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation. Inter prediction may sometimes be considered to only include motion-compensated temporal prediction, while it may sometimes be considered to include all types of prediction where a reconstructed/decoded block of samples is used as prediction source, therefore including conventional inter-view prediction for example. Inter prediction may be considered to comprise only sample prediction but it may alternatively be considered to comprise both sample and syntax prediction. As a result of syntax and sample prediction, a predicted block of pixels or samples may be obtained. Prediction approaches using image information within the same image can also be called intra prediction methods.
The second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples. This may be accomplished by transforming the difference in pixel or sample values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized and entropy coded. In some coding schemes, an encoder can indicate, e.g. on transform unit basis, to bypass the transform and code a prediction error block in the sample domain.
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying pixel or sample prediction and error decoding processes the decoder combines the prediction and the prediction error signals (the pixel or sample values) to form the output video frame. The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming pictures in the video sequence. The filtering may for example include one or more of the following: deblocking, sample adaptive offset (SAO), and/or adaptive loop filtering (ALF).
Block-based coding may create visible discontinuities at block boundaries of reconstructed or decoded pictures. Filtering on a boundary of a grid (e.g. a grid of 4x4 luma samples) is determined by an encoder and/or a decoder to be applied when a pre-defined (e.g. in a coding standard) and/or signaled set of conditions is fulfilled, such as the following:
the boundary is a block boundary, such as a prediction unit boundary or a transform unit boundary e.g. as specified for HEVC;
the boundary strength (see below) is significant or relevant, e.g. greater than zero; and
the variation of sample values on both sides of the boundary is below a specified threshold, wherein the threshold value may depend on e.g. a quantization parameter used in transform coding.
The boundary strength to be used in deblocking loop filtering can be determined based on several conditions and rules, such as one or more of the following or alike:
when at least one of the blocks adjacent to the boundary is intra-coded, the boundary strength can be set to be significant, such as 2;
when at least one of the blocks adjacent to the boundary has non-zero coded residual coefficients and the boundary is a TU boundary, the boundary strength can be set to be relevant, such as 1;
when the absolute differences between motion vectors of the two blocks adjacent to the boundary are greater than or equal to 1 in units of integer luma samples, the boundary strength can be set to be relevant, such as 1;
when different reference pictures are used for the motion vectors in the two blocks adjacent to the boundary, the boundary strength can be set to be relevant, such as 1;
when the number of motion vectors in the two blocks adjacent to the boundary differs, the boundary strength can be set to be relevant, such as 1;
otherwise, the boundary strength can be set to be insignificant (or irrelevant), such as 0.
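The rules above can be illustrated with a simplified, non-normative sketch. The Block structure, its attributes and the quarter-sample motion vector units below are assumptions made for illustration; actual encoders and decoders evaluate such rules per boundary segment of the filtering grid.

```python
# Simplified boundary-strength derivation following the rules above.
# The Block structure and helper names are illustrative, not a codec API.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Block:
    intra_coded: bool = False
    has_nonzero_residual: bool = False
    motion_vectors: List[Tuple[int, int]] = field(default_factory=list)  # quarter-sample units
    reference_pictures: List[int] = field(default_factory=list)          # picture identifiers

def boundary_strength(p: Block, q: Block, is_tu_boundary: bool) -> int:
    if p.intra_coded or q.intra_coded:
        return 2                                    # significant
    if is_tu_boundary and (p.has_nonzero_residual or q.has_nonzero_residual):
        return 1                                    # relevant
    if len(p.motion_vectors) != len(q.motion_vectors):
        return 1
    if p.reference_pictures != q.reference_pictures:
        return 1
    for (px, py), (qx, qy) in zip(p.motion_vectors, q.motion_vectors):
        # a difference of >= 1 integer luma sample equals 4 quarter-sample units
        if abs(px - qx) >= 4 or abs(py - qy) >= 4:
            return 1
    return 0                                        # insignificant

# Example: an intra block next to an inter block gives the strongest boundary.
print(boundary_strength(Block(intra_coded=True), Block(motion_vectors=[(0, 0)]), True))  # -> 2
```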
The deblocking loop filter may include multiple filtering modes or strengths, which may be adaptively selected based on the features of the blocks adjacent to the boundary, such as the quantization parameter value, and/or signaling included by the encoder in the bitstream. For example, the deblocking loop filter may comprise a normal filtering mode and a strong filtering mode, which may differ in terms of the number of filter taps (i.e. number of samples being filtered on both sides of the boundary) and/or the filter tap values. For example, filtering of two samples along both sides of the boundary may be performed with a filter having the impulse response of (3 7 9 -3)/16, when omitting the potential impact of a clipping operation.
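The quoted impulse response can be illustrated with the small sketch below. The sample values are made up, the clipping of the filtered result is omitted as noted above, and the exact alignment of the taps with particular sample positions on each side of the boundary is simplified for illustration.

```python
# Sketch: applying a 4-tap filter with impulse response (3 7 9 -3)/16 across a
# vertical block boundary. Clipping of the filtered value is omitted, as noted above.

def filter_sample(window):
    """window: four consecutive reconstructed samples centred near the boundary."""
    taps = (3, 7, 9, -3)
    return (sum(t * s for t, s in zip(taps, window)) + 8) >> 4   # divide by 16 with rounding

# p1 p0 | q0 q1 : two samples on each side of the boundary "|"
p1, p0, q0, q1 = 70, 72, 110, 112             # a visible step edge at the boundary
print(filter_sample((p1, p0, q0, q1)))         # the filtered value smooths the step
```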
An example of SAO is given next with reference to HEVC; however, SAO can be similarly applied to other coding schemes too. In SAO, a picture is divided into regions where a separate SAO decision is made for each region. In HEVC, the basic unit for adapting SAO parameters is CTU (therefore an SAO region is the block covered by the corresponding CTU).
In the SAO algorithm, samples in a CTU are classified according to a set of rules and each classified set of samples is enhanced by adding offset values. The offset values are signalled in the bitstream. There are two types of offsets: 1) band offset and 2) edge offset. For a CTU, either no SAO, band offset, or edge offset is employed. The choice of whether no SAO, band offset, or edge offset is to be used may be decided by the encoder with e.g. rate distortion optimization (RDO) and signaled to the decoder.

The adaptive loop filter (ALF) is another method to enhance the quality of the reconstructed samples. This may be achieved by filtering the sample values in the loop. In some embodiments the encoder determines which regions of the pictures are to be filtered and the filter coefficients based on e.g. RDO, and this information is signalled to the decoder.

The inter prediction process may be characterized using one or more of the following factors:
The accuracy of motion vector representation. For example, motion vectors may be of quarter-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter.
Block partitioning for inter prediction. Many coding standards, including H.264/AVC and HEVC, allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.
Number of reference pictures for inter prediction. The sources of inter prediction are previously decoded pictures. Many coding standards, including H.264/AVC and HEVC, enable storage of multiple reference pictures for inter prediction and selection of the used reference picture on a block basis. For example, reference pictures may be selected on macroblock or macroblock partition basis in H.264/AVC and on PU or CU basis in HEVC. Many coding standards, such as H.264/AVC and HEVC, include syntax structures in the bitstream that enable decoders to create one or more reference picture lists. A reference picture index to a reference picture list may be used to indicate which one of the multiple reference pictures is used for inter prediction for a particular block. A reference picture index may be coded by an encoder into the bitstream in selected inter coding modes or it may be derived (by an encoder and a decoder) for example using neighboring blocks in other inter coding modes.
Motion vector prediction. In order to represent motion vectors efficiently in bitstreams, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example, by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index may be predicted, e.g. from adjacent blocks and/or co-located blocks in a temporal reference picture. Differential coding of motion vectors may be disabled across slice boundaries.
Multi-hypothesis motion-compensated prediction. H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices. Individual blocks in B slices may be bi-predicted, uni-predicted or intra-predicted, and individual blocks in P slices may be uni-predicted or intra-predicted. The reference pictures for a bi-predictive picture may not be limited to be the subsequent picture and the previous picture in output order, but rather any reference picture may be used. In many coding standards, such as H.264/AVC and HEVC, one reference picture list, referred to as reference picture list 0, is constructed for P slices, and two reference picture lists, list 0 and list 1, are constructed for B slices. For B slices, prediction in the forward direction may refer to prediction from a reference picture in reference picture list 0, and prediction in the backward direction may refer to prediction from a reference picture in reference picture list 1, even though the reference pictures for prediction may not have any decoding or output order relation to each other or to the current picture.
Weighted prediction. Many coding standards use a prediction weight of 1 for prediction blocks of an inter (P) picture and 0.5 for each prediction block of a B picture (resulting in averaging). H.264/AVC allows weighted prediction for both P and B slices. In implicit weighted prediction, the weights are proportional to picture order counts (POC), while in explicit weighted prediction, prediction weights are explicitly indicated.
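As a simple illustration of the motion vector prediction factor listed above, the following sketch computes a median motion vector predictor from three neighbouring blocks and the resulting motion vector difference. The choice of neighbours (left, above, above-right) and the numeric values are assumptions for illustration rather than a definition of any particular standard.

```python
# Sketch: median motion vector predictor from spatially adjacent blocks, and
# differential coding of the actual motion vector against that predictor.

def median(values):
    return sorted(values)[len(values) // 2]

def median_mv_predictor(neighbour_mvs):
    """neighbour_mvs: list of (mvx, mvy) motion vectors of adjacent blocks."""
    xs = [mv[0] for mv in neighbour_mvs]
    ys = [mv[1] for mv in neighbour_mvs]
    return (median(xs), median(ys))

left, above, above_right = (4, -2), (6, 0), (3, -1)
pred = median_mv_predictor([left, above, above_right])
actual = (5, -1)
mvd = (actual[0] - pred[0], actual[1] - pred[1])   # motion vector difference coded in the bitstream
print(pred, mvd)                                    # -> (4, -1) (1, 0)
```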
The inter prediction process may involve referring to sample locations outside picture boundaries at least for (but not necessarily limited to) the following reasons:
Motion vectors may point to prediction blocks outside picture boundaries;
Motion vectors may point to a non-integer sample location for which the sample value is interpolated using filtering that takes input samples from locations that are outside picture boundaries.

A motion vector or a piece of motion information may be considered to comprise a horizontal motion vector component and a vertical motion vector component. Sometimes, a motion vector or a piece of motion information may be considered to comprise also information or identification of which reference picture is used.

A motion field associated with a picture may be considered to comprise a set of motion information produced for every coded block of the picture. A motion field may be accessible by coordinates of a block, for example. A motion field may be used for example in TMVP of HEVC or any other motion prediction mechanism where a source or a reference for prediction other than the current (de)coded picture is used. Different spatial granularity or units may be applied to represent and/or store a motion field. For example, a regular grid of spatial units may be used. For example, a picture may be divided into rectangular blocks of certain size (with the possible exception of blocks at the edges of the picture, such as on the right edge and the bottom edge). For example, the size of the spatial unit may be equal to the smallest size for which a distinct motion can be indicated by the encoder in the bitstream, such as a 4x4 block in luma sample units. For example, a so-called compressed motion field may be used, where the spatial unit may be equal to a pre-defined or indicated size, such as a 16x16 block in luma sample units, which size may be greater than the smallest size for indicating distinct motion. For example, an HEVC encoder and/or decoder may be implemented in a manner that a motion data storage reduction (MDSR) or motion field compression is performed for each decoded motion field (prior to using the motion field for any prediction between pictures). In an HEVC implementation, MDSR may reduce the granularity of motion data to 16x16 blocks in luma sample units by keeping the motion applicable to the top-left sample of the 16x16 block in the compressed motion field. The encoder may encode indication(s) related to the spatial unit of the compressed motion field as one or more syntax elements and/or syntax element values for example in a sequence-level syntax structure, such as a video parameter set or a sequence parameter set. In some (de)coding methods and/or devices, a motion field may be represented and/or stored according to the block partitioning of the motion prediction (e.g. according to prediction units of the HEVC standard). In some (de)coding methods and/or devices, a combination of a regular grid and block partitioning may be applied so that motion associated with partitions greater than a pre-defined or indicated spatial unit size is represented and/or stored associated with those partitions, whereas motion associated with partitions smaller than or unaligned with a pre-defined or indicated spatial unit size or grid is represented and/or stored for the pre-defined or indicated units.

Video encoders may utilize Lagrangian cost functions to find rate-distortion (RD) optimal coding modes, e.g. the desired macroblock mode and associated motion vectors.
This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
C = D + λR, (1)
where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
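A minimal sketch of the mode decision implied by equation (1) follows. The candidate modes, distortion values, rates and the value of λ are made-up numbers used purely to show how the cost function ranks the candidates.

```python
# Sketch: selecting the coding mode that minimizes the Lagrangian cost C = D + lambda * R.
# Distortion (D), rate (R) and lambda values are illustrative only.

def lagrangian_cost(distortion, rate, lam):
    return distortion + lam * rate

candidates = {
    "intra_dc":    {"D": 950.0, "R": 40},    # low rate, higher distortion
    "inter_16x16": {"D": 400.0, "R": 95},
    "inter_8x8":   {"D": 310.0, "R": 180},   # high rate, lowest distortion
}

lam = 3.5   # weighting factor tying distortion to rate
best = min(candidates, key=lambda m: lagrangian_cost(candidates[m]["D"], candidates[m]["R"], lam))
print(best)  # the mode with the smallest C = D + lambda*R for this lambda
```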
Some key definitions, bitstream and coding structures, and concepts of H.264/AVC and HEVC are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented. Some of the key definitions, bitstream and coding structures, and concepts of H.264/AVC are the same as in HEVC - hence, they are described below jointly. The aspects of the invention are not limited to H.264/AVC or HEVC, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized.

Similarly to many earlier video coding standards, the bitstream syntax and semantics as well as the decoding process for error-free bitstreams are specified in H.264/AVC and HEVC. The encoding process is not specified, but encoders must generate conforming bitstreams. Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD). The standards contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding is optional and no decoding process has been specified for erroneous bitstreams.

In the description of existing standards as well as in the description of example embodiments, a syntax element may be defined as an element of data represented in the bitstream. A syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.

In the description of existing standards as well as in the description of example embodiments, a phrase "by external means" or "through external means" may be used. For example, an entity, such as a syntax structure or a value of a variable used in the decoding process, may be provided "by external means" to the decoding process. The phrase "by external means" may indicate that the entity is not included in the bitstream created by the encoder, but rather conveyed externally from the bitstream for example using a control protocol. It may alternatively or additionally mean that the entity is not created by the encoder, but may be created for example in the player or decoding control logic or alike that is using the decoder. The decoder may have an interface for inputting the external means, such as variable values.

The elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture. The source and decoded pictures are each comprised of one or more sample arrays, such as one of the following sets of sample arrays:
Luma (Y) only (monochrome).
Luma and two chroma (YCbCr or YCgCo).
Green, Blue and Red (GBR, also known as RGB).
Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
In the following, these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr, regardless of the actual color representation method in use. The actual color representation method in use can be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC. A component may be defined as an array or single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that compose a picture in monochrome format.

In H.264/AVC and HEVC, a picture may either be a frame or a field. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or chroma sample arrays may be subsampled when compared to luma sample arrays. Chroma formats may be summarized as follows:
- In monochrome sampling there is only one sample array, which may be nominally considered the luma array.
In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
In 4:4:4 sampling when no separate color planes are in use, each of the two chroma arrays has the same height and width as the luma array.
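The chroma formats summarized above translate directly into chroma array dimensions, as the following small sketch shows; the picture size used in the example is arbitrary.

```python
# Sketch: chroma sample array dimensions implied by the chroma formats above,
# for a given luma array size.

def chroma_dimensions(luma_width, luma_height, chroma_format):
    if chroma_format == "monochrome":
        return None                                  # no chroma arrays
    if chroma_format == "4:2:0":
        return (luma_width // 2, luma_height // 2)   # half width, half height
    if chroma_format == "4:2:2":
        return (luma_width // 2, luma_height)        # half width, same height
    if chroma_format == "4:4:4":
        return (luma_width, luma_height)             # same width and height
    raise ValueError(chroma_format)

for fmt in ("monochrome", "4:2:0", "4:2:2", "4:4:4"):
    print(fmt, chroma_dimensions(1280, 720, fmt))
```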
In H.264/AVC and HEVC, it is possible to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream. When separate color planes are in use, each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
A partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
When describing the operation of HEVC encoding and/or decoding, the following terms may be used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
In some video codecs, such as High Efficiency Video Coding (HEVC) codec, video pictures are divided into coding units (CU) covering the area of the picture. A CU consists of one or more prediction units (PU) defining the prediction process for the samples within the CU and one or more transform units (TU) defining the prediction error coding process for the samples in the said CU. Typically, a CU consists of a square block of samples with a size selectable from a predefined set of possible CU sizes. A CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs. An LCU can be further split into a combination of smaller CUs, e.g. by recursively splitting the LCU and resultant CUs. Each resulting CU typically has at least one PU and at least one TU associated with it. Each PU and TU can be further split into smaller PUs and TUs in order to increase granularity of the prediction and prediction error coding processes, respectively. Each PU has prediction information associated with it defining what kind of a prediction is to be applied for the pixels within that PU (e.g. motion vector information for inter predicted PUs and intra prediction directionality information for intra predicted PUs).
Each TU can be associated with information describing the prediction error decoding process for the samples within the said TU (including e.g. DCT coefficient information). It is typically signalled at CU level whether prediction error coding is applied or not for each CU. In the case there is no prediction error residual associated with the CU, it can be considered there are no TUs for the said CU. The division of the image into CUs, and division of CUs into PUs and TUs is typically signalled in the bitstream allowing the decoder to reproduce the intended structure of these units.
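A toy sketch of the recursive splitting of an LCU/CTU into smaller CUs is given below. The split-decision function is a placeholder: a real encoder would make this decision with e.g. RDO, and the block sizes shown are merely an example.

```python
# Toy sketch of recursive quadtree splitting of an LCU/CTU into CUs.
# The split decision is a placeholder for what an encoder would decide with e.g. RDO.

def split_lcu(x, y, size, min_cu_size, want_split):
    """Return the list of resulting CUs as (x, y, size) tuples."""
    if size > min_cu_size and want_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus += split_lcu(x + dx, y + dy, half, min_cu_size, want_split)
        return cus
    return [(x, y, size)]

# Example decision: keep splitting only the top-left quadrant of a 64x64 LCU.
decision = lambda x, y, size: x == 0 and y == 0
print(split_lcu(0, 0, 64, 8, decision))
```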
In HEVC, a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs. In HEVC, the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum. In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
Video coding standards and specifications may allow encoders to divide a coded picture into coded slices or alike. In-picture prediction is typically disabled across slice boundaries; in H.264/AVC and HEVC, in-picture prediction may be disabled across slice boundaries. Thus, slices can be regarded as a way to split a coded picture into independently decodable pieces, and slices are therefore often regarded as elementary units for transmission. In many cases, encoders may indicate in the bitstream which types of in-picture prediction are turned off across slice boundaries, and the decoder operation takes this information into account for example when concluding which prediction sources are available. For example, samples from a neighboring macroblock or CU may be regarded as unavailable for intra prediction, if the neighboring macroblock or CU resides in a different slice.
An elementary unit for the output of an H.264/AVC or HEVC encoder and the input of an H.264/AVC or HEVC decoder, respectively, is a Network Abstraction Layer (NAL) unit. For transport over packet- oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with startcode emulation prevention bytes. A raw byte sequence payload (RBSP) may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0. NAL units consist of a header and payload.
In HEVC, a two-byte NAL unit header is used for all specified NAL unit types. The NAL unit header contains one reserved bit, a six-bit NAL unit type indication, a three-bit nuh_temporal_id_plus1 indication for temporal level (may be required to be greater than or equal to 1) and a six-bit nuh_layer_id syntax element. The temporal_id_plus1 syntax element may be regarded as a temporal identifier for the NAL unit, and a zero-based TemporalId variable may be derived as follows: TemporalId = temporal_id_plus1 - 1. TemporalId equal to 0 corresponds to the lowest temporal level. The value of temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes. The bitstream created by excluding all VCL NAL units having a TemporalId greater than or equal to a selected value and including all other VCL NAL units remains conforming. Consequently, a picture having TemporalId equal to TID does not use any picture having a TemporalId greater than TID as inter prediction reference. A sub-layer or a temporal sub-layer may be defined to be a temporal scalable layer of a temporal scalable bitstream, consisting of VCL NAL units with a particular value of the TemporalId variable and the associated non-VCL NAL units. nuh_layer_id can be understood as a scalability layer identifier.
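The header layout described above can be illustrated with a parsing sketch. The field widths follow the text; the exact ordering of the fields within the two bytes follows the commonly documented HEVC layout and should be treated as illustrative here, as should the example byte values.

```python
# Sketch: parsing the two-byte HEVC NAL unit header described above and
# deriving the zero-based TemporalId = nuh_temporal_id_plus1 - 1.

def parse_nal_unit_header(byte0, byte1):
    header = (byte0 << 8) | byte1
    forbidden_zero_bit    = (header >> 15) & 0x1
    nal_unit_type         = (header >> 9)  & 0x3F   # six bits
    nuh_layer_id          = (header >> 3)  & 0x3F   # six bits
    nuh_temporal_id_plus1 =  header        & 0x7    # three bits, must be non-zero
    assert nuh_temporal_id_plus1 != 0, "zero would allow start code emulation"
    return {
        "forbidden_zero_bit": forbidden_zero_bit,
        "nal_unit_type": nal_unit_type,
        "nuh_layer_id": nuh_layer_id,
        "TemporalId": nuh_temporal_id_plus1 - 1,
    }

# Example header bytes (a non-VCL unit type, base layer, lowest temporal level).
print(parse_nal_unit_header(0x40, 0x01))
```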
NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. In H.264/AVC, coded slice NAL units contain syntax elements representing one or more coded macroblocks, each of which corresponds to a block of samples in the uncompressed picture. In HEVC, VCL NAL units contain syntax elements representing one or more CUs.
A non-VCL NAL unit may be for example one of the following types: a sequence parameter set, a picture parameter set, a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the reconstruction of decoded sample values.
Parameters that remain unchanged through a coded video sequence may be included in a sequence parameter set. In addition to the parameters that may be needed by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation. In HEVC a sequence parameter set RBSP includes parameters that can be referred to by one or more picture parameter set RBSPs or one or more SEI NAL units containing a buffering period SEI message. A picture parameter set contains such parameters that are likely to be unchanged in several coded pictures. A picture parameter set RBSP may include parameters that can be referred to by the coded slice NAL units of one or more coded pictures.
In HEVC, a video parameter set (VPS) may be defined as a syntax structure containing syntax elements that apply to zero or more entire coded video sequences as determined by the content of a syntax element found in the SPS referred to by a syntax element found in the PPS referred to by a syntax element found in each slice segment header. A video parameter set RBSP may include parameters that can be referred to by one or more sequence parameter set RBSPs.
The relationship and hierarchy between video parameter set (VPS), sequence parameter set (SPS), and picture parameter set (PPS) may be described as follows. VPS resides one level above SPS in the parameter set hierarchy and in the context of scalability and/or 3D video. VPS may include parameters that are common for all slices across all (scalability or view) layers in the entire coded video sequence. SPS includes the parameters that are common for all slices in a particular (scalability or view) layer in the entire coded video sequence, and may be shared by multiple (scalability or view) layers. PPS includes the parameters that are common for all slices in a particular layer representation (the representation of one scalability or view layer in one access unit) and are likely to be shared by all slices in multiple layer representations.
H.264/AVC and HEVC syntax allows many instances of parameter sets, and each instance is identified with a unique identifier. In order to limit the memory usage needed for parameter sets, the value range for parameter set identifiers has been limited. In H.264/AVC and HEVC, each slice header includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices. Instead, it is sufficient that the active sequence and picture parameter sets are received at any moment before they are referenced, which allows transmission of parameter sets "out-of-band" using a more reliable transmission mechanism compared to the protocols used for the slice data. For example, parameter sets can be included as a parameter in the session description for Real-time Transport Protocol (RTP) sessions. If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
Out-of-band transmission, signaling or storage can additionally or alternatively be used for other purposes than tolerance against transmission errors, such as ease of access or session negotiation. For example, a sample entry of a track in a file conforming to the ISOBMFF may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file. The phrase along the bitstream (e.g. indicating along the bitstream) may be used in claims and described embodiments to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
There may be different types of intra prediction modes available in a coding scheme, out of which an encoder can select and indicate the used one, e.g. on block or coding unit basis. A decoder may decode the indicated intra prediction mode and reconstruct the prediction block accordingly. For example, several angular intra prediction modes, each for a different angular direction, may be available. Angular intra prediction may be considered to extrapolate the border samples of adjacent blocks along a linear prediction direction. Additionally or alternatively, a planar prediction mode may be available. Planar prediction may be considered to essentially form a prediction block, in which each sample of the prediction block may be specified to be an average of the vertically aligned sample in the adjacent sample column on the left of the current block and the horizontally aligned sample in the adjacent sample line above the current block. Additionally or alternatively, a DC prediction mode may be available, in which the prediction block is essentially an average sample value of a neighboring block or blocks.
H.265/HEVC includes two motion vector prediction schemes, namely the advanced motion vector prediction (AMVP) and the merge mode. In the AMVP or the merge mode, a list of motion vector candidates is derived for a PU. There are two kinds of candidates: spatial candidates and temporal candidates, where temporal candidates may also be referred to as TMVP candidates.
One of the candidates in the merge list and/or the candidate list for AMVP or any similar motion vector candidate list may be a TMVP candidate or alike, which may be derived from the collocated block within an indicated or inferred reference picture, such as the reference picture indicated for example in the slice header. In HEVC, the reference picture list to be used for obtaining a collocated partition is chosen according to the collocated_from_l0_flag syntax element in the slice header. When the flag is equal to 1, it specifies that the picture that contains the collocated partition is derived from list 0, otherwise the picture is derived from list 1. When collocated_from_l0_flag is not present, it is inferred to be equal to 1. The collocated_ref_idx in the slice header specifies the reference index of the picture that contains the collocated partition. When the current slice is a P slice, collocated_ref_idx refers to a picture in list 0. When the current slice is a B slice, collocated_ref_idx refers to a picture in list 0 if collocated_from_l0_flag is 1, otherwise it refers to a picture in list 1. collocated_ref_idx always refers to a valid list entry, and the resulting picture is the same for all slices of a coded picture. When collocated_ref_idx is not present, it is inferred to be equal to 0. In HEVC the so-called target reference index for temporal motion vector prediction in the merge list is set as 0 when the motion coding mode is the merge mode. When the motion coding mode in HEVC utilizing the temporal motion vector prediction is the advanced motion vector prediction mode, the target reference index values are explicitly indicated (e.g. per each PU).
In HEVC, the availability of a candidate predicted motion vector (PMV) may be determined as follows (both for spatial and temporal candidates) (SRTP = short-term reference picture, LRTP = long-term reference picture):
[Table not reproduced: availability conditions of the candidate PMV for the combinations of short-term and long-term reference pictures of the current block and the candidate block.]
In HEVC, when the target reference index value has been determined, the motion vector value of the temporal motion vector prediction may be derived as follows: The motion vector PMV at the block that is collocated with the bottom-right neighbor of the current prediction unit is obtained. The picture where the collocated block resides may be e.g. determined according to the signalled reference index in the slice header as described above. If the PMV at bottom-right neighbor is not available, the motion vector PMV at the location of the current PU of the collocated picture is obtained. The determined available motion vector PMV at the co-located block is scaled with respect to the ratio of a first picture order count difference and a second picture order count difference. The first picture order count (POC) difference is derived between the picture containing the co-located block and the reference picture of the motion vector of the co-located block. The second picture order count difference is derived between the current picture and the target reference picture. If one but not both of the target reference picture and the reference picture of the motion vector of the collocated block is a long-term reference picture (while the other is a short-term reference picture), the TMVP candidate may be considered unavailable. If both of the target reference picture and the reference picture of the motion vector of the collocated block are long-term reference pictures, no POC- based motion vector scaling may be applied.
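The POC-based scaling step described above can be illustrated as follows. The POC values and motion vector are made up, and the clipping ranges and fixed-point arithmetic used in actual specifications are omitted.

```python
# Sketch: scaling a collocated block's motion vector for TMVP according to the
# ratio of the two picture order count (POC) differences described above.

def scale_tmvp(mv, poc_colocated, poc_colocated_ref, poc_current, poc_target_ref):
    first_diff  = poc_colocated - poc_colocated_ref   # collocated picture vs its reference
    second_diff = poc_current   - poc_target_ref      # current picture vs target reference
    scale = second_diff / first_diff
    return (round(mv[0] * scale), round(mv[1] * scale))

# Collocated picture at POC 8 pointing 4 pictures back; current picture at POC 9
# with its target reference at POC 8 -> the motion is scaled down by a factor of 4.
print(scale_tmvp((16, -8), 8, 4, 9, 8))   # -> (4, -2)
```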
In several video coding standards, such as H.263, H.264/AVC and H.265/HEVC, motion vectors are allowed to point to an area outside the picture boundaries for obtaining the prediction block and fractional sample interpolation may use sample locations outside picture boundaries. For obtaining samples outside picture boundaries as part of the prediction process, the respective samples of the picture boundaries are effectively copied. The mechanism to support sample locations outside picture boundaries in the inter prediction process may be implemented in multiple ways. One way is to allocate a sample array that is larger than the decoded picture size, i.e. has margins on top of, below, on the right side, and on the left side of the image. In addition to or instead of using such margins, the location of a sample used for prediction (either as input to fractional sample interpolation for the prediction block or as a sample in the prediction block itself) may be saturated so that the location does not exceed the picture boundaries (with margins, if such are used). Some of the video coding standards describe the support of motion vectors over picture boundaries in such manner.
When encoding and/or decoding 360-degree panoramic video, samples outside picture boundaries can be used as reference due to motion vectors pointing outside the picture boundaries and/or due to fractional sample interpolation using sample values outside the picture boundaries, as described earlier. Thanks to the fact that the entire 360 degrees of field-of-view is represented, the sample values from the opposite side of the picture can be used instead of the conventional approach of using the boundary sample when a sample horizontally outside the picture boundary is needed in a prediction process. In addition to the above-mentioned two techniques (i.e. referring to samples outside picture boundaries and wrapping the horizontal sample location for referred samples), also a combination of these techniques can be utilized. The margins can be set e.g. to cover the largest prediction unit that refers to both samples inside decoded picture boundaries and outside decoded picture boundaries. Sample location wrapping is used for prediction units that are completely outside decoded picture boundaries. This combination method may enable faster memory access than the approach of only using wrapping of the sample location.
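The difference between the conventional boundary extension and the horizontal wrapping described above can be shown with a small sketch; the picture dimensions and sample coordinates are arbitrary, and only the handling of a single sample location is shown.

```python
# Sketch: resolving a reference sample location that falls outside the picture.
# Conventionally the location is saturated (the boundary sample is effectively copied);
# for 360-degree panoramic content the horizontal coordinate can instead be wrapped
# to the opposite side, since the full 360-degree field of view is represented.

def clamp(v, lo, hi):
    return max(lo, min(v, hi))

def reference_location(x, y, width, height, wrap_horizontal):
    if wrap_horizontal:
        x = x % width                       # take the sample from the opposite side
    else:
        x = clamp(x, 0, width - 1)          # conventional boundary extension
    y = clamp(y, 0, height - 1)             # vertical handling stays conventional here
    return x, y

W, H = 3840, 1920
print(reference_location(-3, 100, W, H, wrap_horizontal=False))   # -> (0, 100)
print(reference_location(-3, 100, W, H, wrap_horizontal=True))    # -> (3837, 100)
```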
As mentioned above, a motion vector can refer to a non-integer sample position in a reference picture. The sample values at a non-integer sample position can be obtained through a fractional sample interpolation process. A different process may be used for the luma sample array than for the chroma sample arrays. A fractional sample interpolation process for luma according to an example may operate as described next. The presented process is from HEVC and it needs to be understood that it is provided for exemplary purposes and that a similar process can be realized e.g. by changing the number of filter taps.
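A hedged sketch of such fractional sample interpolation is given below. The 8-tap coefficients shown are the commonly cited HEVC half-sample luma filter, but they should be treated here as illustrative rather than normative; only one horizontal filtering pass is shown, and picture-boundary handling is simplified to clamping.

```python
# Sketch: horizontal half-sample luma interpolation with a symmetric 8-tap FIR filter.
# Only the 1-D (horizontal) stage of a full 2-D interpolation is shown.

HALF_PEL_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)   # taps sum to 64

def clip(v, lo, hi):
    return max(lo, min(v, hi))

def interpolate_half_sample(samples, x):
    """Half-sample value between integer positions x and x+1 of a 1-D sample row."""
    window = [samples[clip(x - 3 + i, 0, len(samples) - 1)] for i in range(8)]
    acc = sum(t * s for t, s in zip(HALF_PEL_TAPS, window))
    return clip((acc + 32) >> 6, 0, 255)            # divide by 64 with rounding, clip to 8 bits

row = [50, 52, 60, 80, 120, 160, 180, 190, 195, 200]
print(interpolate_half_sample(row, 4))   # value halfway between row[4] and row[5]
```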
Scalable video coding may refer to coding structure where one bitstream can contain multiple representations of the content, for example, at different bitrates, resolutions or frame rates. In these cases the receiver can extract the desired representation depending on its characteristics (e.g. resolution that matches best the display device). Alternatively, a server or a network element can extract the portions of the bitstream to be transmitted to the receiver depending on e.g. the network characteristics or processing capabilities of the receiver. A meaningful decoded representation can be produced by decoding only certain parts of a scalable bit stream. A scalable bitstream typically consists of a "base layer" providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers. In order to improve coding efficiency for the enhancement layers, the coded representation of that layer typically depends on the lower layers. E.g. the motion and mode information of the enhancement layer can be predicted from lower layers. Similarly the pixel data of the lower layers can be used to create prediction for the enhancement layer.
In some scalable video coding schemes, a video signal can be encoded into a base layer and one or more enhancement layers. An enhancement layer may enhance, for example, the temporal resolution (i.e., the frame rate), the spatial resolution, or simply the quality of the video content represented by another layer or part thereof. Each layer together with all its dependent layers is one representation of the video signal, for example, at a certain spatial resolution, temporal resolution and quality level. In this document, we refer to a scalable layer together with all of its dependent layers as a "scalable layer representation". The portion of a scalable bitstream corresponding to a scalable layer representation can be extracted and decoded to produce a representation of the original signal at certain fidelity.
Scalability modes or scalability dimensions may include but are not limited to the following:
Quality scalability: Base layer pictures are coded at a lower quality than enhancement layer pictures, which may be achieved for example using a greater quantization parameter value (i.e., a greater quantization step size for transform coefficient quantization) in the base layer than in the enhancement layer.
Spatial scalability: Base layer pictures are coded at a lower resolution (i.e. have fewer samples) than enhancement layer pictures. Spatial scalability and quality scalability, particularly its coarse- grain scalability type, may sometimes be considered the same type of scalability.
Bit-depth scalability: Base layer pictures are coded at lower bit-depth (e.g. 8 bits) than enhancement layer pictures (e.g. 10 or 12 bits).
Dynamic range scalability: Scalable layers represent a different dynamic range and/or images obtained using a different tone mapping function and/or a different optical transfer function.
Chroma format scalability: Base layer pictures provide lower spatial resolution in chroma sample arrays (e.g. coded in 4:2:0 chroma format) than enhancement layer pictures (e.g. 4:4:4 format).
Color gamut scalability: enhancement layer pictures have a richer/broader color representation range than that of the base layer pictures - for example the enhancement layer may have UHDTV (ITU-R BT.2020) color gamut and the base layer may have the ITU-R BT.709 color gamut.
View scalability, which may also be referred to as multiview coding. The base layer represents a first view, whereas an enhancement layer represents a second view.
Depth scalability, which may also be referred to as depth-enhanced coding. A layer or some layers of a bitstream may represent texture view(s), while other layer or layers may represent depth view(s).
Region-of-interest scalability (as described below).
Interlaced-to-progressive scalability (also known as field-to-frame scalability): coded interlaced source content material of the base layer is enhanced with an enhancement layer to represent progressive source content.
- Hybrid codec scalability (also known as coding standard scalability): In hybrid codec scalability, the bitstream syntax, semantics and decoding process of the base layer and the enhancement layer are specified in different video coding standards. Thus, base layer pictures are coded according to a different coding standard or format than enhancement layer pictures. For example, the base layer may be coded with H.264/AVC and an enhancement layer may be coded with an HEVC multi-layer extension. An external base layer picture may be defined as a decoded picture that is provided by external means for the enhancement-layer decoding process and that is treated like a decoded base-layer picture for the enhancement layer decoding process. SHVC and MV-HEVC allow the use of external base layer pictures.

It should be understood that many of the scalability types may be combined and applied together. For example color gamut scalability and bit-depth scalability may be combined.
The term layer may be used in context of any type of scalability, including view scalability and depth enhancements. An enhancement layer may refer to any type of an enhancement, such as SNR, spatial, multiview, depth, bit-depth, chroma format, and/or color gamut enhancement. A base layer may refer to any type of a base video sequence, such as a base view, a base layer for SNR/spatial scalability, or a texture base view for depth-enhanced video coding.
Various technologies for providing three-dimensional (3D) video content are currently investigated and developed. It may be considered that in stereoscopic or two-view video, one video sequence or view is presented for the left eye while a parallel view is presented for the right eye. More than two parallel views may be needed for applications which enable viewpoint switching or for autostereoscopic displays which may present a large number of views simultaneously and let the viewers observe the content from different viewpoints.

A view may be defined as a sequence of pictures representing one camera or viewpoint. The pictures representing a view may also be called view components. In other words, a view component may be defined as a coded representation of a view in a single access unit. In multiview video coding, more than one view is coded in a bitstream. Since views are typically intended to be displayed on stereoscopic or multiview autostereoscopic display or to be used for other 3D arrangements, they typically represent the same scene and are content-wise partly overlapping although representing different viewpoints to the content. Hence, inter-view prediction may be utilized in multiview video coding to take advantage of inter-view correlation and improve compression efficiency. One way to realize inter-view prediction is to include one or more decoded pictures of one or more other views in the reference picture list(s) of a picture being coded or decoded residing within a first view. View scalability may refer to such multiview video coding or multiview video bitstreams, which enable removal or omission of one or more coded views, while the resulting bitstream remains conforming and represents video with a smaller number of views than originally.

Region of Interest (ROI) coding may be defined to refer to coding a particular region within a video at a higher fidelity. ROI scalability may be defined as a type of scalability wherein an enhancement layer enhances only part of a source picture for inter-layer prediction e.g. spatially, quality-wise, in bit-depth, and/or along other scalability dimensions. As ROI scalability may be used together with other types of scalabilities, it may be considered to form a different categorization of scalability types. There exist several different applications for ROI coding with different requirements, which may be realized by using ROI scalability. For example, an enhancement layer can be transmitted to enhance the quality and/or a resolution of a region in the base layer. A decoder receiving both enhancement and base layer bitstream might decode both layers and overlay the decoded pictures on top of each other and display the final picture.

Scalability may be enabled in two basic ways: either by introducing new coding modes for performing prediction of pixel values or syntax from lower layers of the scalable representation, or by placing the lower layer pictures into a reference picture buffer (e.g. a decoded picture buffer, DPB) of the higher layer. The first approach may be more flexible and thus may provide better coding efficiency in most cases. However, the second approach may be implemented efficiently with minimal changes to single layer codecs while still achieving majority of the coding efficiency gains available. The second approach may be called for example reference frame based scalability or high-level-syntax-only scalable video coding.
Essentially a reference frame based scalability codec may be implemented by utilizing the same hardware or software implementation for all the layers, just taking care of the DPB management by external means. A scalable video encoder for quality scalability (also known as Signal-to-Noise or SNR) and/or spatial scalability may be implemented as follows. For a base layer, a conventional non-scalable video encoder and decoder may be used. The reconstructed/decoded pictures of the base layer are included in the reference picture buffer and/or reference picture lists for an enhancement layer. In case of spatial scalability, the reconstructed/decoded base-layer picture may be upsampled prior to its insertion into the reference picture lists for an enhancement-layer picture. The base layer decoded pictures may be inserted into a reference picture list(s) for coding/decoding of an enhancement layer picture similarly to the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as an inter prediction reference and indicate its use with a reference picture index in the coded bitstream. The decoder decodes from the bitstream, for example from a reference picture index, that a base-layer picture is used as an inter prediction reference for the enhancement layer. When a decoded base-layer picture is used as the prediction reference for an enhancement layer, it is referred to as an inter-layer reference picture.
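A hedged sketch of the reference-index-based approach described above follows: the decoded base-layer picture is upsampled to the enhancement-layer resolution and placed into the enhancement layer's reference picture list alongside its own decoded pictures. The nearest-neighbour upsampling and the list contents are placeholders for whatever resampling filter and reference picture management a real codec would use.

```python
# Sketch of reference-index-based (high-level-syntax-only) scalability: an upsampled
# base-layer picture is added to the enhancement-layer reference picture list, so the
# encoder can select it with an ordinary reference picture index.

def upsample_nearest(picture, out_w, out_h):
    in_h, in_w = len(picture), len(picture[0])
    return [[picture[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

base_layer_decoded = [[10, 20], [30, 40]]            # tiny 2x2 "picture" for illustration
enh_w, enh_h = 4, 4

inter_layer_ref = upsample_nearest(base_layer_decoded, enh_w, enh_h)
enh_reference_list = ["EL picture POC 7", "EL picture POC 6", inter_layer_ref]
# Reference index 2 now selects the inter-layer reference picture.
print(len(enh_reference_list), inter_layer_ref[0])
```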
While the previous paragraph described a scalable video codec with two scalability layers with an enhancement layer and a base layer, it needs to be understood that the description can be generalized to any two layers in a scalability hierarchy with more than two layers. In this case, a second enhancement layer may depend on a first enhancement layer in encoding and/or decoding processes, and the first enhancement layer may therefore be regarded as the base layer for the encoding and/or decoding of the second enhancement layer. Furthermore, it needs to be understood that there may be inter-layer reference pictures from more than one layer in a reference picture buffer or reference picture lists of an enhancement layer, and each of these inter-layer reference pictures may be considered to reside in a base layer or a reference layer for the enhancement layer being encoded and/or decoded. Furthermore, it needs to be understood that other types of inter-layer processing than reference-layer picture upsampling may take place instead or additionally. For example, the bit-depth of the samples of the reference-layer picture may be converted to the bit-depth of the enhancement layer and/or the sample values may undergo a mapping from the color space of the reference layer to the color space of the enhancement layer.
A scalable video coding and/or decoding scheme may use multi-loop coding and/or decoding, which may be characterized as follows. In the encoding/decoding, a base layer picture may be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as a reference for inter-layer (or inter-view or inter-component) prediction. The reconstructed/decoded base layer picture may be stored in the DPB. An enhancement layer picture may likewise be reconstructed/decoded to be used as a motion-compensation reference picture for subsequent pictures, in coding/decoding order, within the same layer or as reference for inter-layer (or inter-view or inter-component) prediction for higher enhancement layers, if any. In addition to reconstructed/decoded sample values, syntax element values of the base/reference layer or variables derived from the syntax element values of the base/reference layer may be used in the inter-layer/inter-component/inter-view prediction.

Scalable and multiview extensions to the first version of the High Efficiency Video Coding (HEVC) standard were finalized in 2014. The scalable video coding extension (SHVC) provides a mechanism for offering spatial, bit-depth, color gamut, and quality scalability while exploiting the inter-layer redundancy. The multiview extension (MV-HEVC) enables coding of multiview video data suitable e.g. for stereoscopic displays. For MV-HEVC, the input multiview video sequences for encoding are typically captured by a number of cameras arranged in a row. The camera projection centers are typically collinear and equally distant from each neighbor and cameras typically point to the same direction. SHVC and MV-HEVC share the same high-level syntax and most parts of their decoding process are also identical, which makes it appealing to support both SHVC and MV-HEVC with the same codec implementation. SHVC and MV-HEVC were included in HEVC version 2.
In MV-HEVC, inter-view reference pictures can be included in the reference picture list(s) of the current picture being coded or decoded. SHVC uses multi-loop decoding operation. SHVC may be considered to use a reference index based approach, i.e. an inter-layer reference picture can be included in one or more reference picture lists of the current picture being coded or decoded (as described above).
For the enhancement layer coding, the concepts and coding tools of HEVC base layer may be used in SHVC, MV-HEVC, and/or alike. However, the additional inter-layer prediction tools, which employ already coded data (including reconstructed picture samples and motion parameters a.k.a motion information) in reference layer for efficiently coding an enhancement layer, may be integrated to SHVC, MV-HEVC, and/or alike codec.
As described previously, prediction methods applied for video and/or image coding and/or decoding may be categorized into sample prediction and syntax prediction. A complementary way of categorizing different types of prediction is to consider across which domains or scalability types the prediction crosses. This categorization may lead to one or more of the following types of prediction, which may also sometimes be referred to as prediction directions:
Temporal prediction e.g. of sample values or motion vectors from an earlier picture usually of the same scalability layer, view and component type (texture or depth).
- Inter-view prediction (which may be also referred to as cross-view prediction) referring to prediction taking place between view components usually of the same time instant or access unit and the same component type.
Inter-layer prediction referring to prediction taking place between layers usually of the same time instant, of the same component type, and of the same view.
- Inter-component prediction may be defined to comprise prediction of syntax element values, sample values, variable values used in the decoding process, or anything alike from a component picture of one type to a component picture of another type. For example, inter-component prediction may comprise prediction of a texture view component from a depth view component, or vice versa. In another example, inter-component prediction takes place from the luma component (or sample array) to the chroma components (or sample arrays).
Inter-layer prediction may be defined as prediction in a manner that is dependent on data elements (e.g., sample values or motion vectors) of reference pictures from a different layer than the layer of the current picture (being encoded or decoded). Many types of inter-layer prediction exist and may be applied in a scalable video encoder/decoder. The available types of inter-layer prediction may for example depend on the coding profile according to which the bitstream or a particular layer within the bitstream is being encoded or, when decoding, the coding profile that the bitstream or a particular layer within the bitstream is indicated to conform to. Alternatively or additionally, the available types of inter-layer prediction may depend on the types of scalability or the type of a scalable codec or video coding standard amendment (e.g. SHVC, MV-HEVC, or 3D-HEVC) being used.
The types of inter-layer prediction may comprise, but are not limited to, one or more of the following: inter- layer sample prediction, inter-layer motion prediction, inter-layer residual prediction. In inter-layer sample prediction, at least a subset of the reconstructed sample values of a source picture for inter-layer prediction are used as a reference for predicting sample values of the current picture. In inter-layer motion prediction, at least a subset of the motion vectors of a source picture for inter-layer prediction are used as a reference for predicting motion vectors of the current picture. Typically, predicting information on which reference pictures are associated with the motion vectors is also included in inter-layer motion prediction. For example, the reference indices of reference pictures for the motion vectors may be inter-layer predicted and/or the picture order count or any other identification of a reference picture may be inter-layer predicted. In some cases, inter-layer motion prediction may also comprise prediction of block coding mode, header information, block partitioning, and/or other similar parameters. In some cases, coding parameter prediction, such as inter-layer prediction of block partitioning, may be regarded as another type of inter- layer prediction. In inter-layer residual prediction, the prediction error or residual of selected blocks of a source picture for inter-layer prediction is used for predicting the current picture.
Inter-view prediction may be considered to be equivalent or similar to inter-layer prediction but apply between views rather than other scalability types or dimensions. Sometimes inter- view prediction may refer only to inter-view sample prediction, which is similar to motion-compensated temporal prediction but applies between views. Sometimes inter-view prediction may be considered to comprise all types of prediction that can take place between views, such as both inter-view sample prediction and inter-view motion prediction. In multiview-plus-depth coding, such as 3D-HEVC, cross-component inter-layer prediction may be applied, in which a picture of a first type, such as a depth picture, may affect the inter-layer prediction of a picture of a second type, such as a conventional texture picture. For example, disparity-compensated inter- layer sample value and/or motion prediction may be applied, where the disparity may be at least partially derived from a depth picture. The term view synthesis prediction may be used when a prediction block is constructed at least partly on the basis of associated depth or disparity information.
A direct reference layer may be defined as a layer that may be used for inter-layer prediction of another layer for which the layer is the direct reference layer. A direct predicted layer may be defined as a layer for which another layer is a direct reference layer. An indirect reference layer may be defined as a layer that is not a direct reference layer of a second layer but is a direct reference layer of a third layer that is a direct reference layer or indirect reference layer of a direct reference layer of the second layer for which the layer is the indirect reference layer. An indirect predicted layer may be defined as a layer for which another layer is an indirect reference layer. An independent layer may be defined as a layer that does not have direct reference layers. In other words, an independent layer is not predicted using inter-layer prediction. A non- base layer may be defined as any other layer than the base layer, and the base layer may be defined as the lowest layer in the bitstream. An independent non-base layer may be defined as a layer that is both an independent layer and a non-base layer. A source picture for inter-layer prediction may be defined as a decoded picture that either is, or is used in deriving, an inter-layer reference picture that may be used as a reference picture for prediction of the current picture. In multi-layer HEVC extensions, an inter-layer reference picture is included in an inter-layer reference picture set of the current picture. An inter-layer reference picture may be defined as a reference picture that may be used for inter-layer prediction of the current picture. In the coding and/or decoding process, the inter-layer reference pictures may be treated as long term reference pictures. A reference-layer picture may be defined as a picture in a direct reference layer of a particular layer or a particular picture, such as the current layer or the current picture (being encoded or decoded). A reference-layer picture may but need not be used as a source picture for inter-layer prediction. Sometimes, the terms reference-layer picture and source picture for inter-layer prediction may be used interchangeably.
A source picture for inter-layer prediction may be required to be in the same access unit as the current picture. In some cases, e.g. when no resampling, motion field mapping or other inter-layer processing is needed, the source picture for inter-layer prediction and the respective inter-layer reference picture may be identical. In some cases, e.g. when resampling is needed to match the sampling grid of the reference layer to the sampling grid of the layer of the current picture (being encoded or decoded), inter-layer processing is applied to derive an inter-layer reference picture from the source picture for inter-layer prediction. Examples of such inter-layer processing are described in the next paragraphs.
Inter-layer sample prediction may comprise resampling of the sample array(s) of the source picture for inter-layer prediction. The encoder and/or the decoder may derive a horizontal scale factor (e.g. stored in variable ScaleFactorHor) and a vertical scale factor (e.g. stored in variable ScaleFactorVer) for a pair of an enhancement layer and its reference layer, for example based on the reference layer location offsets for the pair. If either or both scale factors are not equal to 1, the source picture for inter-layer prediction may be resampled to generate an inter-layer reference picture for predicting the enhancement layer picture. The process and/or the filter used for resampling may be pre-defined for example in a coding standard and/or indicated by the encoder in the bitstream (e.g. as an index among pre-defined resampling processes or filters) and/or decoded by the decoder from the bitstream. A different resampling process may be indicated by the encoder and/or decoded by the decoder and/or inferred by the encoder and/or the decoder depending on the values of the scale factor. For example, when both scale factors are less than 1, a pre-defined downsampling process may be inferred; and when both scale factors are greater than 1, a pre-defined upsampling process may be inferred. Additionally or alternatively, a different resampling process may be indicated by the encoder and/or decoded by the decoder and/or inferred by the encoder and/or the decoder depending on which sample array is processed. For example, a first resampling process may be inferred to be used for luma sample arrays and a second resampling process may be inferred to be used for chroma sample arrays.
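As a non-normative illustration of the scale factor derivation and resampling selection described above, the following Python sketch computes the scale factors from the reference region size and the current picture size and infers a resampling process from them. The direction of the ratio (current over reference), the floating-point representation and the function names are assumptions made for clarity and do not correspond to the SHVC syntax or its fixed-point arithmetic.

def derive_scale_factors(ref_region_w, ref_region_h, cur_pic_w, cur_pic_h):
    # Ratio of the current (enhancement-layer) picture size to the reference
    # region size; the direction is chosen so that a factor greater than 1
    # corresponds to upsampling, matching the description above (assumption).
    return cur_pic_w / ref_region_w, cur_pic_h / ref_region_h

def infer_resampling_process(scale_factor_hor, scale_factor_ver):
    # Select a pre-defined resampling process depending on the scale factors.
    if scale_factor_hor == 1 and scale_factor_ver == 1:
        return "no resampling"            # source picture may act directly as the ILR picture
    if scale_factor_hor > 1 and scale_factor_ver > 1:
        return "pre-defined upsampling"
    if scale_factor_hor < 1 and scale_factor_ver < 1:
        return "pre-defined downsampling"
    return "resampling indicated in the bitstream"  # e.g. signalled by the encoder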
Resampling may be performed for example picture-wise (for the entire source picture for inter-layer prediction or reference region to be resampled), slice-wise (e.g. for a reference region corresponding to an enhancement layer slice) or block-wise (e.g. for a reference region corresponding to an enhancement layer coding tree unit). The resampling of a determined region (e.g. a picture, slice, or coding tree unit in an enhancement layer picture) of a source picture for inter-layer prediction may for example be performed by looping over all sample positions of the determined region and performing a sample-wise resampling process for each sample position. However, it is to be understood that other possibilities for resampling a determined region exist - for example, the filtering of a certain sample location may use variable values of the previous sample location.
SHVC and MV-HEVC enable inter-layer sample prediction and inter-layer motion prediction. In the inter-layer sample prediction, the inter-layer reference (ILR) picture is used to obtain the sample values of a prediction block. In MV-HEVC, the source picture for inter-layer prediction acts, without modifications, as an ILR picture. In spatial and color gamut scalability of SHVC, inter-layer processing, such as resampling, is applied to the source picture for inter-layer prediction to obtain an ILR picture. In the resampling process of SHVC, the source picture for inter-layer prediction may be cropped, upsampled and/or padded to obtain an ILR picture. The relative position of the upsampled source picture for inter-layer prediction to the enhancement layer picture is indicated through so-called reference layer location offsets. This feature enables region-of-interest (ROI) scalability, in which only a subset of the picture area of the base layer is enhanced in an enhancement layer picture.
SHVC enables the use of weighted prediction or a color-mapping process based on a 3D lookup table (LUT) for (but not limited to) color gamut scalability. The 3D LUT approach may be described as follows. The sample value range of each color component may be first split into two ranges, forming up to 2x2x2 octants, and then the luma ranges can be further split into up to four parts, resulting in up to 8x2x2 octants. Within each octant, a cross color component linear model is applied to perform color mapping. For each octant, four vertices are encoded into and/or decoded from the bitstream to represent a linear model within the octant. The color-mapping table is encoded into and/or decoded from the bitstream separately for each color component. Color mapping may be considered to involve three steps: First, the octant to which a given reference-layer sample triplet (Y, Cb, Cr) belongs is determined. Second, the sample locations of luma and chroma may be aligned through applying a color component adjustment process. Third, the linear mapping specified for the determined octant is applied. The mapping may have a cross-component nature, i.e. an input value of one color component may affect the mapped value of another color component. Additionally, if inter-layer resampling is also required, the input to the resampling process is the picture that has been color-mapped. The color-mapping may (but need not) map samples of a first bit-depth to samples of another bit-depth.
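The octant selection and the per-octant linear cross-component model may be illustrated with the following Python sketch. The split of the luma range into eight parts and of each chroma range into two parts follows the description above, while the coefficient layout, the bit depth and the function names are illustrative assumptions rather than the SHVC colour-mapping syntax.

def find_octant(y, cb, cr, bit_depth=8, luma_splits=8):
    # Determine the octant to which the reference-layer sample triplet belongs.
    max_val = 1 << bit_depth
    oct_y = min(y * luma_splits // max_val, luma_splits - 1)
    oct_cb = 0 if cb < max_val // 2 else 1
    oct_cr = 0 if cr < max_val // 2 else 1
    return oct_y, oct_cb, oct_cr

def apply_linear_model(y, cb, cr, coeffs):
    # coeffs maps each output component to a (gain_y, gain_cb, gain_cr, offset)
    # tuple; the cross-component gains let one input affect another output.
    mapped = []
    for component in ("Y", "Cb", "Cr"):
        gain_y, gain_cb, gain_cr, offset = coeffs[component]
        mapped.append(gain_y * y + gain_cb * cb + gain_cr * cr + offset)
    return tuple(mapped)

def color_map_sample(y, cb, cr, lut):
    # lut maps an octant index to the linear model decoded for that octant.
    return apply_linear_model(y, cb, cr, lut[find_octant(y, cb, cr)])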
Inter-layer motion prediction may be realized as follows. A temporal motion vector prediction process, such as TMVP of H.265/HEVC, may be used to exploit the redundancy of motion data between different layers. This may be done as follows: when the source picture for inter-layer prediction is upsampled, the motion data of the source picture for inter-layer prediction is also mapped to the resolution of an enhancement layer in a process that may be referred to as motion field mapping (MFM). If the enhancement layer picture utilizes motion vector prediction from the base layer picture e.g. with a temporal motion vector prediction mechanism such as TMVP of H.265/HEVC, the corresponding motion vector predictor originates from the mapped reference-layer motion field. This way the correlation between the motion data of different layers may be exploited to improve the coding efficiency of a scalable video coder. In SHVC and/or alike, inter-layer motion prediction may be performed by setting the inter-layer reference picture as the collocated reference picture for TMVP derivation. Hence, the mapped motion field is the source of TMVP candidates in the motion vector prediction process. In spatial scalability of SHVC, motion field mapping (MFM) is used to obtain the motion information for the ILR picture from that of the base-layer picture, while if no spatial scalability applies between layers the mapped motion field is identical to that of the source picture for inter-layer prediction. In MFM, the prediction dependency in source pictures for inter-layer prediction is duplicated to generate the reference picture list(s) for ILR pictures, while the motion vectors (MV) are re-scaled according to the spatial resolution ratio between the ILR picture and the base-layer picture. In contrast, MFM is not applied in MV-HEVC to reference-view pictures referenced during the inter-layer motion prediction process.
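A rough sketch of motion field mapping is given below: the base-layer motion vectors are rescaled according to the resolution ratio between the inter-layer reference picture and the base-layer picture, while the reference picture dependency is duplicated. The rounding, the data layout of the motion field and the function names are assumptions for illustration, not the normative SHVC derivation.

def map_motion_vector(mv_x, mv_y, base_w, base_h, enh_w, enh_h):
    # Rescale a motion vector by the spatial resolution ratio between layers.
    return round(mv_x * enh_w / base_w), round(mv_y * enh_h / base_h)

def map_motion_field(base_motion_field, base_size, enh_size):
    # base_motion_field: {block_position: (mv_x, mv_y, ref_idx)}
    base_w, base_h = base_size
    enh_w, enh_h = enh_size
    mapped = {}
    for pos, (mv_x, mv_y, ref_idx) in base_motion_field.items():
        # The reference index (prediction dependency) is duplicated as such;
        # only the motion vector is rescaled.
        sx, sy = map_motion_vector(mv_x, mv_y, base_w, base_h, enh_w, enh_h)
        mapped[pos] = (sx, sy, ref_idx)
    return mapped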
The spatial correspondence of a reference-layer picture and an enhancement-layer picture may be inferred or may be indicated with one or more types of so-called reference layer location offsets. It may be allowed to indicate the spatial correspondence of two layers with reference layer location offsets, regardless of whether there is inter-layer prediction between the layers. In HEVC, reference layer location offsets may be included in the PPS by the encoder and decoded from the PPS by the decoder. Reference layer location offsets may be used for but are not limited to achieving ROI scalability. Reference layer location offsets may comprise one or more of scaled reference layer offsets, reference region offsets, and resampling phase sets. Scaled reference layer offsets may be considered to specify the horizontal and vertical offsets between the sample in the current picture that is collocated with the top-left luma sample of the reference region in a decoded picture in a reference layer and the top-left luma sample of the current picture, as well as the horizontal and vertical offsets between the sample in the current picture that is collocated with the bottom-right luma sample of the reference region in a decoded picture in a reference layer and the bottom-right luma sample of the current picture. Another way is to consider scaled reference layer offsets to specify the positions of the corner samples of the upsampled reference region relative to the respective corner samples of the enhancement layer picture. The scaled reference layer offset values may be signed. Reference region offsets may be considered to specify the horizontal and vertical offsets between the top-left luma sample of the reference region in the decoded picture in a reference layer and the top-left luma sample of the same decoded picture as well as the horizontal and vertical offsets between the bottom-right luma sample of the reference region in the decoded picture in a reference layer and the bottom-right luma sample of the same decoded picture. The reference region offset values may be signed. A resampling phase set may be considered to specify the phase offsets used in the resampling process of a source picture for inter-layer prediction. Different phase offsets may be provided for luma and chroma components.
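The role of reference layer location offsets may be illustrated with the following Python sketch, which derives the reference region within the reference-layer picture and the corresponding scaled region within the current picture from the offsets. The sign conventions and the use of plain luma-sample units are assumptions for illustration; they do not reproduce the HEVC syntax or its phase handling.

def reference_region(ref_pic_w, ref_pic_h, reference_region_offsets):
    # reference_region_offsets: signed (left, top, right, bottom) offsets, in luma
    # samples, between the reference region and the reference-layer picture corners.
    left, top, right, bottom = reference_region_offsets
    return left, top, ref_pic_w - right, ref_pic_h - bottom

def scaled_reference_region(cur_pic_w, cur_pic_h, scaled_ref_layer_offsets):
    # scaled_ref_layer_offsets: signed (left, top, right, bottom) offsets of the
    # upsampled reference region corners relative to the current picture corners.
    left, top, right, bottom = scaled_ref_layer_offsets
    return left, top, cur_pic_w - right, cur_pic_h - bottom

def region_size(region):
    x0, y0, x1, y1 = region
    return x1 - x0, y1 - y0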
Context-based Adaptive Binary Arithmetic Coding (CABAC), a type of entropy coder, is a lossless compression tool to code syntax elements (SEs). SEs are the information that describes how a video has been encoded and how it should be decoded. SEs are typically defined for all the prediction methods (e.g. CU/PU/TU partition, prediction type, intra prediction mode, motion vectors, etc.) and prediction error (residual) coding information (e.g. residual skip/split, transform skip/split, coefficient last x, coefficient last y, significant coefficient, etc.). For example, a large number of different SEs is defined in the HEVC standard. CABAC has the following steps (a simplified, non-normative sketch of these steps is given after the list): Binarization: Syntax elements are mapped to binary symbols (bins). Several different binarizations, such as Unary, Truncated Unary, Exp-Golomb, and fixed length (equal probability) binarization, can be used based on the expected statistics of the syntax element;
Context modelling: The probability of each bin is estimated based on its expected properties and previously coded bins using the same context. Bins with the same behavior and distribution can share the same context. Context is usually defined based on the syntax element, bin position in the syntax element, luma/chroma, block size, prediction mode, and/or neighboring information. There are about 200 contexts defined in the HEVC standard. During arithmetic coding, each context has a probability state table and determines the probability of the bin that is coded with that context. There are about 128 possible probability states defined in the probability state table of the HEVC standard;
Arithmetic coding: Bins are coded by arithmetic coding based on the corresponding estimated probabilities. In special cases, bins may be coded with equal probability of 50% (also known as "bypass" coding);
Probability update: Based on the current probability state variable of the context and the value of the coded bit, the probability state variable of the context is updated. For this purpose, a pre-defined update table has been defined in the HEVC standard.
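The CABAC steps listed above may be sketched as follows in Python. The sketch uses unary binarization, a per-context probability estimate with a simple exponential update, and an ideal code-length computation in place of a real renormalizing arithmetic coder; the adaptation constant and the class layout are assumptions and have no relation to the HEVC probability-state tables.

import math

class ContextModel:
    def __init__(self, p_one=0.5, adaptation=0.05):
        self.p_one = p_one            # estimated probability of a bin value of 1
        self.adaptation = adaptation  # assumed probability update rate

    def bits(self, bin_val):
        # Ideal code length of the bin under the current probability estimate.
        p = self.p_one if bin_val == 1 else 1.0 - self.p_one
        return -math.log2(max(p, 1e-6))

    def update(self, bin_val):
        # Probability update after the bin has been coded.
        target = 1.0 if bin_val == 1 else 0.0
        self.p_one += self.adaptation * (target - self.p_one)

def unary_binarize(value):
    return [1] * value + [0]

def code_syntax_element(value, contexts, context_for_bin):
    # Estimate the number of bits for coding 'value' with unary binarization;
    # context_for_bin returns a context key per bin position, or None for bypass.
    total = 0.0
    for idx, bin_val in enumerate(unary_binarize(value)):
        key = context_for_bin(idx)
        if key is None:
            total += 1.0                    # bypass coding: equal probability
        else:
            total += contexts[key].bits(bin_val)
            contexts[key].update(bin_val)
    return total

# Example use: the first two bins use dedicated contexts, the rest are bypass coded.
contexts = {0: ContextModel(), 1: ContextModel()}
bits = code_syntax_element(3, contexts, lambda i: i if i < 2 else None)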
A coding tool or mode called intra block copy (IBC) is similar to inter prediction but uses the current picture being encoded or decoded as a reference picture. Obviously, only the blocks coded or decoded before the current block being coded or decoded can be used as references for the prediction. The screen content coding (SCC) extension of HEVC is planned to include IBC.
As mentioned above, pseudo-cylindrically projected spherical images are represented by fewer pixels compared to respective cylindrically projected images (e.g. equirectangular panorama images) due to the fact that polar areas are not stretched. Due to fewer pixels, they may also compress better and are hence good candidates for panoramic image projection formats. However, the boundary of the effective picture area of pseudo-cylindrically projected spherical images is not rectangular and does not match the boundaries of a block grid used in the image/video encoding and decoding process. Blocks including the boundary of the effective picture area contain a sharp edge. Sharp edges are not favorable for image/video coding for example due to the following reasons:
- The intra prediction signal is typically not able to reproduce the sharp edge, causing the prediction error signal to be substantial and to comprise a sharp edge too.
- Sharp edges typically result in substantial high-frequency components being present in the transform blocks (of the prediction error signal).
- The high-frequency components cause an increase in the bit rate. Many coding schemes have been tuned with the expectation that high-frequency components are less likely and/or of a smaller magnitude than the low-frequency components. For example, the prediction error coding may signal the position of the last non-zero coefficient in zig-zag order (i.e. from low to high frequencies), so high-frequency components increase the amount of data to be coded.
- The quantization of high-frequency components causes visible artefacts, such as ringing, for the entire decoded block (particularly in the proximity of the sharp edge).
The present embodiments improve the compression efficiency of coding of pseudo-cylindrically projected images (i.e., reduce the bitrate while keeping the picture quality unchanged) and reduce the visible artefacts of the boundary areas of the pseudo-cylindrically projected spherical images. In a method according to an embodiment, a 360-degree panorama picture is received as an input. The panorama picture has a non-rectangular effective picture area. The picture area outside the effective picture area is absent or comprises ignorable sample values. The picture area outside the effective picture area may e.g. comprise samples with black color (black-level luma and zero-level chroma). The 360-degree panorama picture is placed on a (rectangular) block grid. The 360-degree panorama picture may e.g. represent a pseudo-cylindrical mapping of a spherical image. Figure 6 illustrates a pseudo-cylindrical image with an effective picture area 610 indicated by a solid line. The rectangular block grid 600 is depicted with a dashed line.
The input image is processed to become a reshaped picture that is non-rectangular and at least partially block-aligned. For that, one or more boundary blocks (e.g. blocks 601, 602, 603, 604) that contain a boundary of the effective picture area 610 are processed. Each of these boundary blocks (e.g. blocks 601, 602, 603, 604) may be processed according to one of the following examples 1-3, or a combination thereof. The alternatives may be combined for example by moving some of the samples of a boundary block (similarly to example 3 below) and removing the remaining samples of the boundary block (similarly to example 1 below). A different processing may be selected on a boundary block basis.
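Identifying the boundary blocks to be processed may be sketched in Python as follows. The sketch assumes a two-dimensional boolean mask of the effective picture area and an example block size; neither the mask representation nor the block size is mandated by the embodiments.

def boundary_blocks(effective_mask, block_size=64):
    # A boundary block is a block of the grid that contains both samples inside
    # and samples outside the effective picture area.
    height, width = len(effective_mask), len(effective_mask[0])
    blocks = []
    for block_y in range(0, height, block_size):
        for block_x in range(0, width, block_size):
            samples = [effective_mask[y][x]
                       for y in range(block_y, min(block_y + block_size, height))
                       for x in range(block_x, min(block_x + block_size, width))]
            if any(samples) and not all(samples):
                blocks.append((block_x, block_y))
    return blocks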
Example 1: Removal of the boundary block
A boundary block may be removed. This can be effective particularly when there are only a few samples within the boundary block that are also within the effective picture area. These samples may not be crucial for the displaying of the panorama image. It may be possible to interpolate or estimate the sample values of the removed samples in the decoding end. Figure 7 shows an example of a selection of a boundary block 710 that may suit removal of the boundary block. As can be seen, this selected boundary block 710 comprises only relatively few samples within the effective picture area. Figure 8 presents the reshaped picture resulting from the removal of the boundary block 710 shown in Figure 7.
Example 2: Copying sample values from an opposite-side boundary region of the effective picture area
The example 2 is explained with reference to Figure 9, although it is to be understood that the example is not limited to the exact setup in Figure 9, such as the selection of boundary blocks within the block grid, or whether the copying takes place from the right side of the effective picture area to fill in areas on the left of the effective picture area (as in Figure 9). Boundary blocks 910, 920 may be filled by copying pixel values from boundary blocks 930, 940, 950 on the other side of the image (Figure 9). This can be effective particularly when there is a significant amount of samples within the boundary block that are also within the effective picture area, but only few samples within the boundary block that are outside the effective picture area. These empty parts of the blocks 910, 920 may be filled with data from boundary blocks 930, 940, 950 on the other side to make the content of the blocks 910, 920 continuous. The blocks 910, 920 can therefore be more efficiently compressed compared to the same blocks without filling them with data from the opposite side of the effective picture area.
Example 3: Moving samples from an opposite-side boundary region of the effective picture area
The example 3 is explained with reference to Figures 10 and 11, although it is to be understood that the example is not limited to the exact setup in Figures 10 or 11, such as the selection of boundary blocks within the block grid, or whether the moving takes place from the right side of the effective picture area to fill in areas on the left of the effective picture area. Moving samples can be used for two purposes: filling a boundary block 1010 (Figure 10) or emptying a boundary block (Figure 11). As shown in Figure 10, boundary blocks 1010, 1020 may be filled by moving pixel values from boundary blocks 1030, 1040, 1050 on the other side of the image. In this case, most of the pixels of the blocks 1030, 1040, 1050 on the other side may be moved to the boundary blocks 1010, 1020.
In the other case, shown in Figure 11, the content of a boundary block 1130 may be moved to the other side to generate an empty boundary block 1130. This can be effective for example when there are few but important samples in the boundary block, and there is enough space in the boundary block of the other side. In Figure 11, the content of the boundary block 1130 at the right side is moved to the left boundary blocks 1109, 1110.
The copying or moving in Example 2 or Example 3, respectively, may take place for example on a sample row basis. When copying or moving samples from the right part of the effective picture area to extend the effective picture boundary to the left (e.g. as in Figures 9, 10, or 11), the sample-row-based copying or moving may be done as follows. In each sample row, the sample to be filled in closest to the left boundary of the effective picture area is copied or moved from the sample location closest to the boundary on the opposite side within the effective picture area, the sample to be filled in second closest to the left boundary of the effective picture area is copied or moved from the sample location second closest to the boundary on the opposite side within the effective picture area, and so on.
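The row-wise copying or moving described above may be sketched in Python as follows, for the case where the effective picture area is extended to the left from the right-hand boundary of the same sample row. The in-place list representation, the fill count and the value written into an emptied location are assumptions made for illustration.

def fill_left_from_right(picture, effective, row, n_fill, move=False):
    # picture: list of sample rows; effective: matching boolean mask (True inside
    # the effective picture area). Assumes the row contains effective samples and
    # that n_fill locations to the left of the effective area exist.
    width = len(picture[row])
    left = next(x for x in range(width) if effective[row][x])   # leftmost effective sample
    right = max(x for x in range(width) if effective[row][x])   # rightmost effective sample
    for i in range(n_fill):
        dst = left - 1 - i   # i-th location to be filled, growing leftwards
        src = right - i      # i-th location counted from the opposite-side boundary
        picture[row][dst] = picture[row][src]
        effective[row][dst] = True
        if move:             # Example 3: the source location is emptied as well
            picture[row][src] = 0
            effective[row][src] = False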
The reshaped picture (i.e., a picture that is reshaped according to any one or more of the previous examples 1-3) is then processed to obtain a rectangular picture. The sample values for the areas outside the reshaped picture may be obtained e.g. by either or both of: i) any signal extrapolation method, such as copying the boundary sample of the picture sample-row-wise to the adjacent area outside the reshaped picture; ii) setting the sample values to a pre-defined value.
In an embodiment, on the decoder side, the reconstruction process of the original pseudo-cylindrical image shape is executed as follows. The decoder is configured to decode a rectangular picture from a bitstream into a decoded rectangular picture. Then, the decoder is configured to determine a first effective picture area representing a 360-degree panorama picture, wherein the first effective picture area is non-rectangular. The first effective picture area represents the shape of the non-rectangular 360-degree panorama picture that was given as input to encoding.
The decoder is configured to determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block aligned. The second effective picture area within the decoded rectangular picture contains a representation of the pixels of the non-rectangular 360-degree panorama picture that was given as input to encoding.
The first effective picture area and/or the second effective picture area may be indicated in the bitstream and/or predefined e.g. in a coding standard. For example, one or more of the following may be used: i) it may be indicated that the coded picture represents a particular pseudo-cylindrical projection of a spherical picture. It may be pre-defined that a particular first processing done according to a pre-defined algorithm was performed in the encoding, hence determining the second effective picture area; ii) the first effective picture area may be provided through a mathematical function, whose coefficient values may be provided in the bitstream; iii) the first effective picture area and/or the second effective picture area may be provided using a mask image.
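As an illustration of option ii) above, the following Python sketch derives an effective picture area mask from a mathematical boundary function. A sinusoidal pseudo-cylindrical projection is used here purely as an assumed example of such a function; the actual function and its coefficient values could be conveyed in the bitstream as described.

import math

def effective_row_width(row, pic_width, pic_height):
    # Sinusoidal projection (assumed example): the effective width of a row is
    # proportional to the cosine of the latitude that the row represents.
    latitude = math.pi * (row + 0.5) / pic_height - math.pi / 2.0
    return pic_width * math.cos(latitude)

def effective_area_mask(pic_width, pic_height):
    mask = []
    for row in range(pic_height):
        w = effective_row_width(row, pic_width, pic_height)
        x0 = (pic_width - w) / 2.0   # effective area centred horizontally
        x1 = (pic_width + w) / 2.0
        mask.append([x0 <= x + 0.5 <= x1 for x in range(pic_width)])
    return mask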
The decoder is configured to identify a boundary block in which the first effective picture area and the second effective picture area do not match. Some samples may not have appropriate values in the first effective picture area within the boundary block. For example, on the left-hand side of the picture, the boundary of the first effective picture area may reside to the right of the boundary of the second effective picture area. Sample values for such sample locations are obtained by using one or more of the following techniques (technique 1, technique 2, technique 3). When more than one of the following techniques is used, the order of applying the techniques may be pre-defined, e.g. in a coding standard, or may be concluded e.g. by decoding ordering indications from the bitstream.
Technique 1: Interpolating or extrapolating removed boundary samples:
The decoder is configured to determine, e.g. using knowledge of the first effective picture area, the second effective picture area, and/or the first processing used in encoding, that a sample has been removed in the first processing used in encoding. The removed samples in the boundary block can be reconstructed from the neighboring samples and samples from the opposite side of the image (within the first or second effective picture area) through interpolation and/or extrapolation.
Technique 2: Copied samples from the opposite side:
The decoder is able to detect the extra samples which are copied to the outside of the effective picture area (Figure 9). The extra samples can be set e.g. to a determined value, such as a value representing black color, or can be effectively removed from the 360-degree panorama image.
Technique 3: Moved samples from the opposite side:
The decoder is able to detect the extra samples which have been moved from the opposite side of the image (Figures 10 and 11) and also the empty area inside the effective picture area corresponding to the locations of the moved pixels. The decoder can reconstruct the 360-degree panorama image by moving back the samples to the empty area on the opposite side of the picture.
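A row-wise Python sketch of Technique 3 is given below: sample locations that are effective in the first effective picture area but empty in the second one are refilled from the locations that carry the moved samples on the opposite side. The pairing order mirrors the row-wise convention assumed earlier, and the value written into a vacated location is an assumption (e.g. black).

def move_back_row(picture, row, first_effective, second_effective):
    width = len(picture[row])
    # Locations inside the first effective picture area that were emptied by encoding.
    holes = [x for x in range(width)
             if first_effective[row][x] and not second_effective[row][x]]
    # Locations outside the first effective picture area that carry moved samples.
    moved = [x for x in range(width)
             if second_effective[row][x] and not first_effective[row][x]]
    # Pair the hole closest to its boundary with the moved sample closest to the
    # boundary on the opposite side, and so on.
    for dst, src in zip(sorted(holes, reverse=True), sorted(moved, reverse=True)):
        picture[row][dst] = picture[row][src]
        picture[row][src] = 0   # now outside the first effective picture area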
By repeating the process for each boundary block in which sample values within the first effective picture area are not proper, the 360-degree panorama picture having the first effective picture area is reconstructed.
An embodiment of an encoding method is illustrated in Figure 12. The method comprises determining 1210 an effective picture area representing a 360-degree panorama picture, the effective picture area being non-rectangular; obtaining 1220 a mapped 360-degree panorama picture covering the effective picture area; obtaining 1230 a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area, the reshaped picture being non-rectangular and at least partially block-aligned, said first processing 1240 comprising: identifying a boundary block containing a boundary of the effective picture area; removing zero or more samples of the boundary block and setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways: 1) copying sample values from an opposite-side boundary region of the effective picture area; 2) moving samples from an opposite-side boundary region of the effective picture area; using 1250 a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing; and encoding 1260 the rectangular picture into a bitstream.
An embodiment of a decoding method is illustrated in Figure 13. The method comprises determining 1320 a first effective picture area representing a 360-degree panorama picture, the first effective picture area being non-rectangular; determining 1330 a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned; decoding 1335 a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area; obtaining 1340 a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing 1350 comprising: identifying a boundary block in which the first effective picture area and the second effective picture area do not match; processing the boundary block using a first processing, the first processing comprising one or more of the following: interpolating or extrapolating zero or more samples of the boundary block; copying samples from an opposite-side boundary region of the second effective picture area to the boundary block; moving samples from an opposite-side boundary region of the second effective picture area to the boundary block; setting samples outside the first effective picture area to a determined value, such as a value representing black color.
According to an embodiment, a mapped 360-degree panorama picture is resampled to another spatial resolution prior to its processing according to any presented encoding method. The resampling factor may be chosen e.g. to align the picture width with the block grid.
According to an embodiment, any of the presented decoding methods can be applied for intra picture decoding in a video decoder. The 360-degree panorama picture obtained by applying the presented decoding methods may be used as a reference picture for decoding other pictures, e.g. in an inter prediction process and/or an inter-layer prediction process.
According to an embodiment, any of the presented decoding methods is applied for intra picture encoding in a video encoder. A reconstructed 360-degree panorama picture is obtained similarly to applying the presented decoding methods; however, bitstream decoding may be omitted. The reconstructed 360-degree panorama picture may be used as a reference picture for encoding other pictures, e.g. in an inter prediction process and/or an inter-layer prediction process.
According to an embodiment, any of the presented decoding methods is applied as post-processing to video decoding, i.e., the decoded rectangular picture may be used as a reference picture for decoding other pictures, e.g., in an inter prediction process and/or an inter-layer prediction process. According to an embodiment, any of the presented decoding methods is applied to image decoding within an image decoding process or as post-processing to image decoding.
According to an embodiment, any of the presented encoding methods is applied as pre-processing to video encoding, i.e., the reconstructed rectangular picture (obtained from the encoded rectangular picture) may be used as a reference picture for encoding other pictures, e.g. in an inter prediction process and/or an inter- layer prediction process.
According to an embodiment, any of the presented encoding methods is applied to image encoding within an image encoding process or as pre-processing to image encoding.
According to an embodiment, which may be applied together with or independently of other embodiments, after a 360-degree panorama picture with a pseudo-cylindrical effective picture area has been reconstructed (in an encoder) or decoded (in a decoder), some or all areas outside the effective picture area are filled in as described in the following prior to storing the reconstructed/decoded picture as a reference picture for inter prediction and/or inter-layer prediction. Non-effective picture areas of the reconstructed/decoded picture are filled in by copying samples from the opposite side of the effective picture area (e.g., on a sample row basis, the first sample outside the effective picture area is copied from the first sample inside the effective picture area on the opposite side, the second sample outside the effective picture area is copied from the second sample inside the effective picture area on the opposite side, and so on). Consequently, continuity of the picture data is obtained across the boundary of the effective picture area. Such copying may be performed for example for a certain horizontal margin beside the effective picture area. The copying may be followed by padding or other signal extrapolation to fill in the sample locations that remained unpadded. The reconstructed/decoded picture with processed areas outside the effective picture area is then stored as a reference picture for inter prediction and/or inter-layer prediction.
The embodiment described in the previous paragraph can improve compression efficiency by making inter prediction and/or inter-layer prediction more precise due to the following reasons. The non-effective picture area is filled with samples from the opposite side of the effective picture area, which provides continuity in the boundary areas of the reference picture. Motion vectors used in inter prediction and/or inter-layer prediction can therefore point to a prediction block that is partly or fully outside the effective picture area, while the prediction block remains continuous. Fractional-sample interpolation for prediction samples inside but close to the effective picture boundary may use samples outside the effective picture area as input. Filling in the samples outside the effective picture area may therefore make the input signal for fractional sample interpolation more continuous and hence improve the preciseness of fractional sample values. The process of filling in some or all areas outside the effective picture area prior to storing the reconstructed/decoded picture as a reference picture for inter prediction and/or inter-layer prediction may be performed before or after applying in-loop filtering, such as deblocking filtering or sample adaptive offset filtering, or may be applied in between certain filtering steps.
It needs to be understood that rather than copying samples into the non-effective picture area from the opposite side of the effective picture area, embodiments can be realized by wrapping over the sample locations used for intra prediction, inter prediction, and/or inter-layer prediction to be within the effective picture area, but on the opposite side. This may apply to both fractional sample interpolation and selecting samples at integer sample positions for intra prediction, inter prediction, and/or inter-layer prediction.
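A minimal sketch of such wrap-over addressing is shown below: a referenced horizontal sample location outside the effective picture area is wrapped to the opposite side of the same row. The per-row effective extents are assumed to be known (e.g. derived from the effective picture area), and only horizontal wrapping is illustrated.

def wrap_x(x, left, right):
    # left and right are the inclusive horizontal extents of the effective
    # picture area on the current sample row.
    effective_width = right - left + 1
    return left + (x - left) % effective_width

def fetch_prediction_sample(picture, row, x, row_extents):
    # Wrap the referenced location instead of reading a padded sample.
    left, right = row_extents[row]
    return picture[row][wrap_x(x, left, right)]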
In an embodiment, after reconstructing (in an encoder) or decoding (in a decoder) a prediction error block, the samples of the prediction error block that are outside the effective picture area are set to 0. The modified prediction error block may then be summed with the respective prediction block to obtain a reconstructed or decoded block.
In the above, some embodiments have been described by assuming that curves of constant latitude in the spherical domain are projected into horizontal lines (with a constant vertical coordinate value) in the mapped 360-degree panorama picture. More generally, these embodiments apply when curves of constant latitude are projected into contiguous curves with a wrap-over point at the boundary of the effective picture area, in which case the samples of the opposite-side boundary region are found along the same curve of constant latitude. Some projections may result in curves of constant longitude in the spherical domain mapping into contiguous curves with a wrap-over point at the boundary of the effective picture area, in which case the samples of the opposite-side boundary region are found along the same curve of constant longitude.
The various embodiments may provide advantages. For example, the embodiments improve the compression efficiency of coding of pseudo-cylindrically projected spherical images. This means that the present embodiments reduce the bitrate while keeping the picture quality unchanged. In addition, the present embodiments reduce the visible artefacts of the boundary areas of the pseudo-cylindrically projected spherical images.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Further, a network device, such as a server, may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.

Claims:
1. A method comprising:
determining an effective picture area representing 360-degree panorama picture, the effective picture area being non-rectangular;
obtaining a mapped 360-degree panorama picture covering the effective picture area;
obtaining a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising:
o identifying a boundary block containing a boundary of the effective picture area;
o performing one or more of the following:
o removing zero or more samples of the boundary block and within the effective picture area;
o setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways:
• copying sample values from an opposite-side boundary region of the effective picture area;
• moving samples from an opposite-side boundary region of the effective picture area;
- using a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing;
encoding the rectangular picture into a bitstream.
2. The method according to claim 1, wherein the effective picture area is determined by a mapping applied to a source picture.
3. The method according to claim 2, wherein said source picture is an equirectangular 360-degree picture.
4. The method according to claim 2, wherein said source picture is a spherical picture.
5. The method according to claim 2, wherein said mapping is a pseudo-cylindrical mapping.
6. The method according to claim 2, wherein said mapping is specified as a mathematical function.
7. The method according to claim 1, wherein the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
8. The method according to claim 1, further comprising encoding one or more indications into the bitstream, the one or more indications indicative of one or more of the following:
the second processing;
the first processing;
the effective picture area;
the encoded rectangular picture representing a 360-degree panorama picture.
9. The method according to claim 1, wherein said second processing comprises:
identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture;
setting sample values of samples adjacent to the second boundary block and outside the block-aligned non-rectangular picture in one of the following ways:
extrapolating boundary sample values of the second boundary block; and
deriving sample values at least partially from the second boundary block.
10. The method according to claim 9, wherein said extrapolating matches with an intra prediction.
11. The method according to claim 10, wherein the intra prediction is a horizontal intra prediction.
12. A method comprising:
determining a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular;
determining a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned;
decoding a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area;
- obtaining a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising:
o identifying a boundary block in which the first effective picture area and the second effective picture area do not match;
o processing the boundary block using a first processing, the first processing comprising one or more of the following:
• interpolating or extrapolating zero or more samples of the boundary block;
• moving samples from an opposite-side boundary region of the second effective picture area to the boundary block;
• setting samples outside the first effective picture area to a determined value.
13. The method according to claim 12, wherein the determined value is a value representing black color.
14. The method according to claim 12, wherein the first effective picture area is determined by a mapping applied to a source picture.
15. The method according to claim 14, wherein the source picture is an equirectangular 360-degree picture.
16. The method according to claim 14, wherein the source picture is a spherical picture.
17. The method according to claim 14, wherein the mapping is a pseudo-cylindrical mapping.
18. The method according to claim 14, wherein said mapping is specified as a mathematical function.
19. The method according to claim 12, wherein the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
20. The method according to claim 12, further comprising decoding one or more indications from the bitstream, the one or more indications indicative of one or more of the following:
the first processing;
the first effective picture area;
the second effective picture area;
the rectangular picture representing a 360-degree panorama picture.
21. An apparatus comprising at least one processor; at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
determine an effective picture area representing 360-degree panorama picture, the effective picture area being non-rectangular;
obtain a mapped 360-degree panorama picture covering the effective picture area;
obtain a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising:
o identifying a boundary block containing a boundary of the effective picture area;
o performing one or more of the following:
o removing zero or more samples of the boundary block and within the effective picture area;
o setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways:
■ copying sample values from an opposite-side boundary region of the effective picture area;
■ moving samples from an opposite-side boundary region of the effective picture area;
use a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing;
encode the rectangular picture into a bitstream.
22. The apparatus according to claim 21, wherein the effective picture area is determined by a mapping applied to a source picture.
23. The apparatus according to claim 22, wherein said source picture is an equirectangular 360-degree picture.
24. The apparatus according to claim 22, wherein said source picture is a spherical picture.
25. The apparatus according to claim 22, wherein said mapping is a pseudo-cylindrical mapping.
26. The apparatus according to claim 22, wherein said mapping is specified as a mathematical function.
27. The apparatus according to claim 21, wherein the boundary of the effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
28. The apparatus according to claim 21, further comprising computer program code to cause the apparatus to encode one or more indications into the bitstream, the one or more indications indicative of one or more of the following:
the second processing;
the first processing;
the effective picture area;
the encoded rectangular picture representing a 360-degree panorama picture.
29. The apparatus according to claim 21, wherein said second processing comprises:
identifying a second boundary block comprising a boundary of the block-aligned non-rectangular picture;
setting sample values of samples adjacent to the second boundary block and outside the block-aligned non-rectangular picture in one of the following ways:
- extrapolating boundary sample values of the second boundary block; and
deriving sample values at least partially from the second boundary block.
30. The apparatus according to claim 29, wherein said extrapolating matches with an intra prediction.
31. The apparatus according to claim 30, wherein the intra prediction is a horizontal intra prediction.
32. An apparatus comprising at least one processor; at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- determine a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular;
determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned;
decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area;
obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising:
o identifying a boundary block in which the first effective picture area and the second effective picture area do not match;
o processing the boundary block using a first processing, the first processing comprising one or more of the following:
• interpolating or extrapolating zero or more samples of the boundary block;
• moving samples from an opposite-side boundary region of the second effective picture area to the boundary block;
• setting samples outside the first effective picture area to a determined value.
33. The apparatus according to claim 32, wherein the determined value is a value representing black color.
34. The apparatus according to claim 32, wherein the first effective picture area is determined by a mapping applied to a source picture.
35. The apparatus according to claim 34, wherein the source picture is an equirectangular 360-degree picture.
36. The apparatus according to claim 34, wherein the source picture is a spherical picture.
37. The apparatus according to claim 34, wherein the mapping is a pseudo-cylindrical mapping.
38. The apparatus according to claim 34, wherein said mapping is specified as a mathematical function.
39. The apparatus according to claim 32, wherein the boundary of the first effective picture area is bilaterally symmetric with a horizontal symmetry axis and/or a vertical symmetry axis.
40. The apparatus according to claim 32, further comprising computer program code to cause the apparatus to decode one or more indications from the bitstream, the one or more indications indicative of one or more of the following:
the first processing;
the first effective picture area;
the second effective picture area;
the rectangular picture representing a 360-degree panorama picture.
41. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
determine an effective picture area representing 360-degree panorama picture, the effective picture area being non-rectangular;
obtain a mapped 360-degree panorama picture covering the effective picture area;
obtain a reshaped picture from the mapped 360-degree panorama picture using a first processing to at least one boundary block containing a boundary of the effective picture area; the reshaped picture being non-rectangular and at least partially block-aligned, said first processing comprising:
o identifying a boundary block containing a boundary of the effective picture area;
o performing one or more of the following:
o removing zero or more samples of the boundary block and within the effective picture area;
o setting sample values of samples in the boundary block and outside the effective picture area in one or both of the following ways:
• copying sample values from an opposite-side boundary region of the effective picture area;
• moving samples from an opposite-side boundary region of the effective picture area;
use a second processing to the reshaped picture to obtain a rectangular picture, the second processing differing from the first processing;
encode the rectangular picture into a bitstream.
42. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
determine a first effective picture area representing 360-degree panorama picture, the first effective picture area being non-rectangular;
determine a second effective picture area representing a reshaped picture that is non-rectangular and at least partly block-aligned;
- decode a rectangular picture from a bitstream, the decoded rectangular picture having the second effective picture area;
obtain a 360-degree panorama picture with the first effective picture area from the decoded rectangular picture using first processing comprising:
o identifying a boundary block in which the first effective picture area and the second effective picture area do not match;
o processing the boundary block using a first processing, the first processing comprising one or more of the following:
• interpolating or extrapolating zero or more samples of the boundary block;
• moving samples from an opposite-side boundary region of the second effective picture area to the boundary block;
• setting samples outside the first effective picture area to a determined value.
PCT/FI2017/050167 2016-03-15 2017-03-14 A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video WO2017158236A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1604346.5 2016-03-15
GB1604346.5A GB2548358A (en) 2016-03-15 2016-03-15 A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video

Publications (2)

Publication Number Publication Date
WO2017158236A2 true WO2017158236A2 (en) 2017-09-21
WO2017158236A3 WO2017158236A3 (en) 2017-11-23

Family

ID=55952310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2017/050167 WO2017158236A2 (en) 2016-03-15 2017-03-14 A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video

Country Status (2)

Country Link
GB (1) GB2548358A (en)
WO (1) WO2017158236A2 (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018167419A1 (en) * 2017-03-16 2018-09-20 Orange Method for encoding and decoding images, encoding and decoding device, and corresponding computer programs
WO2019083119A1 (en) * 2017-10-23 2019-05-02 엘지전자 주식회사 Image decoding method and device using rotation parameters in image coding system for 360-degree video
WO2019083120A1 (en) * 2017-10-23 2019-05-02 엘지전자 주식회사 Image decoding method and device using reference picture derived by projecting rotated 360-degree video in image coding system for 360-degree video
CN109792490A (en) * 2018-06-07 2019-05-21 香港应用科技研究院有限公司 The improved pseudo- cylindrical projection of spherical video for stream picture compression
CN109996072A (en) * 2018-01-03 2019-07-09 华为技术有限公司 The processing method and processing device of video image
CN110249629A (en) * 2017-01-31 2019-09-17 夏普株式会社 For the system and method by picture segmentation at video block to carry out video codes processing
WO2019232811A1 (en) * 2018-06-07 2019-12-12 Hong Kong Applied Science and Technology Research Institute Company Limited Modified pseudo-cylindrical mapping of spherical video using linear interpolation of empty areas for compression of streamed images
WO2019237027A1 (en) * 2018-06-08 2019-12-12 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment resampling
CN110675401A (en) * 2018-07-02 2020-01-10 浙江大学 Panoramic image pixel block filtering method and device
CN110754091A (en) * 2017-07-05 2020-02-04 高通股份有限公司 Deblocking filtering for 360 degree video coding
TWI702567B (en) * 2017-09-27 2020-08-21 聯發科技股份有限公司 Method for processing projection-based frame that includes at least one projection face packed in 360-degree virtual reality projection layout
WO2020182092A1 (en) * 2019-03-08 2020-09-17 Beijing Bytedance Network Technology Co., Ltd. Constraints on model-based reshaping in video processing
CN111713111A (en) * 2017-12-19 2020-09-25 Vid拓展公司 Face discontinuity filtering for 360 degree video coding
WO2021061312A1 (en) * 2019-09-23 2021-04-01 Alibaba Group Holding Limited Filters for motion compensation interpolation with reference down-sampling
US11004173B2 (en) 2017-03-13 2021-05-11 Mediatek Inc. Method for processing projection-based frame that includes at least one projection face packed in 360-degree virtual reality projection layout
US11057643B2 (en) 2017-03-13 2021-07-06 Mediatek Inc. Method and apparatus for generating and encoding projection-based frame that includes at least one padding region and at least one projection face packed in 360-degree virtual reality projection layout
WO2021137917A1 (en) * 2019-12-30 2021-07-08 Tencent America LLC Method for parameter set reference constraints in coded video stream
US11164339B2 (en) 2019-11-12 2021-11-02 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment temporal resampling
US11463713B2 (en) 2019-05-08 2022-10-04 Beijing Bytedance Network Technology Co., Ltd. Conditions for applicability of cross-component coding
US11463714B2 (en) 2019-04-18 2022-10-04 Beijing Bytedance Network Technology Co., Ltd. Selective use of cross component mode in video coding
US11494870B2 (en) 2017-08-18 2022-11-08 Mediatek Inc. Method and apparatus for reducing artifacts in projection-based frame
US11533487B2 (en) 2019-07-07 2022-12-20 Beijing Bytedance Network Technology Co., Ltd. Signaling of chroma residual scaling
RU2787213C1 (en) * 2019-12-30 2022-12-30 Тенсент Америка Ллс Method for restricting referencing to a set of parameters in an encoded video stream
US11659164B1 (en) 2019-04-23 2023-05-23 Beijing Bytedance Network Technology Co., Ltd. Methods for cross component dependency reduction
US11798126B2 (en) 2018-07-30 2023-10-24 Hewlett-Packard Development Company, L.P. Neural network identification of objects in 360-degree images
US11924472B2 (en) 2019-06-22 2024-03-05 Beijing Bytedance Network Technology Co., Ltd. Syntax element for chroma residual scaling

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113873262B (en) 2016-10-04 2023-03-24 有限公司B1影像技术研究所 Image data encoding/decoding method and apparatus
CN117176948A (en) * 2016-10-04 2023-12-05 有限公司B1影像技术研究所 Image encoding/decoding method, recording medium, and method of transmitting bit stream
EP3686833A1 (en) * 2019-01-24 2020-07-29 Koninklijke Philips N.V. Generating and processing an image property pixel structure
CN113962867B (en) * 2021-12-22 2022-03-15 深圳思谋信息科技有限公司 Image processing method, image processing device, computer equipment and storage medium
CN116760965B (en) * 2023-08-14 2023-12-22 腾讯科技(深圳)有限公司 Panoramic video encoding method, device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2180934T3 (en) * 1996-03-28 2003-02-16 Koninkl Philips Electronics Nv METHOD AND PROVISION FOR CODING AND DECODING IMAGES.
JP2003141562A (en) * 2001-10-29 2003-05-16 Sony Corp Image processing apparatus and method for nonplanar image, storage medium, and computer program
JP4475643B2 (en) * 2004-06-29 2010-06-09 Canon Inc. Image coding apparatus and method
KR100677142B1 (en) * 2004-08-13 2007-02-02 Kyung Hee University Industry-Academic Cooperation Foundation Motion estimation and compensation for panorama image

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110249629A (en) * 2017-01-31 2019-09-17 Sharp Kabushiki Kaisha Systems and methods for partitioning a picture into video blocks for video coding
US11004173B2 (en) 2017-03-13 2021-05-11 Mediatek Inc. Method for processing projection-based frame that includes at least one projection face packed in 360-degree virtual reality projection layout
US11057643B2 (en) 2017-03-13 2021-07-06 Mediatek Inc. Method and apparatus for generating and encoding projection-based frame that includes at least one padding region and at least one projection face packed in 360-degree virtual reality projection layout
WO2018167419A1 (en) * 2017-03-16 2018-09-20 Orange Method for encoding and decoding images, encoding and decoding device, and corresponding computer programs
FR3064145A1 (en) * 2017-03-16 2018-09-21 Orange METHOD FOR ENCODING AND DECODING IMAGES, CORRESPONDING ENCODING AND DECODING DEVICE AND COMPUTER PROGRAMS
US11159826B2 (en) 2017-03-16 2021-10-26 Orange Method for encoding and decoding images, encoding and decoding device, and corresponding computer programs
CN110754091B (en) * 2017-07-05 2023-11-07 Qualcomm Incorporated Deblocking filtering for 360 degree video coding
CN110754091A (en) * 2017-07-05 2020-02-04 Qualcomm Incorporated Deblocking filtering for 360 degree video coding
US11494870B2 (en) 2017-08-18 2022-11-08 Mediatek Inc. Method and apparatus for reducing artifacts in projection-based frame
TWI702567B (en) * 2017-09-27 2020-08-21 Mediatek Inc. Method for processing projection-based frame that includes at least one projection face packed in 360-degree virtual reality projection layout
WO2019083120A1 (en) * 2017-10-23 2019-05-02 LG Electronics Inc. Image decoding method and device using reference picture derived by projecting rotated 360-degree video in image coding system for 360-degree video
WO2019083119A1 (en) * 2017-10-23 2019-05-02 LG Electronics Inc. Image decoding method and device using rotation parameters in image coding system for 360-degree video
CN111713111A (en) * 2017-12-19 2020-09-25 Vid Scale, Inc. Face discontinuity filtering for 360 degree video coding
CN111713111B (en) * 2017-12-19 2024-04-09 Vid Scale, Inc. Face discontinuity filtering for 360 degree video coding
CN109996072B (en) * 2018-01-03 2021-10-15 Huawei Technologies Co., Ltd. Video image processing method and device
US11463700B2 (en) 2018-01-03 2022-10-04 Huawei Technologies Co., Ltd. Video picture processing method and apparatus
CN109996072A (en) * 2018-01-03 2019-07-09 Huawei Technologies Co., Ltd. Video picture processing method and apparatus
CN109792490B (en) * 2018-06-07 2021-01-15 Hong Kong Applied Science and Technology Research Institute Company Limited Improved pseudo-cylindrical mapping of spherical video for streaming image compression
WO2019232811A1 (en) * 2018-06-07 2019-12-12 Hong Kong Applied Science and Technology Research Institute Company Limited Modified pseudo-cylindrical mapping of spherical video using linear interpolation of empty areas for compression of streamed images
CN109792490A (en) * 2018-06-07 2019-05-21 Hong Kong Applied Science and Technology Research Institute Company Limited Improved pseudo-cylindrical mapping of spherical video for streaming image compression
US10735765B2 (en) 2018-06-07 2020-08-04 Hong Kong Applied Science and Technology Research Institute Company, Limited Modified pseudo-cylindrical mapping of spherical video using linear interpolation of empty areas for compression of streamed images
US10848768B2 (en) 2018-06-08 2020-11-24 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment resampling
WO2019237027A1 (en) * 2018-06-08 2019-12-12 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment resampling
CN110675401A (en) * 2018-07-02 2020-01-10 Zhejiang University Panoramic image pixel block filtering method and device
US11798126B2 (en) 2018-07-30 2023-10-24 Hewlett-Packard Development Company, L.P. Neural network identification of objects in 360-degree images
CN113545049A (en) * 2019-03-08 2021-10-22 Beijing Bytedance Network Technology Co., Ltd. Constraints on model-based shaping in video processing
RU2812648C2 (en) * 2019-03-08 2024-01-31 Beijing Bytedance Network Technology Co., Ltd. Method and device for processing video data and media for information storage
CN113545049B (en) * 2019-03-08 2024-04-19 Beijing Bytedance Network Technology Co., Ltd. Constraints on model-based shaping in video processing
WO2020182092A1 (en) * 2019-03-08 2020-09-17 Beijing Bytedance Network Technology Co., Ltd. Constraints on model-based reshaping in video processing
US11910020B2 (en) 2019-03-08 2024-02-20 Beijing Bytedance Network Technology Co., Ltd Signaling of reshaping information in video processing
US11284084B1 (en) 2019-03-08 2022-03-22 Beijing Bytedance Network Technology Co., Ltd. Constraints on model-based reshaping in video processing
US20210321140A1 (en) 2019-03-08 2021-10-14 Beijing Bytedance Network Technology Co., Ltd. Signaling of reshaping information in video processing
US11553194B2 (en) 2019-04-18 2023-01-10 Beijing Bytedance Network Technology Co., Ltd. Parameter derivation in cross component mode
US11463714B2 (en) 2019-04-18 2022-10-04 Beijing Bytedance Network Technology Co., Ltd. Selective use of cross component mode in video coding
US11616965B2 (en) 2019-04-18 2023-03-28 Beijing Bytedance Network Technology Co., Ltd. Restriction on applicability of cross component mode
US11750799B2 (en) 2019-04-23 2023-09-05 Beijing Bytedance Network Technology Co., Ltd Methods for cross component dependency reduction
US11659164B1 (en) 2019-04-23 2023-05-23 Beijing Bytedance Network Technology Co., Ltd. Methods for cross component dependency reduction
US11463713B2 (en) 2019-05-08 2022-10-04 Beijing Bytedance Network Technology Co., Ltd. Conditions for applicability of cross-component coding
US11924472B2 (en) 2019-06-22 2024-03-05 Beijing Bytedance Network Technology Co., Ltd. Syntax element for chroma residual scaling
US11956439B2 (en) 2019-07-07 2024-04-09 Beijing Bytedance Network Technology Co., Ltd. Signaling of chroma residual scaling
US11533487B2 (en) 2019-07-07 2022-12-20 Beijing Bytedance Network Technology Co., Ltd. Signaling of chroma residual scaling
US11317122B2 (en) 2019-09-23 2022-04-26 Alibaba Group Holding Limited Filters for motion compensation interpolation with reference down-sampling
WO2021061312A1 (en) * 2019-09-23 2021-04-01 Alibaba Group Holding Limited Filters for motion compensation interpolation with reference down-sampling
US11889121B2 (en) 2019-09-23 2024-01-30 Alibaba Group Holding Limited Filters for motion compensation interpolation with reference down-sampling
US11164339B2 (en) 2019-11-12 2021-11-02 Sony Interactive Entertainment Inc. Fast region of interest coding using multi-segment temporal resampling
RU2787213C1 (en) * 2019-12-30 2022-12-30 Tencent America LLC Method for restricting referencing to a set of parameters in an encoded video stream
WO2021137917A1 (en) * 2019-12-30 2021-07-08 Tencent America LLC Method for parameter set reference constraints in coded video stream
US11356698B2 (en) 2019-12-30 2022-06-07 Tencent America LLC Method for parameter set reference constraints in coded video stream
US11706445B2 (en) 2019-12-30 2023-07-18 Tencent America LLC Method for parameter set reference constraints in coded video stream
US11689741B2 (en) 2019-12-30 2023-06-27 Tencent America LLC Method for parameter set reference constraints in coded video stream

Also Published As

Publication number Publication date
GB201604346D0 (en) 2016-04-27
GB2548358A (en) 2017-09-20
WO2017158236A3 (en) 2017-11-23

Similar Documents

Publication Publication Date Title
US20210297697A1 (en) Method, an apparatus and a computer program product for coding a 360-degree panoramic video
WO2017158236A2 (en) A method, an apparatus and a computer program product for coding a 360-degree panoramic images and video
US10979727B2 (en) Apparatus, a method and a computer program for video coding and decoding
AU2017236196B2 (en) An apparatus, a method and a computer program for video coding and decoding
US20190268599A1 (en) An apparatus, a method and a computer program for video coding and decoding
US10368097B2 (en) Apparatus, a method and a computer program product for coding and decoding chroma components of texture pictures for sample prediction of depth pictures
US20190349598A1 (en) An Apparatus, a Method and a Computer Program for Video Coding and Decoding
US20140098883A1 (en) Method and apparatus for video coding
US20130229485A1 (en) Apparatus, a Method and a Computer Program for Video Coding and Decoding
WO2017162911A1 (en) An apparatus, a method and a computer program for video coding and decoding
EP3523956B1 (en) An apparatus and method for video processing
US20170078703A1 (en) Apparatus, a method and a computer program for video coding and decoding
WO2019158812A1 (en) A method and an apparatus for motion compensation
WO2020008107A1 (en) A method, an apparatus and a computer program product for video encoding and decoding
EP3973709A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2019211514A1 (en) Video encoding and decoding

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17765919; Country of ref document: EP; Kind code of ref document: A2)
122 Ep: pct application non-entry in european phase (Ref document number: 17765919; Country of ref document: EP; Kind code of ref document: A2)