WO2014102088A1

WO2014102088A1 - Decoder and encoder and methods therein for handling frame packing layering

Info

Publication number: WO2014102088A1
Application number: PCT/EP2013/076880
Authority: WO
Inventors: Jonatan Samuelsson; Rickard Sjöberg
Original assignee: Telefonaktiebolaget L M Ericsson (Publ)
Priority date: 2012-12-28
Filing date: 2013-12-17
Publication date: 2014-07-03

Abstract

A decoder (50) and a method therein for managing multiple views of a coded video sequence as well as an encoder (80) and a method therein for encoding multiple views of a video sequence into a coded video sequence are disclosed. The multiple views are temporally interleaved. The decoder (50) obtains (207) an identity associated with a Network Abstraction Layer "NAL" unit of the coded video sequence. The identity relates to a temporal layer. The decoder (50) discards (208) the NAL unit when the identity is above or equal to a threshold value. The encoder (80) encodes (201) all pictures belonging to a first view into one or more NAL units, wherein each NAL unit has a respective temporal identity equal to or less than a first value. The encoder (80) encodes (202) all pictures belonging to a second view into one or more NAL units, wherein each NAL unit has a respective temporal identity greater than the first value. Computer programs (55, 85) and computer program products (56, 86) are also disclosed. A network node (110) for managing multiple views of a coded video sequence is disclosed.

Description

DECODER AND ENCODER AND METHODS THEREIN FOR HANDLING FRAME PACKING LAYERING

TECHNICAL FIELD

Embodiments herein relate to video coding. In particular, a decoder and a method therein for managing multiple views of a coded video sequence as well as an encoder and a method therein for encoding multiple views of a video sequence into a coded video sequence are disclosed. Moreover, corresponding computer programs and computer program products are disclosed. Finally, a network node for managing multiple views of a coded video sequence is disclosed.

BACKGROUND

With video coding technologies, it is often desired to compress a video sequence into a coded video sequence. The video sequence may for example have been captured by a video camera. A purpose of compressing the video sequence is to reduce a size, e.g. in bits, of the video sequence. In this manner, the coded video sequence will require smaller memory when stored and/or less bandwidth when transmitted from e.g. the video camera. A so called encoder is often used to perform compression, or encoding, of the video sequence. Hence, the video camera may comprise the encoder. The coded video sequence may be transmitted from the video camera to a display device, such as a television set (TV) or the like. In order for the TV to be able to decompress, or decode, the coded video sequence, it may comprise a so called decoder. This means that the decoder is used to decode the received coded video sequence. In other scenarios, the encoder may be comprised in a radio base station of a cellular communication system and the decoder may be comprised in a wireless device, such as a cellular phone or the like, and vice versa.

A known video coding technology is called High Efficiency Video Coding (HEVC), which is a new video coding standard, currently being developed by Joint Collaborative Team - Video Coding (JCT-VC). JCT-VC is a collaborative project between Moving Picture Experts Group (MPEG) and International Telecommunication Union's Telecommunication Standardization Sector (ITU-T). A coded picture of an HEVC bitstream is included in an access unit, which comprises a set of Network Abstraction Layer (NAL) units. NAL units are thus a format of packages which form the bitstream. The coded picture can consist of one or more slices with a slice header, i.e. one or more Video Coding Layer (VCL) NAL units, that refer to a Picture Parameter Set (PPS), i.e. a NAL unit identified by NAL unit type PPS. A slice is a spatially distinct region of the coded picture, also known as a frame, which is encoded separately from any other region in the same coded picture. The PPS contains information that is valid for one or more coded pictures. Another parameter set is referred to as a Sequence Parameter Set (SPS). The SPS contains information that is valid for an entire Coded Video Sequence (CVS) such as cropping window parameters that are applied to pictures when they are output from the decoder. Pictures that have been decoded are stored by the decoder in a Decoded Picture Buffer (DPB) in their original format, that is to say 'not cropped' in this example.

HEVC defines a Video Usability Information (VUI) syntax structure that can be present in the SPS and contains parameters that do not affect the decoding process. Supplemental Enhancement Information (SEI) is another structure that can be present in any access unit and that contains information that does not affect the decoding process. An access unit consists of a coded picture, in one or more VCL NAL units, and additional Network Abstraction Layer (NAL) units, comprising parameter sets or SEI messages.

HEVC defines a Frame Packing Arrangement (FPA) SEI message used to indicate that the coded video sequence in the bitstream consists of two representations, hereafter called views, such as a view A and a view B. Thus, frame packing implies multiple views in one coded video sequence. Typically the SEI message is used for coding of stereo video (plano-stereoscopic) where view A represents the left view and view B represents the right view. It is not necessary, however, that they represent the two views of a stereo sequence; the second "view" can alternatively, for example, represent additional chroma information of the first view or depth information of the first view. It could also e.g. be texture and depth or even two completely unrelated sequences encoded together.

HEVC defines temporal sub-layers. For each picture the variable TemporalId, calculated from the syntax element nuh_temporal_id_plus1 in the NAL unit header, indicates which temporal sub-layer the picture belongs to. A lower temporal sub-layer cannot depend on a higher temporal sub-layer and a sub-bitstream extraction process requires that when one or more of the highest temporal sub-layers are removed from a bitstream the remaining bitstream shall be a conforming bitstream. As an example, lower temporal sub-layers may be associated with a display rate, or bit rate, that is lower than a display rate, or a bit rate, corresponding to a higher temporal sub-layer. It shall be understood that temporal sub-layers enable temporal scalability of the bitstream while only decoding the NAL unit header, i.e. there is no need to parse parts of the bitstream which are to be removed when it is desired to for example decrease the required bit rate. This means that each picture is associated with a temporal identity, such as the variable TemporalId. As an example, a picture with TemporalId A may not reference any picture with TemporalId B if B > A. It is required that when removing all pictures, or NAL units, with TemporalId higher than N the remaining bitstream shall be a conforming bitstream, i.e. a decodable bitstream that complies with the HEVC specification. TemporalId can be used by network nodes or decoders to perform temporal scaling or trick-play, e.g. fast forward playout.
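As an illustration of reading the temporal identity from the NAL unit header alone, the following is a minimal sketch. The two-byte field layout it assumes (a 1-bit forbidden_zero_bit, a 6-bit nal_unit_type, six further bits, and a 3-bit nuh_temporal_id_plus1) is taken from the published HEVC specification rather than from this text, and the names `NalHeader` and `parse_nal_header` are hypothetical.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    nal_unit_type: int
    temporal_id: int


def parse_nal_header(nal_unit: bytes) -> NalHeader:
    """Read the NAL unit type and TemporalId from the first two bytes.

    Assumed layout (not stated in this text): forbidden_zero_bit (1 bit),
    nal_unit_type (6 bits), 6 further bits, nuh_temporal_id_plus1 (3 bits).
    """
    if len(nal_unit) < 2:
        raise ValueError("expected at least a two-byte NAL unit header")
    b0, b1 = nal_unit[0], nal_unit[1]
    if b0 & 0x80:
        raise ValueError("forbidden_zero_bit must be 0")
    nal_unit_type = (b0 >> 1) & 0x3F
    nuh_temporal_id_plus1 = b1 & 0x07
    if nuh_temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    # TemporalId is derived as nuh_temporal_id_plus1 - 1.
    return NalHeader(nal_unit_type, nuh_temporal_id_plus1 - 1)
```

Only the header bytes are inspected, so the payload of a NAL unit that is going to be removed never needs to be parsed.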

A Random Access Point (RAP) picture, also referred to as an Intra RAP (IRAP) picture, is a picture that contains only I slices. The first picture in the bitstream must be a RAP picture. Provided that the necessary parameter sets are available when they need to be activated, the RAP picture and all subsequent pictures in both decoding order and output order can be correctly decoded without decoding any picture that precedes the RAP picture in decoding order.

Broken Link Access (BLA) pictures, Instantaneous Decoder Refresh (IDR) pictures and Clean Random Access (CRA) pictures are the three types of RAP pictures defined in HEVC.

A CVS consists of a BLA or an IDR picture and all pictures up until, but not including, the next BLA or IDR picture. The first coded video sequence of a bitstream may start with a CRA picture instead of a BLA or IDR picture.

The FPA SEI message contains a syntax element called frame_packing_arrangement_type which indicates what type of frame packing is being used. The following 8 values of frame_packing_arrangement_type are defined:

Value   Interpretation

0       Each component plane of the decoded frames contains a "checkerboard" based interleaving of corresponding planes of two constituent frames as illustrated in Figure 12.

1       Each component plane of the decoded frames contains a column based interleaving of corresponding planes of two constituent frames as illustrated in Figure 13 and 14.

2       Each component plane of the decoded frames contains a row based interleaving of corresponding planes of two constituent frames as illustrated in Figure 15 and 16.

3       Each component plane of the decoded frames contains a side-by-side packing arrangement of corresponding planes of two constituent frames as illustrated in Figure 17, 18 and 21.

4       Each component plane of the decoded frames contains a top-bottom packing arrangement of corresponding planes of two constituent frames as illustrated in Figure 19 and 20.

5       The component planes of the decoded frames in output order form a temporal interleaving of alternating first and second constituent frames as illustrated in Figure 22.

6       Each decoded frame constitutes a single frame without the use of a frame packing of multiple constituent frames (see NOTE 5 in section "Appendix").

7       Each component plane of the decoded frames contains a rectangular region frame packing arrangement of corresponding planes of two constituent frames as illustrated in Figure D-12 (missing).

Switching from one type of frame packing to another or switching from frame packed (3D) to single view coding (2D) is referred to as "format switch". A format switch is generally performed at the start of a new CVS (beginning with a BLA or IDR picture) but it is currently not prohibited to perform format switching within CVSs.

Frame packing type 5 represents temporal interleaving. With this frame packing type every second picture belongs to each view, e.g. picture 0, 2, 4, 6, etc. belongs to view A and picture 1, 3, 5, 7 etc. belongs to view B.
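The alternation described for frame packing type 5 can be sketched as follows. This is only an illustration of the even/odd assignment given in the example above; the function names are hypothetical and pictures are represented simply by their position in output order.

```python
def view_of_picture(output_index: int) -> str:
    """Return "A" for even output positions and "B" for odd ones."""
    if output_index < 0:
        raise ValueError("output_index must be non-negative")
    return "A" if output_index % 2 == 0 else "B"


def split_views(pictures: list) -> tuple:
    """Split a list given in output order into (view A, view B)."""
    return pictures[0::2], pictures[1::2]
```

For eight pictures numbered 0 to 7, `split_views` places 0, 2, 4, 6 in view A and 1, 3, 5, 7 in view B.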

SUMMARY

An object is to improve coding of HEVC compliant bitstreams, in particular encoding/decoding of frame packing arrangement type 5.

According to a first aspect, the object is achieved by a method, performed by a decoder, for managing multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence. The decoder obtains an identity associated with a NAL unit of the coded video sequence. The identity relates to a temporal layer of the NAL unit. The decoder discards the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence.

According to a second aspect, the object is achieved by a decoder configured to manage multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence. The decoder comprises a processing circuit configured to obtain an identity associated with a NAL unit of the coded video sequence. The identity relates to a temporal layer of the NAL unit. Furthermore, the processing circuit is configured to discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence.

According to a third aspect, the object is achieved by a method, performed by an encoder, for encoding multiple views of a video sequence into a coded video sequence. The multiple views comprise a first view and a second view. The encoder encodes all pictures belonging to the first view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value. The encoder encodes all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value.

According to a fourth aspect, the object is achieved by an encoder configured to encode multiple views of a video sequence into a coded video sequence. The multiple views comprise a first view and a second view. The encoder comprises a processing circuit configured to encode all pictures belonging to the first view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value. Furthermore, the processing circuit is configured to encode all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value.

According to a fifth aspect, the object is achieved by a computer program, comprising computer readable code units which when executed on a decoder causes the decoder to perform the methods in the decoder described herein.

According to a sixth aspect, the object is achieved by a computer program product, comprising a computer readable medium and a computer program as described herein stored on the computer readable medium.

According to a seventh aspect, the object is achieved by a computer program, comprising computer readable code units which when executed on an encoder causes the encoder to perform the methods in the encoder described herein.

According to an eighth aspect, the object is achieved by a computer program product, comprising a computer readable medium and a computer program as described herein stored on the computer readable medium.

According to a ninth aspect, the object is achieved by a network node configured to manage multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence. The network node is configured to obtain an identity associated with a NAL unit of the coded video sequence. The identity relates to a temporal layer of the NAL unit. Moreover, the network node is configured to discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence. Furthermore, the network node is configured to forward the NAL unit when the identity is below the threshold value.

The identity, or the temporal identity, is used to separate different views in a temporally interleaved frame packing arrangement in which the different views belong to different temporal layers. In this manner, a view indicated by a higher temporal layer is allowed to be discarded, i.e. removed from a bitstream comprising the coded video sequence, while a portion of the bitstream, i.e. the portion that is kept, still may be decoded by the decoder.

An advantage with at least one embodiment is that a possibility is provided to selectively decode a single view in the case that the bitstream contains two views but the decoder is only capable of displaying one view. The decoder may thus be capable of displaying a two dimensional view.

BRIEF DESCRIPTION OF THE DRAWINGS

The various aspects of embodiments disclosed herein, including particular features and advantages thereof, will be readily understood from the following detailed description and the accompanying drawings, in which:

Figure 1 is a schematic overview of an exemplifying radio communication system in which embodiments herein may be implemented,

Figure 2 is a schematic, combined signaling scheme and flowchart illustrating embodiments of the methods when performed in the radio communication system according to Figure 1,

Figure 3 is a flowchart illustrating embodiments of the method in the decoder,

Figure 4 is a block diagram illustrating embodiments of the decoder,

Figure 5 is another block diagram illustrating embodiments of the decoder,

Figure 6 is a further block diagram illustrating embodiments of the decoder,

Figure 7 is yet another block diagram illustrating embodiments of the decoder,

Figure 8 is a flowchart illustrating embodiments of the method in the encoder,

Figure 9 is another block diagram illustrating embodiments of the encoder,

Figure 10 is a further block diagram illustrating embodiments of the encoder,

Figure 11 is yet another block diagram illustrating embodiments of the encoder, and

Figures 12-22 are block diagrams illustrating exemplifying frame packing arrangements according to prior art.

DETAILED DESCRIPTION

Throughout the following description similar reference numerals have been used to denote similar elements, units, modules, circuits, nodes, parts, items or features, when applicable. In the Figures, features that appear in some embodiments may be indicated by dashed lines.

Figure 1 depicts a scenario in which embodiments herein are implemented in an exemplifying communications system, such as a radio communications system 100. In this example, the radio communications system 100 is a Long Term Evolution (LTE) system. In other examples, the radio communication system may be any 3GPP cellular communication system, such as a Wideband Code Division Multiple Access (WCDMA) network, a Global System for Mobile communication (GSM) network or the like. In yet further examples, the communication system may be a fixed network capable of broadcasting Internet Protocol Television (IP TV).

In other scenarios, the decoder 50 and/or the encoder 80 may be comprised in television set-top-boxes, video players/recorders, such as video cameras, Blu-ray players, Digital Versatile Disc(DVD)-players, media centers, media players and the like.

The radio communication system 100 comprises a network node 110. As used herein, the term "network node" may refer to an evolved Node B (eNB), a control node controlling one or more Remote Radio Units (RRUs), a radio base station, an access point or the like.

Furthermore, a decoder 50 and an encoder 80 are shown. In one example, the network node 110 may comprise the decoder 50. In other examples, a user equipment 120 may comprise the decoder 50.

As used herein, the term "user equipment" may refer to a mobile phone, a cellular phone, a Personal Digital Assistant (PDA) equipped with radio communication capabilities, a smartphone, a laptop or personal computer (PC) equipped with an internal or external mobile broadband modem, a tablet PC with radio communication capabilities, a portable electronic radio communication device, a sensor device equipped with radio communication capabilities or the like. The sensor may be any kind of weather sensor, such as wind, temperature, air pressure, humidity etc. As further examples, the sensor may be a light sensor, an electronic switch, a microphone, a loudspeaker, a camera sensor etc.

The decoder 50 and the encoder 80 may send, via wired or wireless connections 131, 132, video data, such as coded video sequences, between each other.

Figure 2 illustrates exemplifying methods in the decoder 50 and the encoder 80, which are shown in Figure 1. Thus, the decoder 50 performs a method for managing multiple views of a coded video sequence. In more detail, the decoder 50 may, as described below, extract and discard one view out of multiple views of the coded video sequence. The encoder 80 performs a method for encoding multiple views of a video sequence into the coded video sequence.

The coded video sequence may be a HEVC compliant coded video sequence, such as a coded video sequence according to frame packing arrangement type 5. The coded video sequence may contain two views, i.e. a right and left view of a stereoscopic representation.

For the decoding, the multiple views are temporally interleaved with respect to output order of the coded video sequence. For the encoding, the multiple views comprise a first view and a second view.

In the following description of the actions in the encoder 80 and the decoder 50, the action performed in the encoder 80 will be described first for reasons of simplicity.

Hence, one or more of the following actions may be performed in any suitable order.

In order to separate the first view and the second view from each other, the encoder 80 performs actions 201 and 202 below. The actions may be performed in parallel or in reversed order as compared to what is described below. Typically, actions 201 and 202 are performed in an interleaved fashion such that one picture from a view A is encoded, then the corresponding picture from a view B, then the next picture from view A, and so on.

Action 201

The encoder 80 encodes all pictures belonging to the first view into one or more NAL units of the coded video sequence, where each NAL unit may not comprise more than one picture. Each NAL unit has a respective temporal identity equal to or less than a first value. Moreover, the coded video sequence may comprise at least one picture with its respective temporal identity being equal to zero.

Continuing with the example with the view A and the view B, a typical case may be that the view A comprises pictures encoded with their respective temporal identities equal to zero, e.g. TemporalId 0, and that the view B comprises pictures encoded with their respective temporal identities equal to one, e.g. TemporalId 1.

Action 202

The encoder 80 encodes all pictures belonging to the second view into one or more NAL units of the coded video sequence. Each NAL unit has a respective temporal identity greater than the first value. Because the second view is encoded into at least one other temporal layer which has a temporal identity greater than the first value, it will be possible to drop, or discard, the second view when the coded video sequence is decoded in the decoder 50 as described in actions 207 and 208 below.

Alternatively, action 201 may equally well define the respective temporal identity to be strictly less than the first value if action 202 defines the respective temporal identity to be greater than or equal to the first value. In this case, the first value may be equal to or greater than one, since the temporal identity may not be less than zero. Typically, the temporal identity is an integer number.

Actions 201 and 202 imply that the coded video sequence may be comprised in a bitstream, e.g. generated by the encoder 80. The bitstream may comprise one or more NAL units. As an example, the bitstream may comprise a plurality of NAL units in order for the bitstream to include a plurality of pictures.

Action 203

The encoder 80 may set a threshold value for separating the multiple views of the coded video sequence. The threshold value may be set to the first value increased by one.

When action 203 is omitted, the threshold value may be predetermined. However, the encoder 80 may set the threshold value even when a predetermined threshold value exists. In these cases, the threshold value set by the encoder 80 may override the predetermined threshold value.
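Actions 201 to 203 can be sketched as below for the typical two-identity case. This is an illustrative model only: no actual video encoding is performed, the `CodedPicture` type and `encode_views` function are hypothetical names, and the upper limit of 7 on the temporal identity is the HEVC maximum mentioned elsewhere in this description.

```python
from dataclasses import dataclass

MAX_TEMPORAL_ID = 7  # maximum temporal layer value for HEVC


@dataclass(frozen=True)
class CodedPicture:
    output_index: int
    view: str
    temporal_id: int


def encode_views(num_pictures: int, first_value: int = 0):
    """Tag temporally interleaved pictures and derive the threshold value."""
    if not 0 <= first_value < MAX_TEMPORAL_ID:
        raise ValueError("first_value must leave room for a higher identity")
    coded = []
    for i in range(num_pictures):
        if i % 2 == 0:
            # Action 201: first view, identity equal to or less than first_value.
            coded.append(CodedPicture(i, "A", first_value))
        else:
            # Action 202: second view, identity greater than first_value.
            coded.append(CodedPicture(i, "B", first_value + 1))
    # Action 203: threshold set to the first value increased by one.
    threshold = first_value + 1
    return coded, threshold
```

With the default `first_value` of 0, view A is tagged with identity 0, view B with identity 1, and the threshold becomes 1.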

Action 204

Preferably, when action 203 has been performed, the encoder 80 may encode the threshold value into the bitstream. In this manner, the decoder 50 may decode the threshold value from the bitstream in action 205 below.

Now turning to the actions performed by the decoder 50, as indicated by the action holder shown in Figure 2, the actions below may follow after any one of actions 202, 203 and 204.

Action 205

When the encoder 80 has performed action 204, the decoder 50 may decode the threshold value from the bitstream.

In other examples, the threshold value may be predetermined. Advantageously, the bitstream need not include the threshold value. In this way, a size of the bitstream may be reduced as compared to when the threshold value is included in the bitstream.

Action 206

Even if the threshold is not predefined or the encoder 80 did not perform action 204, the decoder 50 may select the threshold value by deducing the threshold value from the coded video sequence. This means that the decoder 50 may, as an example, detect by analysing temporal identities of decoded NAL units that e.g. only two temporal identities are used in the bitstream. The decoder 50 may then deduce that the threshold value should be selected to have a value between these two temporal identities.
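One possible reading of this deduction is sketched below. It assumes the "above or equal" discard rule of action 208, under which any integer threshold greater than the lower identity and not greater than the higher identity separates the two; the sketch simply returns the higher identity. The function name is hypothetical, and returning `None` when the bitstream does not use exactly two identities is a design choice of the sketch, not something the description prescribes.

```python
from typing import Iterable, Optional


def deduce_threshold(observed_temporal_ids: Iterable[int]) -> Optional[int]:
    """Pick a threshold separating two observed temporal identities.

    Returns None unless exactly two distinct identities were observed.
    """
    distinct = sorted(set(observed_temporal_ids))
    if len(distinct) != 2:
        return None
    low, high = distinct
    # With "discard when identity >= threshold", high keeps low and drops high.
    return high
```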

Action 207

In order to be able to use an identity in e.g. actions 208 and 209, the decoder 50 obtains the identity associated with the NAL unit of the coded video sequence. The identity may thus be obtained from the bitstream including the coded video sequence. The identity, or temporal identity, relates to a temporal layer of the NAL unit. The coded video sequence may comprise the temporal layer and at least one further temporal layer.

As an example, the decoder 50 may decode at least a portion of the NAL unit of the coded video sequence to obtain the identity.

As another example, the decoder 50 may obtain the identity from a system layer information field in the bitstream. The identity may be copied from the NAL unit to a system layer information field, such as in a so called HEVC descriptor in an MPEG-2 transport system.

The identity is typically copied to the system layer information field directly after the NAL unit has been encoded, when the NAL unit is packetized into the system layer. Alternatively, if the content, i.e. the bitstream including video data, is encoded off-line and stored in a file, the identity may be copied to the system layer once the encoded file is read and reformatted to be sent out, e.g. as a broadcast bitstream.

Action 208

In case e.g. a display device (not shown), at which the coded video sequence is to be displayed, does not support multiple views, the decoder 50 discards the NAL unit when the identity exceeds, e.g. is above or equal to, the threshold value.

Action 209

For the NAL unit that should be displayed at the display device, the decoder 50 may decode the NAL unit when the identity is below the threshold value.

Similarly to actions 201 and 202, actions 208 and 209 may equally well be defined such that the decoder 50 discards the NAL unit when the identity is greater than the threshold value and, for those NAL units that should be displayed, decodes the NAL unit when the identity is below or equal to the threshold value.
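Since the temporal identity is typically an integer, the two formulations differ only in where the threshold is placed. The short sketch below, using hypothetical function names, illustrates that discarding on "above or equal to T" selects the same NAL units as discarding on "greater than T - 1".

```python
def discard_above_or_equal(identity: int, threshold: int) -> bool:
    """Actions 208/209 as first described: discard when identity >= threshold."""
    return identity >= threshold


def discard_greater_than(identity: int, threshold: int) -> bool:
    """Alternative definition: discard when identity > threshold."""
    return identity > threshold


def same_decision(identity: int, threshold: int) -> bool:
    """For integers, ">= T" and "> T - 1" select the same NAL units."""
    return discard_above_or_equal(identity, threshold) == discard_greater_than(
        identity, threshold - 1
    )
```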

Action 210

In examples when the decoder 50 is comprised in the network node 110, the network node 110 may forward the NAL unit when the identity is below the threshold value. In this manner, the size of the bitstream to user equipment 120 may become smaller. As a result, bandwidth requirements for a connection thereto may be less demanding. In this example, the decoder 50 only interprets the NAL unit header in order to obtain the identity. Thus, the decoder 50 may be said to be only a partial implementation of a complete decoder that is able to decode pictures to be output, e.g. output to a display screen.
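A network node of this kind could be sketched as a simple filter over NAL units that inspects only the header. As an assumption taken from the published HEVC specification rather than from this text, the sketch reads nuh_temporal_id_plus1 from the three least significant bits of the second header byte; the function names are hypothetical.

```python
from typing import Iterable, Iterator


def temporal_id_of(nal_unit: bytes) -> int:
    """Derive the temporal identity from the assumed 2-byte NAL unit header."""
    if len(nal_unit) < 2:
        raise ValueError("expected at least a two-byte NAL unit header")
    return (nal_unit[1] & 0x07) - 1


def forward_single_view(nal_units: Iterable[bytes], threshold: int) -> Iterator[bytes]:
    """Yield only NAL units whose identity is below the threshold.

    NAL units at or above the threshold are dropped without parsing
    their payload, as in actions 208 and 210.
    """
    for nal_unit in nal_units:
        if temporal_id_of(nal_unit) < threshold:
            yield nal_unit
```

Because the filter is a generator, NAL units can be forwarded one at a time as they arrive rather than after the whole bitstream has been buffered.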

In some first embodiments, the frame packing arrangement is frame_packing_arrangement_type 5. In some second embodiments, the frame packing arrangement may be any other frame packing arrangement but not frame_packing_arrangement_type 5.

The embodiments herein, and in particular the first and second embodiments, apply to video sequences that comprise two representations, hereafter called view A and view B. They need not represent the two views of a stereo sequence; the second "view" can alternatively, for example, represent additional chroma information of the first view or depth information of the first view. The two representations are encoded into the same bitstream using a frame packing arrangement.

The first embodiments relate to the case where temporal interleaving is used. In the first embodiment all pictures that belong to view A (a first view) have TemporalId lower than X and all pictures that belong to view B (a second view) have TemporalId higher than or equal to X, where X is a value between 1 and the max value for temporal layers (7 for HEVC), inclusive. An alternative of the first embodiment is that all pictures that belong to view A have TemporalId lower than or equal to X and all pictures that belong to view B have TemporalId higher than X, where X is a value between 0 and the max value for temporal layers (7 for HEVC) minus 1.
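The two conventions and their permitted ranges for X can be written out as a small sketch. The function names and the boolean flag `b_includes_x` are hypothetical; the flag is true for the first convention (view B has TemporalId higher than or equal to X) and false for the alternative (view B has TemporalId higher than X).

```python
MAX_TEMPORAL_ID = 7  # max value for temporal layers in HEVC


def valid_x(x: int, b_includes_x: bool) -> bool:
    """Check that X lies in the range allowed for the chosen convention."""
    if b_includes_x:
        # View A: TemporalId < X, view B: TemporalId >= X, so 1 <= X <= 7.
        return 1 <= x <= MAX_TEMPORAL_ID
    # View A: TemporalId <= X, view B: TemporalId > X, so 0 <= X <= 6.
    return 0 <= x <= MAX_TEMPORAL_ID - 1


def view_for(temporal_id: int, x: int, b_includes_x: bool) -> str:
    """Map a TemporalId to view "A" or "B" under the chosen convention."""
    if not valid_x(x, b_includes_x):
        raise ValueError("X is outside the range allowed for this convention")
    if b_includes_x:
        return "A" if temporal_id < x else "B"
    return "A" if temporal_id <= x else "B"
```

Both ranges guarantee that at least one TemporalId value remains available for each view.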

In one version of the embodiment the value X is predetermined in the standard.

In one version of the embodiment the value X is signaled in the bitstream, for example in the VUI part of the SPS. Alternatively the value X is signaled in PPS, an SEI message or in some other data structure.

In one version of the embodiment the value X is provided by external means, not specified in the video coding standard.

In one version of the embodiment the value X is selected by the encoder.

In one version of the embodiment the value X is not signaled in the bitstream. A decoder might use the fact that the two views are separated with TemporalIds opportunistically, in a best effort fashion or by deducing it from other syntax elements in the bitstream such as the slice header and/or FPA SEI messages.

In one version of the embodiment all pictures that belong to view A have TemporalId lower than X and all pictures that belong to view B have TemporalId equal to X, where X is a value between 1 and the max value for temporal layers (7 for HEVC), inclusive.

A decoder can be configured to use the embodiment according to the following steps.

1. If the decoder is in a system that supports display of stereoscopic video, i.e. both views are to be rendered, the decoder is configured to decode all pictures and to deliver them as output.

2. Otherwise (the decoder is in a system that only supports display of single view (2D) video) the decoder is configured to discard all NAL units that belong to one of the views such as the second view, e.g. all NAL units that belong to temporal layers higher than or equal to X.

Thus, the decoder comprises one or more units that are configured to determine which NAL units belong to the respective views by checking the identity of the temporal layer of the NAL units. Accordingly, the decoder is configured to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to decode (and display) a single view or both views, depending on the capabilities of the decoding system.

Preferably the decoder is provided by external means with the information regarding if only a single view is to be output or if both views are to be output. Alternatively the decoder itself decides whether it is appropriate to output one view or both views.

Alternatively the decoder is a part of a system that is configured to use the embodiment according to the following steps.

1. The decoder is configured to decode all NAL units and to output all pictures together with the information of the TemporalId of each picture.

2. A separate process applied to the pictures that have been output from the decoder discards the pictures belonging to the second view wherein the pictures are e.g. associated with TemporalId higher than or equal to X.

The decoder can be implemented in a rendering device.

In an example, an encoder can be configured to use the embodiment according to the following steps.

1. All pictures that belong to view A are encoded with TemporalId 0.

2. All pictures that belong to view B are encoded with TemporalId 1.

3. The value of X is set to, and signaled (in the case it is signaled) as 1.

Alternatively, an encoder comprises one or more units configured to perform the following steps.

1. All pictures that belong to one of the views, e.g. view A, are encoded with TemporalIds as selected by the encoder for example in order to support one or more levels of temporal scalability or in order to support trick-play. The highest value selected for TemporalId of any picture in view A is equal to Y where Y is less than the maximum value for TemporalId (7 in HEVC).

2. All pictures that belong to another view, e.g. view B, are encoded with TemporalId equal to Y+1.

3. The value of X is set to, and signaled (in the case it is signaled) as Y+1.

Alternatively, an encoder comprises one or more units configured to perform the following steps.

1. All pictures that belong to view A are encoded with TemporalIds as selected by the encoder, for example in order to support one or more levels of temporal scalability or in order to support trick-play. The highest value selected for TemporalId of any picture in view A is equal to Y, where Y is less than the maximum value for TemporalId (7 in HEVC).

2. All pictures that belong to view B are encoded with TemporalId as selected by the encoder, for example in order to support one or more levels of temporal scalability or in order to support trick-play. The values selected for TemporalId of pictures in view B are in the range from Y+1 to the maximum value for TemporalId (7 in HEVC), inclusive.

3. The value of X is set to, and signaled (in the case it is signaled) as Y+1.
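The steps above can be sketched as a mapping from a per-view temporal layer to the temporal identity carried in the bitstream. This is only one possible realization: the function names, the `layer_in_view` parameter and the choice to place the lowest layer of view B at Y+1 are assumptions made for this example. The maximum is a parameter whose default of 7 follows the text; note that a three-bit nuh_temporal_id_plus1 header field can represent a TemporalId of at most 6.

```python
def global_temporal_id(view: str, layer_in_view: int, y: int,
                       max_temporal_id: int = 7) -> int:
    """Map a temporal layer within a view to the TemporalId in the bitstream.

    View A: layers 0..Y keep their value, giving TemporalId 0..Y.
    View B: layer k is shifted to Y+1+k (assumed layout), giving
            TemporalId Y+1..max_temporal_id.
    """
    if not 0 <= y < max_temporal_id:
        raise ValueError("Y must be less than the maximum TemporalId")
    if layer_in_view < 0:
        raise ValueError("temporal layer must be non-negative")
    if view == "A":
        temporal_id = layer_in_view
        if temporal_id > y:
            raise ValueError("view A must not use a TemporalId above Y")
    elif view == "B":
        temporal_id = y + 1 + layer_in_view
        if temporal_id > max_temporal_id:
            raise ValueError("view B TemporalId exceeds the maximum")
    else:
        raise ValueError("unknown view label")
    return temporal_id


def threshold_x(y: int) -> int:
    """The value of X that is set (and signaled, in the case it is signaled)."""
    return y + 1
```

With Y equal to 2, view A may use TemporalId 0 to 2, view B starts at TemporalId 3, and X is 3, so a unit that discards every NAL unit with TemporalId higher than or equal to X removes exactly view B.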

The encoder can be implemented in a video camera or in a rendering device.

A unit that operates on a bitstream (such as a network node) can be configured to perform the following steps.

1. If the bitstream is sent (forwarded) to a decoder in a system that supports display of stereoscopic video, i.e. both views are to be rendered; the unit sends (forwards) all pictures (all NAL units).

2. Otherwise (the bitstream is sent [forwarded] to a decoder in a system that only supports display of single view [2D] video) the unit only sends (forwards) the NAL units that belong to temporal layers lower than X.

Additional versions of the embodiments include all combinations of the above described versions.

The first embodiments may be extended to handle three or more views, with one view for each value of the TemporalId.

Now proceeding with the second embodiments, there may be an indication in the bitstream (video layer) of the number of views that the sequence contains, e.g., whether the sequence contains two views or one view. Such an indication can be realized by a flag in the sequence parameter set, e.g. in the VUI section, with two modes; one that indicates that the coded video sequence consists of two views and one that indicates that the coded video sequence consists of one view.

This indication is preferably signaled in the SPS (Sequence Parameter Set), e.g. in the VUI part of the SPS, and is valid for an entire CVS. (The VUI is a number of optional syntax elements in the SPS, sometimes referred to as VUI parameters. One syntax element, namely the flag vui_parameters_present_flag, is indicative of the presence of VUI parameters. The VUI parameters are suitable for frame packing syntax since they are not mandatory for decoding but are used for displaying the picture correctly.)

Having the frame packing information coupled with the activation of a parameter set is very useful in scenarios where stereo content and 2D content are mixed in the same stream, as it is not reliable to depend on the presence of SEI messages to deduce whether the pictures represent stereo coding or 2D coding. Having this indication in the SPS means that there is no risk that the format is unknown when a new SPS is activated since the information about the format is included in the SPS.

Alternatively the indication is realized by a parameter with three different modes:

• one that indicates that the coded video sequence consists of two views,

• one that indicates that the coded video sequence consists of one view,

• one that indicates that it is unknown whether the video sequence consists of two views or one view, or that the video sequence consists of a mix of one view and two views.

In one version of the embodiment there is an indication in the system layer, such as in the HEVC descriptor in the MPEG-2 transport system, that indicates frame packing information according to any of the two schemes described above.

In one version of the embodiment the indication exists both in the bitstream (video layer) and in the system layer and the indication in the system layer is required to have the same value as the one in the bitstream layer.

In another version of the embodiment the indication exists both in the bitstream (video layer) and in the system layer and the indication in the system layer is defined such that

• it can only indicate the two view case if all CVSs in the video layer indicate the two view case,

• it can only indicate the one view case if all CVSs in the video layer indicate the one view case,

• if there is a mix of one view CVSs and two view CVSs in the video layer (or if the indications in the video layers are unknown) the system layer is required to signal the mix of one view and two view case (or unknown).
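The rule in the list above can be sketched as a function that derives the system layer indication from the indications of the individual CVSs. The `ViewIndication` enumeration and the `system_layer_indication` function are hypothetical names introduced for this example, and treating an empty input as unknown is an assumption.

```python
from enum import Enum
from typing import Iterable


class ViewIndication(Enum):
    ONE_VIEW = "one view"
    TWO_VIEWS = "two views"
    MIXED_OR_UNKNOWN = "mix of one view and two views, or unknown"


def system_layer_indication(
        cvs_indications: Iterable[ViewIndication]) -> ViewIndication:
    """Derive the system layer value from the per-CVS video layer values."""
    values = set(cvs_indications)
    if values == {ViewIndication.TWO_VIEWS}:
        # All CVSs indicate the two view case.
        return ViewIndication.TWO_VIEWS
    if values == {ViewIndication.ONE_VIEW}:
        # All CVSs indicate the one view case.
        return ViewIndication.ONE_VIEW
    # A mix of cases, an unknown CVS, or no CVS at all (assumption).
    return ViewIndication.MIXED_OR_UNKNOWN
```

A stream in which every CVS signals two views yields the two view indication, whereas a single one view CVS or unknown CVS among them forces the mixed or unknown value.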

The second embodiment can be implemented by an encoder, wherein the encoder comprises one or more units configured to determine the number of views, to set the indicator according to the number of views and to send the indicator to the decoder.

Further, the second embodiment can be implemented by a decoder, wherein the decoder comprises one or more units configured to receive the indicator and to interpret the indicator to determine the number of views.

In a further version of the embodiment, more frame packing information may be signaled in the VUI, such as frame_packing_arrangement_type and content_interpretation_type as specified in the annex. Also, a syntax element temporal_constituent_mode may be signaled, preferably in (but herein not restricted to) the VUI. If temporal interleaving is used, this mode indicates whether odd POC values indicate that the corresponding picture is a constituent frame 0 or a constituent frame 1. The parity of the POC thus indicates to the decoder which view the corresponding picture belongs to. Alternatively, temporal_constituent_mode may directly indicate the views to which pictures with even and odd POC values, respectively, belong. The display time of the outputted picture pair may be signaled to always use the time of view 0 or view 1, by a display_time_mode syntax element. Alternatively, it may be defined to always use the time of view 0 or view 1 without any signaling. The presence of syntax elements frame_packing_arrangement_type, content_interpretation_type, temporal_constituent_mode and display_time_mode may be conditioned on the mode "that indicates that the coded video sequence consists of two views" as described above.
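The use of POC parity can be sketched as below. The text does not define the numeric values of temporal_constituent_mode, so the convention used here (mode 0: odd POC values belong to constituent frame 1; mode 1: odd POC values belong to constituent frame 0) is purely an assumption for illustration, as is the function name.

```python
def constituent_frame(poc: int, temporal_constituent_mode: int) -> int:
    """Return 0 or 1, the constituent frame (view) a picture belongs to.

    Assumed convention (hypothetical, not taken from the text):
      mode 0 -> odd POC values are constituent frame 1, even are frame 0
      mode 1 -> odd POC values are constituent frame 0, even are frame 1
    """
    if temporal_constituent_mode not in (0, 1):
        raise ValueError("temporal_constituent_mode must be 0 or 1")
    parity = poc & 1  # 1 for odd POC values, 0 for even ones
    return parity ^ temporal_constituent_mode
```

Under this convention a decoder could assign each output picture to a view from its POC value alone, without inspecting any SEI message.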

The decoder needs to know whether the received encoded representation is 2D or 3D in order to be able to display the picture correctly. The indication can be sent in VUI as described above, since VUI parameters are never discarded and if VUI belongs to the SPS, there is no risk that the decoder is unaware of the number of views when a new CVS arrives. As soon as a new SPS is activated the decoder obtains knowledge of the number of views.

In view of the second embodiments, the following problems may have been identified and solved. When a bitstream that consists of two frame packed views is sent to a decoder that is only capable of displaying one view video, it is not possible to extract, i.e. remove, one view from the bitstream and still know that the remaining bitstream is compliant with the HEVC standard. Illustratively, the decoder may be comprised in a system that is capable of displaying a two dimensional video, i.e. only one view of the two frame packed views sent to the decoder can be displayed.

Another problem is that when there is a format switch within a bitstream, at a Random Access Point (RAP), the switch cannot be detected unambiguously, since there is no information contained in the RAP picture, the PPS or the SPS that indicates that a format switch has taken place. Instead, prior art solutions depend on the presence or absence of FPA SEI messages. This is not robust to packet losses since there is no guarantee that the SEI message is received when the new SPS is activated.

Therefore, the second embodiments also improve the signaling to be aligned with activation of SPS, i.e. the start of a new coded video sequence.

To conclude, in the existing HEVC standard, there are no means to indicate that the second view can be dropped and the remaining bitstream is a compliant bitstream. Further, the HEVC standard does not provide a clear and robust way of signaling whether the coded video represents a single view or two views packed in the same stream using a frame packing arrangement. Thus, the embodiments relate to methods and arrangements for managing multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

Therefore, according to one embodiment, methods and arrangements for using TemporalId to separate two views encoded in the same bitstream, using frame packing arrangement, are provided, such that one view can be discarded by a network node or a decoder that only delivers one view.

According to another embodiment, methods and arrangements are provided to indicate in the bitstream (video layer) the number of views that the coded video sequence contains, e.g., whether the coded video sequence contains two views or one view.

An aspect of the embodiments relates to a method of decoding an encoded representation of a picture. The method comprises managing multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

A related aspect of the embodiments defines a decoder for decoding an encoded representation of a picture. The decoder comprises one or more units configured to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

A further aspect of the embodiment relates to a method of encoding a picture. The method comprises managing multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

A related further aspect of the embodiments defines an encoder for encoding a picture. The encoder comprises one or more units configured to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

An additional aspect of the embodiments relates to a receiver comprising a decoder for decoding an encoded representation of a picture.

Yet another additional aspect of the embodiments relates to a transmitter comprising an encoder for encoding a picture.

The encoder and/or the decoder can be implemented in a device such as a video camera or a rendering device.

A further aspect of the embodiment relates to a computer program for decoding an encoded representation of a picture. The computer program comprises code means which when run by a processor causes the processor to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

Yet a further related aspect of the embodiments defines a computer program for encoding a picture. The computer program comprises code means which when run by a processor causes the processor to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

Another additional aspect of the embodiments relates to a computer program product comprising computer readable medium and a computer program stored on the computer readable medium.

In Figure 3, an exemplifying, schematic flowchart of the method in the decoder 50 is shown. As mentioned, the decoder 50 performs a method for managing multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence.

As mentioned, the coded video sequence may be a HEVC compliant coded video sequence. The coded video sequence may contain two views.

In a start state 300, the decoder 50 may receive a bitstream from e.g. a memory or another device connected thereto, via a wired or wireless connection.

One or more of the following actions may be performed in any suitable order.

Action 301

The decoder 50 may decode the threshold value from a bitstream, comprising the coded video sequence. The bitstream may be configured according to frame packing arrangement type 5. The threshold value may be predetermined. This action is similar to action 205.

Action 302

The decoder 50 may select the threshold value by deducing the threshold value from the coded video sequence. This action is similar to action 206.

Action 303

The decoder 50 obtains an identity associated with a NAL unit of the coded video sequence. The identity relates to a temporal layer of the NAL unit. The coded video sequence may comprise the temporal layer and at least one further temporal layer.

The coded video sequence may comprise at least said NAL unit, and wherein the coded video sequence may be comprised in the bitstream or wherein the coded video sequence may be comprised in a bitstream. This action is similar to action 207.

Action 304

The decoder 50 discards the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence. This action is similar to action 208.

Action 305

The decoder 50 may decode the NAL unit when the identity is below the threshold value. This action is similar to action 209.

Action 306

The decoder 50 may forward the NAL unit when the identity is below the threshold value. This action is similar to action 210.

In an end state 307, the decoder 50 may be ready to receive another coded video sequence.

Figure 4 illustrates a simplified and schematic example of a decoder 401 according to embodiments herein to get an overview before giving more detailed information with reference to Figures 5-7. In the block diagram, it is illustrated that the decoder may receive the bitstream. As an example, the decoder may comprise an in unit 402 and a decoding unit 403.

Now in more detail, with reference to Figure 5, there is illustrated a decoder 50 configured to manage multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence. The coded video sequence may contain two views. The coded video sequence may be a HEVC compliant coded video sequence.

The decoder 50 comprises a processing circuit 52, i.e. "processor" in Figure 5, configured to obtain an identity associated with a NAL unit associated with the coded video sequence. The identity relates to a temporal layer of the NAL unit. The coded video sequence may comprise the temporal layer and at least one further temporal layer.

The processing circuit 52 is further configured to discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence. The threshold value may be predetermined. The processing circuit 52 may be configured to decode the NAL unit when the identity is below the threshold value.

The processing circuit 52 may be configured to decode the threshold value from a bitstream, comprising the coded video sequence. The bitstream may be configured according to frame packing arrangement type 5.

The coded video sequence may comprise at least said NAL unit, and wherein the coded video sequence may be comprised in the bitstream, or wherein the coded video sequence may be comprised in a bitstream.

The processing circuit 52 may be configured to select the threshold value by deducing the threshold value from the coded video sequence.

The decoder 50 described herein could be implemented e.g. by one or more of the processor circuit 52 and adequate software with suitable storage or memory 54 therefore, a programmable logic device (PLD) or other electronic component(s) as shown in Figure 5. In addition, the decoder 50 preferably comprises an input or input unit 51 configured to receive the encoded representations of pictures, such as in the form of NAL units. A corresponding output or output unit 53 is configured to output the decoded pictures.

Typically the reference picture buffer is an integrated part of the decoder 50. The memory 54 may contain the reference picture buffer plus other things needed for decoding.

Figure 5 also illustrates a computer program 55, comprising computer readable code units which when executed on the decoder 50 causes the decoder 50 to perform the methods described in conjunction with Figure 2 and/or 3.

Finally, Figure 5 shows a computer program product 56, comprising a computer readable medium 57 and a computer program 55 as described above stored on the computer readable medium 57.

The computer readable medium may be a memory, a Universal Serial Bus (USB) memory, a DVD-disc, a Blu-ray disc, a software module that is received as a stream of data, a Flash memory, a hard drive etc.

Figure 6 is a schematic block diagram of a decoder 40 according to the embodiments.

The decoder 40 comprises a view determining unit 41, configured to determine which view a picture belongs to based on the temporal id according to the first embodiment. The decoder also comprises a discard unit 43 configured to discard pictures based on the determination of which view the picture belongs to and if the decoder has a limitation of the number of views that it can handle. The decoder 40 also comprises a processing unit 42, also denoted processor or processing means or module, configured to cause the decoder 40 to manage multiple views for frame packed pictures of a coded video sequence such that it is well defined and easy to extract the information needed to extract (and display) a single view or both views, depending on the capabilities of the decoding system.

Further the decoder 40 comprises an indicator interpretation unit 44 configured to interpret the indicator received according to the second embodiments. A decoding unit 46 of the decoder 40 is configured to decode a received encoded representation, e.g. a CVS comprised in a bitstream.

The decoder 40 of Figure 6 with its included units could be implemented in hardware. There are numerous variants of circuitry elements that can be used and combined to achieve the functions of the units of the decoder 40. Such variants are encompassed by the embodiments. Particular examples of hardware implementation of the decoder 40 are implementation in digital signal processor (DSP) hardware and integrated circuit technology, including both general-purpose electronic circuitry and application-specific (ASIC) circuitry.

An exemplifying decoder 32 can, for example, be located in a receiver 30, such as in a video camera, set-top-box or a display, e.g. in a mobile device as shown in Figure 7. The receiver 30 then comprises an input or input unit 31 configured to receive a coded bitstream, such as data packets of NAL units. The encoded representations of the NAL units are decoded by the decoder 32 as disclosed herein. The decoder 32 preferably comprises or is connected to a reference picture buffer 34 that temporarily stores already decoded pictures that are to be used as reference pictures for other pictures in the video stream. Decoded pictures are output from the receiver 30, such as from the reference picture buffer 34, by means of an output or output unit 33. These output pictures are sent to be displayed to a user on a screen or display of or connected, including wirelessly connected, to the receiver 30. The output pictures may also be stored on disk or transcoded without display.

In some examples, the receiver 30 may be the network node 110 configured to manage multiple views of a coded video sequence. The multiple views are temporally interleaved with respect to output order of the coded video sequence. The network node 110 is configured to decode at least a portion of a NAL unit of the coded video sequence to obtain an identity of the NAL unit. The identity relates to a temporal layer of the NAL unit.

The network node 110 is further configured to discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence. Moreover, the network node 110 is configured to forward the NAL unit when the identity is below the threshold value.
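Decoding only a portion of a NAL unit to obtain the identity can be sketched as reading the two-byte HEVC NAL unit header, whose last three bits carry nuh_temporal_id_plus1. The function names below are hypothetical, and the sketch assumes that each NAL unit is supplied as a separate byte string without a start code prefix.

```python
from typing import Iterable, Iterator


def temporal_id(nal_unit: bytes) -> int:
    """Read TemporalId from the two-byte HEVC NAL unit header.

    The three least significant bits of the second header byte hold
    nuh_temporal_id_plus1; TemporalId is that value minus 1.
    """
    if len(nal_unit) < 2:
        raise ValueError("NAL unit is shorter than its two-byte header")
    temporal_id_plus1 = nal_unit[1] & 0x07
    if temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 shall not be equal to 0")
    return temporal_id_plus1 - 1


def forward(nal_units: Iterable[bytes], threshold_x: int) -> Iterator[bytes]:
    """Yield the NAL units to forward; all other NAL units are discarded."""
    for nal_unit in nal_units:
        if temporal_id(nal_unit) < threshold_x:
            yield nal_unit
```

Because only the header is inspected, such a node would not need to decode any slice data in order to drop one of the views.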

In Figure 8, an exemplifying, schematic flowchart of the method in the encoder 80 is shown. As mentioned, the encoder 80 performs a method for encoding multiple views of a video sequence into a coded video sequence. The coded video sequence may be a HEVC compliant coded video sequence. The multiple views comprise a first view and a second view. The multiple views may contain the first and second views. The method may comprise encoding the coded video sequence according to frame packing arrangement type 5.

In a start state 800, the encoder may receive raw, i.e. not encoded, video data, e.g. from a memory or a video camera.

One or more of the following actions may be performed in any suitable order.

Action 801

The encoder 80 encodes all pictures belonging to the first view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value. This action is similar to action 201.

Action 802

The encoder 80 encodes all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value. This action is similar to action 202.

Action 803

The encoder 80 may set a threshold value for separating the multiple views to the first value increased by one. The threshold value may be predetermined. This action is similar to action 203.

Action 804

The encoder 80 may encode the threshold value into a bitstream comprising the coded video sequence.

The coded video sequence may comprise said one or more NAL units, and wherein the coded video sequence may be comprised in the bitstream, or wherein the coded video sequence may be comprised in a bitstream. This action is similar to action 204.

In an end state 805, the encoder may be ready to send the coded video sequence to another device, or to a memory for storage.

With reference to Figure 9, there is illustrated an encoder 80 configured to encode multiple views of a video sequence into a coded video sequence. As mentioned, the multiple views comprise a first view and a second view. The multiple views may contain the first and second views. The coded video sequence may be a HEVC compliant coded video sequence.

The encoder 80 comprises a processing circuit 82, denoted "processor" in Figure 9, configured to encode all pictures belonging to the first view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value.

The processing circuit 82 is further configured to encode all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value.

The processing circuit 82 may be configured to set a threshold value for separating the multiple views to the first value increased by one. The threshold value may be predetermined.

The processing circuit 82 may be configured to encode the threshold value into a bitstream comprising the coded video sequence.

The coded video sequence may comprise said one or more NAL units, and wherein the coded video sequence may be comprised in the bitstream, or wherein the coded video sequence may be comprised in a bitstream.

The processing circuit 82 may be configured to encode the coded video sequence according to frame packing arrangement type 5.

Figure 9 also illustrates a computer program 85, comprising computer readable code units which when executed on the encoder 80 causes the encoder 80 to perform the methods described in conjunction with Figure 2 and/or 8.

Finally, Figure 9 illustrates a computer program product 86, comprising a computer readable medium 87 and a computer program 85 as described above stored on the computer readable medium 87.

The computer readable medium may be a memory, a Universal Serial Bus (USB) memory, a DVD-disc, a Blu-ray disc, a software module that is received as a stream of data, a Flash memory, a hard drive etc.

The encoder 80 described herein could thus be implemented e.g. by one or more of a processor 82, or processing circuit, and adequate software with suitable storage or memory 84 therefore, a programmable logic device (PLD) or other electronic component(s) as shown in Figure 9 and/or 10. In addition, the encoder 80 preferably comprises an input or input unit 81 configured to receive the pictures of the video stream. A corresponding output or output unit 83 is configured to output the encoded representations of the slices, preferably in the form of NAL units.

Figure 10 is a schematic block diagram of an encoder 70, configured to encode a picture, according to an embodiment herein. The encoder 70 comprises a generating unit 71, configured to generate an indicator indicative of the number of views according to the second embodiment. One indicator is generated per CVS. It also comprises a determining unit 74 configured to determine to which view a picture belongs and a setting unit 75 configured to set the temporal Id accordingly, according to the first embodiment. An encoding unit 72 of the encoder 70 is configured to encode the picture into an encoded representation of the picture.

The encoder 70 of Figure 10 with its included units could be implemented in hardware. There are numerous variants of circuitry elements that can be used and combined to achieve the functions of the units of the encoder 70. Such variants are encompassed by the embodiments. Particular examples of hardware implementation of the encoder 70 are implementation in digital signal processor (DSP) hardware and integrated circuit technology, including both general-purpose electronic circuitry and application-specific circuitry.

Now referring to Figure 11, an encoder 62 can, for example, be located in a transmitter 60 in a video camera, e.g. in a mobile device. The transmitter 60 then comprises an input or input unit 61 configured to receive pictures of a video stream to be encoded. The pictures are encoded by the encoder 62 as disclosed herein. Encoded pictures are output from the transmitter 60 by an output or output unit 63 in the form of a coded bitstream, such as of NAL units or data packets carrying such NAL units.

Hence, the embodiments herein apply to a decoder, an encoder and an element that operates on a bitstream. The element may be a network-node or a Media Aware Network Element.

The embodiments are not limited to HEVC but may be applied to any extension of HEVC such as a scalable extension or multiview extension or to a different video codec. The embodiments are applicable to 2D and 3D video.

It is to be understood that the choice of interacting units or modules, as well as the naming of the units are only for exemplary purpose, and may be configured in a plurality of alternative ways in order to be able to execute the disclosed process actions.

It should also be noted that the units or modules described in this disclosure are to be regarded as logical entities and not with necessity as separate physical entities. It will be appreciated that the scope of the technology disclosed herein fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of this disclosure is accordingly not to be limited.

Reference to an element in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more." All structural and functional equivalents to the elements of the above-described embodiments that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed hereby. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the technology disclosed herein, for it to be encompassed hereby.

In the preceding description, for purposes of explanation and not limitation, specific details are set forth such as particular architectures, interfaces, techniques, etc. in order to provide a thorough understanding of the disclosed technology. However, it will be apparent to those skilled in the art that the disclosed technology may be practiced in other embodiments and/or combinations of embodiments that depart from these specific details. That is, those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the disclosed technology. In some instances, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the disclosed technology with unnecessary detail. All statements herein reciting principles, aspects, and embodiments of the disclosed technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, e.g. any elements developed that perform the same function, regardless of structure.

Thus, for example, it will be appreciated by those skilled in the art that block diagrams herein can represent conceptual views of illustrative circuitry or other functional units embodying the principles of the technology. Similarly, it will be appreciated that any flow charts, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.

The functions of the various elements including functional blocks may be provided through the use of hardware such as circuit hardware and/or hardware capable of executing software in the form of coded instructions stored on computer readable medium. Thus, such functions and illustrated functional blocks are to be understood as being either hardware- implemented and/or computer-implemented, and thus machine-implemented.

The embodiments described above are to be understood as a few illustrative examples of the present invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible.

Annex: Frame packing arrangement SEI message semantics

The SEI message informs the decoder that the output cropped decoded picture contains samples of multiple distinct spatially packed constituent frames that are packed into one frame using an indicated frame packing arrangement scheme. This information can be used by the decoder to appropriately rearrange the samples and process the samples of the constituent frames appropriately for display or other purposes (which are outside the scope of this Specification).

When an SEI NAL unit that contains a frame packing arrangement SEI message and has nuh_reserved_zero_6bits equal to 0 is present, the SEI NAL unit shall precede, in decoding order, the first VCL NAL unit in the access unit. This SEI message may be associated with pictures that are either frames or fields. The frame packing arrangement of the samples is specified in terms of the sampling structure of a frame in order to define a frame packing arrangement structure that is invariant with respect to whether a picture is a single field of such a packed frame or is a complete packed frame.

frame_packing_arrangement_id contains an identifying number that may be used to identify the usage of the frame packing arrangement SEI message. The value of frame_packing_arrangement_id shall be in the range of 0 to 2³² - 2, inclusive.

Values of frame_packing_arrangement_id from 0 to 255 and from 512 to 2³¹ - 1 may be used as determined by the application. Values of frame_packing_arrangement_id from 256 to 511 and from 2³¹ to 2³² - 2 are reserved for future use by ITU-T | ISO/IEC. Decoders shall ignore (remove from the bitstream and discard) all frame packing arrangement SEI messages containing a value of frame_packing_arrangement_id in the range of 256 to 511 or in the range of 2³¹ to 2³² - 2, and bitstreams shall not contain such values.

frame_packing_arrangement_cancel_flag equal to 1 indicates that the frame packing arrangement SEI message cancels the persistence of any previous frame packing arrangement SEI message in output order. frame_packing_arrangement_cancel_flag equal to 0 indicates that frame packing arrangement information follows.

frame_packing_arrangement_type indicates the type of packing arrangement of the frames as specified in Table D-8.

Table D-8 - Definition of frame_packing_arrangement_type

Value Interpretation

0 Each component plane of the decoded frames contains a "checkerboard" based interleaving of corresponding planes of two constituent frames as illustrated in Figure 12.

1 Each component plane of the decoded frames contains a column based interleaving of corresponding planes of two constituent frames as illustrated in Figures 13 and 14.

2 Each component plane of the decoded frames contains a row based interleaving of corresponding planes of two constituent frames as illustrated in Figures 15 and 16.

3 Each component plane of the decoded frames contains a side-by-side packing arrangement of corresponding planes of two constituent frames as illustrated in Figures 17, 18 and 21.

4 Each component plane of the decoded frames contains a top-bottom packing arrangement of corresponding planes of two constituent frames as illustrated in Figures 19 and 20.

5 The component planes of the decoded frames in output order form a temporal interleaving of alternating first and second constituent frames as illustrated in Figure 22.

6 Each decoded frame constitutes a single frame without the use of a frame packing of multiple constituent frames (see NOTE 5).

7 Each component plane of the decoded frames contains a rectangular region frame packing arrangement of corresponding planes of two constituent frames as illustrated in Figure D-12 (missing).

NOTE 1 - Figures 12 to 22 provide typical examples of rearrangement and upconversion processing for various packing arrangement schemes. Actual characteristics of the constituent frames are signalled in detail by the subsequent syntax elements of the frame packing arrangement SEI message. In Figures 12 to 22, an upconversion processing is performed on each constituent frame to produce frames having the same resolution as that of the decoded frame. An example of the upsampling method to be applied to a quincunx sampled frame as shown in Figure 12 or Figure 21 is to fill in missing positions with an average of the available spatially neighbouring samples (the average of the values of the available samples above, below, to the left and to the right of each sample to be generated). The actual upconversion process to be performed, if any, is outside the scope of this Specification.
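By way of a non-limiting illustration, the example upsampling method mentioned in NOTE 1 may be sketched as follows. The function names upsample_quincunx and quincunx_mask, the list-of-lists picture representation and the boolean mask are assumptions made for this sketch only; they are not part of the SEI message syntax, and the actual upconversion process remains outside the scope of this Specification.

```python
def quincunx_mask(height, width, parity=0):
    """Return a checkerboard mask: True where (x + y) % 2 == parity."""
    return [[(x + y) % 2 == parity for x in range(width)] for y in range(height)]


def upsample_quincunx(samples, present):
    """Fill each missing position with the mean of its available neighbours.

    samples: 2-D list of numbers; values at missing positions are ignored.
    present: 2-D list of bools, True where a decoded sample exists.
    Only the samples above, below, left and right are considered.
    """
    height, width = len(samples), len(samples[0])
    out = [row[:] for row in samples]  # copy so the input is left untouched
    for y in range(height):
        for x in range(width):
            if present[y][x]:
                continue
            neighbours = [
                samples[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < height and 0 <= nx < width and present[ny][nx]
            ]
            if neighbours:
                out[y][x] = sum(neighbours) / len(neighbours)
    return out


# Hypothetical 3x3 plane; zeros mark positions that were not transmitted.
plane = [[10, 0, 30], [0, 20, 0], [50, 0, 70]]
filled = upsample_quincunx(plane, quincunx_mask(3, 3))
```

In this sketch, positions at the picture border simply average over fewer neighbours, which is one possible reading of "available" samples.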

NOTE 2 - When the output time of the samples of constituent frame 0 differs from the output time of the samples of constituent frame 1 (i.e., when field_views_flag is equal to 1 or frame_packing_arrangement_type is equal to 5) and the display system in use presents two views simultaneously, the display time for constituent frame 0 should be delayed to coincide with the display time for constituent frame 1. (The display process is not specified in this Recommendation | International Standard.)

NOTE 3 - When field_views_flag is equal to 1 or frame_packing_arrangement_type is equal to 5, the value 0 for fixed_pic_rate_within_cvs_flag is not expected to be prevalent in industry use of this SEI message.

NOTE 4 - frame_packing_arrangement_type equal to 5 describes a temporal interleaving process of different views.
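As a non-limiting sketch of NOTE 2 and NOTE 4, frames that are temporally interleaved in output order may be grouped into stereo pairs, with constituent frame 0 held back until the display time of the following constituent frame 1. The function name pair_temporally_interleaved, the (output_time, picture) tuples and the first_is_frame0 parameter are assumptions of this sketch; the display process itself is not specified.

```python
def pair_temporally_interleaved(frames, first_is_frame0=True):
    """Group frames given in output order into (frame 0, frame 1) pairs.

    frames: list of (output_time, picture) tuples in output order.
    first_is_frame0: assumption that the first listed frame is constituent
        frame 0; if False, the first frame is skipped.
    Returns a list of (display_time, picture0, picture1), where display_time
    is the output time of constituent frame 1, i.e. frame 0 is delayed.
    """
    start = 0 if first_is_frame0 else 1
    pairs = []
    for i in range(start, len(frames) - 1, 2):
        _, picture0 = frames[i]
        time1, picture1 = frames[i + 1]
        pairs.append((time1, picture0, picture1))
    return pairs


# Hypothetical output-order sequence: left and right views alternate.
sequence = [(0, "L0"), (1, "R0"), (2, "L1"), (3, "R1"), (4, "L2")]
stereo_pairs = pair_temporally_interleaved(sequence)
```

A trailing frame 0 without a matching frame 1 is left unpaired in this sketch.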

NOTE 5 - The value of frame_packing_arrangement_type equal to 6 is used to signal presence of 2D content (that is not frame packed) in 3D services that use such a mix of contents. The frame_packing_arrangement_type value of 6 should only be used with frame pictures and with content_interpretation_type equal to 0.

NOTE 6 - Figure D-12 (missing) provides an illustration of the rearrangement process for the frame_packing_arrangement_type value of 7.

All other values of frame_packing_arrangement_type are reserved for future use by ITU-T | ISO/IEC. It is a requirement of bitstream conformance that the bitstreams shall not contain such other values of frame_packing_arrangement_type.

quincunx_sampling_flag equal to 1 indicates that each colour component plane of each constituent frame is quincunx sampled as illustrated in Figure 12 or 21, and quincunx_sampling_flag equal to 0 indicates that the colour component planes of each constituent frame are not quincunx sampled.

When frame_packing_arrangement_type is equal to 0, it is a requirement of bitstream conformance that quincunx_sampling_flag shall be equal to 1. When frame_packing_arrangement_type is equal to 5, 6, or 7, it is a requirement of bitstream conformance that quincunx_sampling_flag shall be equal to 0.
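The two conformance requirements above may, purely as an illustration, be checked as sketched below. The function name quincunx_flag_is_conformant is hypothetical, and the sketch covers only the constraints stated in this paragraph.

```python
def quincunx_flag_is_conformant(arrangement_type, quincunx_sampling_flag):
    """Check quincunx_sampling_flag against frame_packing_arrangement_type.

    Type 0 (checkerboard) requires the flag to be 1; types 5, 6 and 7
    require it to be 0; for types 1 to 4 either value is permitted.
    """
    if arrangement_type == 0:
        return quincunx_sampling_flag == 1
    if arrangement_type in (5, 6, 7):
        return quincunx_sampling_flag == 0
    return quincunx_sampling_flag in (0, 1)
```

For example, side-by-side packing (type 3) passes with either flag value, which matches Figures 17 and 21.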

NOTE 7 - For any chroma format (4:2:0, 4:2:2, or 4:4:4), the luma plane and each chroma plane is quincunx sampled as illustrated in Figure 12 when quincunx_sampling_flag is equal to 1.

Let croppedWidth and croppedHeight be the width and height, respectively, of the cropped frame output from the decoder in units of luma samples, derived as follows:

croppedWidth = pic_width_in_luma_samples - SubWidthC * ( conf_win_right_offset + conf_win_left_offset ) (D-1)

croppedHeight = pic_height_in_luma_samples - SubHeightC * ( conf_win_bottom_offset + conf_win_top_offset ) (D-2)

When frame_packing_arrangement_type is equal to 7, it is a requirement of bitstream conformance that both of the following conditions shall be true:

- croppedWidth is an integer multiple of 3 * SubWidthC.

- croppedHeight is an integer multiple of 3 * SubHeightC.
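As an illustrative sketch only, equations (D-1) and (D-2) and the two conditions above may be evaluated as follows. The function names and the snake_case parameter names are assumptions of this sketch; the example values (a 1920x1088 coded picture in 4:2:0 format, for which SubWidthC and SubHeightC are both 2) are hypothetical.

```python
def cropped_size(pic_width, pic_height, sub_width_c, sub_height_c,
                 left_offset, right_offset, top_offset, bottom_offset):
    """Apply (D-1) and (D-2): cropped frame size in luma samples."""
    cropped_width = pic_width - sub_width_c * (right_offset + left_offset)
    cropped_height = pic_height - sub_height_c * (bottom_offset + top_offset)
    return cropped_width, cropped_height


def type7_size_is_conformant(cropped_width, cropped_height,
                             sub_width_c, sub_height_c):
    """Both dimensions must be integer multiples of 3 * SubWidthC/SubHeightC."""
    return (cropped_width % (3 * sub_width_c) == 0
            and cropped_height % (3 * sub_height_c) == 0)


# Hypothetical example: 1920x1088 coded size, bottom offset of 4 chroma rows.
width, height = cropped_size(1920, 1088, 2, 2, 0, 0, 0, 4)
ok = type7_size_is_conformant(width, height, 2, 2)
```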

Let oneThirdWidth and oneThirdHeight be derived as follows:

oneThirdWidth = croppedWidth / 3 (D-3)

oneThirdHeight = croppedHeight / 3 (D-4)

When frame_packing_arrangement_type is equal to 7, the frame packing arrangement is composed of four rectangular regions as illustrated in Figure D-12. The upper left region contains constituent frame 0 and has corner points (xA, yA) (at upper left), (xB - 1, yB) (at upper right), (xC - 1, yC - 1) (at lower right) and (xD, yD - 1) (at lower left). Constituent frame 1 is decomposed into three regions, denoted as R1, R2 and R3. Region R1 has corner points (xB, yB) (at upper left), (xL - 1, yL) (at upper right), (xI - 1, yI - 1) (at lower right) and (xC, yC - 1) (at lower left); region R2 has corner points (xD, yD) (at upper left), (xG - 1, yG) (at upper right), (xE, yE - 1) (at lower left) and (xF - 1, yF - 1) (at lower right); region R3 has corner points (xG, yG) (at upper left), (xC - 1, yC) (at upper right), (xF, yF - 1) (at lower left) and (xH - 1, yH - 1) (at lower right). The (x, y) locations of points (xA, yA) to (xL, yL) are calculated as follows in units of luma sample coordinate positions:

(xA, yA) = ( SubWidthC * conf_win_left_offset, SubHeightC * conf_win_top_offset ) (D-5)

(xB, yB) = ( SubWidthC * conf_win_left_offset + 2 * oneThirdWidth, SubHeightC * conf_win_top_offset ) (D-6)

(xC, yC) = ( SubWidthC * conf_win_left_offset + 2 * oneThirdWidth, SubHeightC * conf_win_top_offset + 2 * oneThirdHeight ) (D-7)

(xD, yD) = ( SubWidthC * conf_win_left_offset, SubHeightC * conf_win_top_offset + 2 * oneThirdHeight ) (D-8)

(xE, yE) = ( SubWidthC * conf_win_left_offset, SubHeightC * conf_win_top_offset + croppedHeight ) (D-9)

(xF, yF) = ( SubWidthC * conf_win_left_offset + oneThirdWidth, SubHeightC * conf_win_top_offset + croppedHeight ) (D-10)

(xG, yG) = ( SubWidthC * conf_win_left_offset + oneThirdWidth, SubHeightC * conf_win_top_offset + 2 * oneThirdHeight ) (D-11)

(xH, yH) = ( SubWidthC * conf_win_left_offset + 2 * oneThirdWidth, SubHeightC * conf_win_top_offset + croppedHeight ) (D-12)

(xI, yI) = ( SubWidthC * conf_win_left_offset + croppedWidth, SubHeightC * conf_win_top_offset + 2 * oneThirdHeight ) (D-13)

(xL, yL) = ( SubWidthC * conf_win_left_offset + croppedWidth, SubHeightC * conf_win_top_offset + croppedHeight ) (D-14)
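The point calculations (D-5) to (D-14) may, as a non-limiting sketch, be expressed as below. The function names type7_points and type7_regions are hypothetical. As an assumption of this sketch, each region is represented as a half-open rectangle (x_start, y_start, x_end, y_end) built from the upper-left and lower-right corner points listed above, so that the lower-right corner (x - 1, y - 1) becomes the exclusive end (x, y).

```python
def type7_points(cropped_width, cropped_height,
                 sub_width_c=2, sub_height_c=2, left_offset=0, top_offset=0):
    """Evaluate (D-5) to (D-14); returns a dict of point name -> (x, y)."""
    x0 = sub_width_c * left_offset
    y0 = sub_height_c * top_offset
    w3 = cropped_width // 3   # oneThirdWidth (D-3)
    h3 = cropped_height // 3  # oneThirdHeight (D-4)
    return {
        "A": (x0, y0),
        "B": (x0 + 2 * w3, y0),
        "C": (x0 + 2 * w3, y0 + 2 * h3),
        "D": (x0, y0 + 2 * h3),
        "E": (x0, y0 + cropped_height),
        "F": (x0 + w3, y0 + cropped_height),
        "G": (x0 + w3, y0 + 2 * h3),
        "H": (x0 + 2 * w3, y0 + cropped_height),
        "I": (x0 + cropped_width, y0 + 2 * h3),
        "L": (x0 + cropped_width, y0 + cropped_height),
    }


def type7_regions(points):
    """Half-open rectangles from upper-left and lower-right corner points."""
    def rect(upper_left, lower_right_exclusive):
        return points[upper_left] + points[lower_right_exclusive]
    return {
        "R0": rect("A", "C"),  # constituent frame 0
        "R1": rect("B", "I"),  # part of constituent frame 1
        "R2": rect("D", "F"),  # part of constituent frame 1
        "R3": rect("G", "H"),  # part of constituent frame 1
    }


# Hypothetical 1920x1080 cropped frame with zero conformance window offsets.
regions = type7_regions(type7_points(1920, 1080))
```

With these example values, the three regions of constituent frame 1 together cover the same number of luma samples as constituent frame 0.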

When frame_packing_arrangement_type is equal to 7, constituent frame 0 is obtained by cropping from the decoded frames the region R0 enclosed by points A, B, C, and D, and constituent frame 1 is obtained by stacking vertically the regions R2 and R3, obtained by cropping the areas enclosed by points D, G, F, and E and G, C, F, and H, respectively, and then placing the resulting rectangle to the right of the region R1, obtained by cropping the area enclosed by points B, L, I, and C, as illustrated in Figure D-12.

content_interpretation_type indicates the intended interpretation of the constituent frames as specified in Table D-9. Values of content_interpretation_type that do not appear in Table D-9 are reserved for future specification by ITU-T | ISO/IEC.

When frame_packing_arrangement_type is not equal to 6, for each specified frame packing arrangement scheme, there are two constituent frames that are referred to as frame 0 and frame 1.

Table D-9 - Definition of content_interpretation_type

Value Interpretation

0 Unspecified relationship between the frame packed constituent frames

1 Indicates that the two constituent frames form the left and right views of a stereo view scene, with frame 0 being associated with the left view and frame 1 being associated with the right view

2 Indicates that the two constituent frames form the right and left views of a stereo view scene, with frame 0 being associated with the right view and frame 1 being associated with the left view

NOTE 8 - The value 2 for content_interpretation_type is not expected to be prevalent in industry use of this SEI message. However, the value was specified herein for purposes of completeness.

spatial_flipping_flag equal to 1, when frame_packing_arrangement_type is equal to 3 or 4, indicates that one of the two constituent frames is spatially flipped relative to its intended orientation for display or other such purposes.

When frame_packing_arrangement_type is equal to 3 or 4 and spatial_flipping_flag is equal to 1, the type of spatial flipping that is indicated is as follows:

- If frame_packing_arrangement_type is equal to 3, the indicated spatial flipping is horizontal flipping.

- Otherwise (frame_packing_arrangement_type is equal to 4), the indicated spatial flipping is vertical flipping.

When frame_packing_arrangement_type is not equal to 3 or 4, it is a requirement of bitstream conformance that spatial_flipping_flag shall be equal to 0. When frame_packing_arrangement_type is not equal to 3 or 4, the value 1 for spatial_flipping_flag is reserved for future use by ITU-T | ISO/IEC. When frame_packing_arrangement_type is not equal to 3 or 4, decoders shall ignore the value 1 for spatial_flipping_flag.

frame0_flipped_flag, when spatial_flipping_flag is equal to 1, indicates which one of the two constituent frames is flipped. When spatial_flipping_flag is equal to 1, frame0_flipped_flag equal to 0 indicates that frame 0 is not spatially flipped and frame 1 is spatially flipped, and frame0_flipped_flag equal to 1 indicates that frame 0 is spatially flipped and frame 1 is not spatially flipped.
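A non-limiting sketch of how a receiver might undo the indicated flipping is given below. The function name undo_spatial_flip and the list-of-rows frame representation are assumptions of this sketch; the flags are passed in as plain integers already parsed from the SEI message.

```python
def undo_spatial_flip(frame0, frame1, arrangement_type,
                      spatial_flipping_flag, frame0_flipped_flag):
    """Return (frame0, frame1) with the indicated flip reversed.

    Frames are 2-D lists of samples (list of rows). For type 3
    (side-by-side) the flip is horizontal; for type 4 (top-bottom) it is
    vertical. For other types the flag is ignored and frames are returned
    unchanged.
    """
    if spatial_flipping_flag != 1 or arrangement_type not in (3, 4):
        return frame0, frame1

    def unflip(frame):
        if arrangement_type == 3:
            return [row[::-1] for row in frame]  # mirror each row
        return frame[::-1]                       # reverse the row order

    if frame0_flipped_flag == 1:
        return unflip(frame0), frame1
    return frame0, unflip(frame1)


# Hypothetical 2x2 constituent frames.
left = [[1, 2], [3, 4]]
right = [[5, 6], [7, 8]]
restored = undo_spatial_flip(left, right, 3, 1, 0)
```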

When spatial_flipping_flag is equal to 0, it is a requirement of bitstream conformance that frame0_flipped_flag shall be equal to 0. When spatial_flipping_flag is equal to 0, the value 1 for frame0_flipped_flag is reserved for future use by ITU-T | ISO/IEC. When spatial_flipping_flag is equal to 0, decoders shall ignore the value of frame0_flipped_flag.

field_views_flag equal to 1 indicates that all pictures in the current coded video sequence are coded as complementary field pairs. All fields of a particular parity are considered a first constituent frame and all fields of the opposite parity are considered a second constituent frame. When frame_packing_arrangement_type is not equal to 2, it is a requirement of bitstream conformance that the field_views_flag shall be equal to 0. When frame_packing_arrangement_type is not equal to 2, the value 1 for field_views_flag is reserved for future use by ITU-T | ISO/IEC. When frame_packing_arrangement_type is not equal to 2, decoders shall ignore the value of field_views_flag.

current_frame_is_frame0_flag equal to 1, when frame_packing_arrangement_type is equal to 5, indicates that the current decoded frame is constituent frame 0 and the next decoded frame in output order is constituent frame 1, and the display time of the constituent frame 0 should be delayed to coincide with the display time of constituent frame 1. current_frame_is_frame0_flag equal to 0, when frame_packing_arrangement_type is equal to 5, indicates that the current decoded frame is constituent frame 1 and the previous decoded frame in output order is constituent frame 0, and the display time of the constituent frame 1 should not be delayed for purposes of stereo-view pairing.

When frame_packing_arrangement_type is not equal to 5, the constituent frame associated with the upper-left sample of the decoded frame is considered to be constituent frame 0 and the other constituent frame is considered to be constituent frame 1. When frame_packing_arrangement_type is not equal to 5, it is a requirement of bitstream conformance that current_frame_is_frame0_flag shall be equal to 0. When frame_packing_arrangement_type is not equal to 5, the value 1 for current_frame_is_frame0_flag is reserved for future use by ITU-T | ISO/IEC. When frame_packing_arrangement_type is not equal to 5, decoders shall ignore the value of current_frame_is_frame0_flag.

frame0_self_contained_flag equal to 1 indicates that no inter prediction operations within the decoding process for the samples of constituent frame 0 of the coded video sequence refer to samples of any constituent frame 1. frame0_self_contained_flag equal to 0 indicates that some inter prediction operations within the decoding process for the samples of constituent frame 0 of the coded video sequence may or may not refer to samples of some constituent frame 1. When frame_packing_arrangement_type is equal to 0 or 1, it is a requirement of bitstream conformance that frame0_self_contained_flag shall be equal to 0. When frame_packing_arrangement_type is equal to 0 or 1, the value 1 for frame0_self_contained_flag is reserved for future use by ITU-T | ISO/IEC. When frame_packing_arrangement_type is equal to 0 or 1, decoders shall ignore the value of frame0_self_contained_flag. Within a coded video sequence, the value of frame0_self_contained_flag in all frame packing arrangement SEI messages shall be the same.

frame1_self_contained_flag equal to 1 indicates that no inter prediction operations within the decoding process for the samples of constituent frame 1 of the coded video sequence refer to samples of any constituent frame 0. frame1_self_contained_flag equal to 0 indicates that some inter prediction operations within the decoding process for the samples of constituent frame 1 of the coded video sequence may or may not refer to samples of some constituent frame 0. When frame_packing_arrangement_type is equal to 0 or 1, it is a requirement of bitstream conformance that frame1_self_contained_flag shall be equal to 0. When frame_packing_arrangement_type is equal to 0 or 1, the value 1 for frame1_self_contained_flag is reserved for future use by ITU-T | ISO/IEC. When frame_packing_arrangement_type is equal to 0 or 1, decoders shall ignore the value of frame1_self_contained_flag. Within a coded video sequence, the value of frame1_self_contained_flag in all frame packing arrangement SEI messages shall be the same.

NOTE 9 - When frame0_self_contained_flag is equal to 1 or frame1_self_contained_flag is equal to 1, and frame_packing_arrangement_type is equal to 2, it is expected that the decoded frame should not be an MBAFF frame.

When quincunx_sampling_flag is equal to 0 and frame_packing_arrangement_type is not equal to 5, two (x, y) coordinate pairs are specified to determine the indicated luma sampling grid alignment for constituent frame 0 and constituent frame 1, relative to the upper left corner of the rectangular area represented by the samples of the corresponding constituent frame.

NOTE 10 - The location of chroma samples relative to luma samples can be indicated by the chroma_sample_loc_type_top_field and chroma_sample_loc_type_bottom_field syntax elements in the VUI parameters.

frame0_grid_position_x (when present) specifies the x component of the (x, y) coordinate pair for constituent frame 0.

frame0_grid_position_y (when present) specifies the y component of the (x, y) coordinate pair for constituent frame 0.

frame1_grid_position_x (when present) specifies the x component of the (x, y) coordinate pair for constituent frame 1.

frame1_grid_position_y (when present) specifies the y component of the (x, y) coordinate pair for constituent frame 1.

When quincunx_sampling_flag is equal to 0 and frame_packing_arrangement_type is not equal to 5, the (x, y) coordinate pair for each constituent frame is interpreted as follows:

- If the (x, y) coordinate pair for a constituent frame is equal to (0, 0), this indicates a default sampling grid alignment specified as follows:

- If frame_packing_arrangement_type is equal to 1 or 3, the indicated position is the same as for the (x, y) coordinate pair value (4, 8), as illustrated in Figure 13 and Figure 17.

- Otherwise (frame_packing_arrangement_type is equal to 2 or 4), the indicated position is the same as for the (x, y) coordinate pair value (8, 4), as illustrated in Figure 15 and Figure 19.

- Otherwise, if the (x, y) coordinate pair for a constituent frame is equal to (15, 15), this indicates that the sampling grid alignment is unknown or unspecified or specified by other means not specified in this Recommendation | International Standard.

- Otherwise, the x and y elements of the (x, y) coordinate pair specify the indicated horizontal and vertical sampling grid alignment positioning to the right of and below the upper left corner of the rectangular area represented by the corresponding constituent frame, respectively, in units of one sixteenth of the luma sample grid spacing between the samples of the columns and rows of the constituent frame that are present in the decoded frame (prior to any upsampling for display or other purposes).
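The interpretation rules listed above may be sketched, without limitation, as follows. The function name grid_offset is hypothetical, returning None for the unknown alignment is a choice made for this sketch, and raising an error for a (0, 0) pair with an arrangement type other than 1 to 4 is likewise an assumption, since the rules above name only those types for the default alignment.

```python
def grid_offset(arrangement_type, x, y):
    """Interpret an (x, y) grid position pair for one constituent frame.

    Returns the (horizontal, vertical) offset in units of the luma sample
    grid spacing, or None when the alignment is unknown or unspecified.
    Assumes quincunx_sampling_flag == 0 and arrangement_type != 5.
    """
    if (x, y) == (15, 15):
        return None  # unknown, unspecified, or specified by other means
    if (x, y) == (0, 0):
        if arrangement_type in (1, 3):
            x, y = 4, 8   # default for column interleaving and side-by-side
        elif arrangement_type in (2, 4):
            x, y = 8, 4   # default for row interleaving and top-bottom
        else:
            raise ValueError("no default alignment stated for this type")
    # Positions are signalled in sixteenths of the luma sample grid spacing.
    return (x / 16, y / 16)
```

For instance, the pair (12, 8) used for constituent frame 1 in Figure 14 corresponds to an offset of three quarters of a sample horizontally and half a sample vertically.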

NOTE 11 - The spatial location reference information frame0_grid_position_x, frame0_grid_position_y, frame1_grid_position_x, and frame1_grid_position_y is not provided when quincunx_sampling_flag is equal to 1 because the spatial alignment in this case is assumed to be such that constituent frame 0 and constituent frame 1 cover corresponding spatial areas with interleaved quincunx sampling patterns as illustrated in Figure 12 and Figure 21.

NOTE 12 - When frame_packing_arrangement_type is equal to 2 and field_views_flag is equal to 1, it is suggested that frame0_grid_position_y should be equal to frame1_grid_position_y.

frame_packing_arrangement_reserved_byte is reserved for future use by ITU-T | ISO/IEC. It is a requirement of bitstream conformance that the value of frame_packing_arrangement_reserved_byte shall be equal to 0. All other values of frame_packing_arrangement_reserved_byte are reserved for future use by ITU-T | ISO/IEC. Decoders shall ignore (remove from the bitstream and discard) the value of frame_packing_arrangement_reserved_byte.

frame_packing_arrangement_repetition_period specifies the persistence of the frame packing arrangement SEI message and may specify a frame order count interval within which another frame packing arrangement SEI message with the same value of frame_packing_arrangement_id or the end of the coded video sequence shall be present in the bitstream. The value of frame_packing_arrangement_repetition_period shall be in the range of 0 to 16 384, inclusive.

frame_packing_arrangement_repetition_period equal to 0 specifies that the frame packing arrangement SEI message applies to the current decoded frame only.

frame_packing_arrangement_repetition_period equal to 1 specifies that the frame packing arrangement SEI message persists in output order until any of the following conditions are true:

- A new coded video sequence begins.

- A frame in an access unit containing a frame packing arrangement SEI message with the same value of frame_packing_arrangement_id is output having PicOrderCnt( ) greater than PicOrderCnt( CurrPic ).

frame_packing_arrangement_repetition_period equal to 0 or equal to 1 indicates that another frame packing arrangement SEI message with the same value of frame_packing_arrangement_id may or may not be present.

frame_packing_arrangement_repetition_period greater than 1 specifies that the frame packing arrangement SEI message persists until any of the following conditions are true:

- A new coded video sequence begins.

- A frame in an access unit containing a frame packing arrangement SEI message with the same value of frame_packing_arrangement_id is output having PicOrderCnt( ) greater than PicOrderCnt( CurrPic ) and less than or equal to PicOrderCnt( CurrPic ) + frame_packing_arrangement_repetition_period.

frame_packing_arrangement_repetition_period greater than 1 indicates that another frame packing arrangement SEI message with the same value of frame_packing_arrangement_id shall be present for a frame in an access unit that is output having PicOrderCnt( ) greater than PicOrderCnt( CurrPic ) and less than or equal to PicOrderCnt( CurrPic ) + frame_packing_arrangement_repetition_period; unless the bitstream ends or a new coded video sequence begins without output of such a frame.

upsampled_aspect_ratio_flag equal to 1 indicates that the sample aspect ratio (SAR) indicated by the VUI parameters of the sequence parameter set identifies the SAR of the samples after the application of an upconversion process to produce a higher resolution frame from each constituent frame as illustrated in Figure 12 to Figure 21.

upsampled_aspect_ratio_flag equal to 0 indicates that the SAR indicated by the VUI parameters of the sequence parameter set identifies the SAR of the samples before the application of any such upconversion process.

NOTE 13 - The default display window parameters in the VUI parameters of the sequence parameter set can be used by an encoder to indicate to a decoder that does not interpret the frame packing arrangement SEI message that the default display window is an area within only one of the two constituent frames.

NOTE 14 - The SAR indicated in the VUI parameters should indicate the preferred display picture shape for the packed decoded frame output by a decoder that does not interpret the frame packing arrangement SEI message. When upsampled_aspect_ratio_flag is equal to 1, the SAR produced in each upconverted colour plane is indicated to be the same as the SAR indicated in the VUI parameters in the examples shown in Figure 12 to Figure 21. When upsampled_aspect_ratio_flag is equal to 0, the SAR produced in each colour plane prior to upconversion is indicated to be the same as the SAR indicated in the VUI parameters in the examples shown in Figures 12-21.

Figure 12 illustrates frame packing arrangement with rearrangement and upconversion of checkerboard interleaving (frame_packing_arrangement_type equal to 0)

Figure 13 illustrates frame packing arrangement with rearrangement and upconversion of column interleaving with frame_packing_arrangement_type equal to 1, quincunx_sampling_flag equal to 0, and (x, y) equal to (0, 0) or (4, 8) for both constituent frames

Figure 14 illustrates frame packing arrangement with rearrangement and upconversion of column interleaving with frame_packing_arrangement_type equal to 1, quincunx_sampling_flag equal to 0, (x, y) equal to (0, 0) or (4, 8) for constituent frame 0 and (x, y) equal to (12, 8) for constituent frame 1

Figure 15 illustrates frame packing arrangement with rearrangement and upconversion of row interleaving with frame_packing_arrangement_type equal to 2, quincunx_sampling_flag equal to 0, and (x, y) equal to (0, 0) or (8, 4) for both constituent frames

Figure 16 illustrates frame packing arrangement with rearrangement and upconversion of row interleaving with frame_packing_arrangement_type equal to 2, quincunx_sampling_flag equal to 0, (x, y) equal to (0, 0) or (8, 4) for constituent frame 0, and (x, y) equal to (8, 12) for constituent frame 1

Figure 17 illustrates frame packing arrangement with rearrangement and upconversion of side-by-side packing arrangement with frame_packing_arrangement_type equal to 3, quincunx_sampling_flag equal to 0, and (x, y) equal to (0, 0) or (4, 8) for both constituent frames

Figure 18 illustrates frame packing arrangement with rearrangement and upconversion of side-by-side packing arrangement with frame_packing_arrangement_type equal to 3, quincunx_sampling_flag equal to 0, (x, y) equal to (12, 8) for constituent frame 0, and (x, y) equal to (0, 0) or (4, 8) for constituent frame 1

Figure 19 illustrates frame packing arrangement with rearrangement and upconversion of top-bottom packing arrangement with frame_packing_arrangement_type equal to 4, quincunx_sampling_flag equal to 0, and (x, y) equal to (0, 0) or (8, 4) for both constituent frames

Figure 20 illustrates frame packing arrangement with rearrangement and upconversion of top-bottom packing arrangement with frame_packing_arrangement_type equal to 4, quincunx_sampling_flag equal to 0, (x, y) equal to (8, 12) for constituent frame 0, and (x, y) equal to (0, 0) or (8, 4) for constituent frame 1

Figure 21 illustrates frame packing arrangement with rearrangement and upconversion of side-by-side packing arrangement with quincunx sampling (frame_packing_arrangement_type equal to 3 with quincunx_sampling_flag equal to 1)

Figure 22 illustrates frame packing arrangement with rearrangement of a temporal interleaving frame arrangement (frame_packing_arrangement_type equal to 5)

Figure D-12 (missing) illustrates frame packing arrangement with rearrangement and upconversion of rectangular region frame packing arrangement (frame_packing_arrangement_type equal to 7).

Claims

1. A method, performed by a decoder (50), for managing multiple views of a coded video sequence, wherein the multiple views are temporally interleaved with respect to output order of the coded video sequence, wherein the method comprises:

obtaining (207) an identity associated with a Network Abstraction Layer "NAL" unit of the coded video sequence, wherein the identity relates to a temporal layer of the NAL unit; and

discarding (208) the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence.

2. The method according to claim 1, wherein the method further comprises:

decoding (209) the NAL unit when the identity is below the threshold value.

3. The method according to claim 1 or 2, wherein the method further comprises:

decoding (205) the threshold value from a bitstream, comprising the coded video sequence.

4. The method according to claim 1 or 2, wherein the threshold value is predetermined.

5. The method according to claim 1 or 2, wherein the method further comprises:

selecting (206) the threshold value by deducing the threshold value from the coded video sequence.

6. The method according to any one of the preceding claims, wherein the bitstream is configured according to frame packing arrangement type 5.

7. The method according to any one of the preceding claims, wherein the coded video sequence comprises the temporal layer and at least one further temporal layer.

8. The method according to any one of the preceding claims, wherein the coded video sequence contains two views.

9. The method according to any one of the preceding claims, wherein the coded video sequence comprises at least said NAL unit, and wherein the coded video sequence is comprised in the bitstream when claim 9 depends on at least claim 3, or wherein the coded video sequence is comprised in a bitstream when claim 9 does not depend on claim 3.

10. The method according to any one of the preceding claims, wherein the coded video sequence is a High Efficiency Video Coding "HEVC" compliant coded video sequence.

11. A method, performed by an encoder (80), for encoding multiple views of a video sequence into a coded video sequence, wherein the multiple views comprise a first view and a second view, wherein the method comprises:

encoding (201) all pictures belonging to the first view into one or more Network Abstraction Layer "NAL" units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value;

encoding (202) all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value.

12. The method according to claim 11, wherein the method further comprises:

setting (203) a threshold value for separating the multiple views to the first value increased by one.

13. The method according to the preceding claim, wherein the method further comprises:

encoding (204) the threshold value into a bitstream comprising the coded video sequence.

14. The method according to claim 12, wherein the threshold value is predetermined.

15. The method according to any one of claims 11 -14, wherein the method comprises encoding the coded video sequence according to frame packing arrangement type 5.

16. The method according to any one of claims 11 -15, wherein the multiple views contains the first and second views.

17. The method according to any one of the claims 11-16, wherein the coded video sequence comprises said one or more NAL units, and wherein the coded video sequence is comprised in the bitstream when claim 17 depends on at least claim 13, or wherein the coded video sequence is comprised in a bitstream when claim 17 does not depend on claim 13.

18. The method according to any one of the claims 11-17, wherein the coded video sequence is a High Efficiency Video Coding "HEVC" compliant coded video sequence.

19. A decoder (50) configured to manage multiple views of a coded video sequence, wherein the multiple views are temporally interleaved with respect to output order of the coded video sequence, wherein the decoder (50) comprises a processing circuit (52) configured to: obtain an identity associated with a Network Abstraction Layer "NAL" unit of the coded video sequence, wherein the identity relates to a temporal layer of the NAL unit; and discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence.

20. The decoder (50) according to claim 19, wherein the processing circuit (52) further is configured to decode the NAL unit when the identity is below the threshold value.
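
As a non-limiting sketch of the behaviour recited in claims 19 and 20, the decision made by the processing circuit (52) may be expressed as a simple partition of NAL units by temporal identity. The function name `select_nal_units` and the representation of a NAL unit as a `(temporal_id, payload)` tuple are assumptions made only for this illustration.

```python
def select_nal_units(nal_units, threshold):
    """Split NAL units into those to decode and those to discard.

    nal_units: iterable of (temporal_id, payload) tuples.
    A unit is discarded when its temporal identity is above or equal to
    the threshold, and kept for decoding when it is below the threshold.
    """
    to_decode, discarded = [], []
    for temporal_id, payload in nal_units:
        if temporal_id >= threshold:
            discarded.append(payload)
        else:
            to_decode.append(payload)
    return to_decode, discarded
```

For a stream whose first view uses temporal layer 0 and whose second view uses temporal layer 1, a threshold of 1 keeps only the first view, which a decoder limited to a single view can then decode.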

21. The decoder (50) according to claim 19 or 20, wherein the processing circuit (52) further is configured to decode the threshold value from a bitstream comprising the coded video sequence.

22. The decoder (50) according to claim 19 or 20, wherein the threshold value is predetermined.

23. The decoder (50) according to claim 19 or 20, wherein the processing circuit (52) further is configured to select the threshold value by deducing the threshold value from the coded video sequence.

24. The decoder (50) according to any one of claims 19-23, wherein the bitstream is configured according to frame packing arrangement type 5.

25. The decoder (50) according to any one of claims 19-24, wherein the coded video sequence comprises the temporal layer and at least one further temporal layer.

26. The decoder (50) according to any one of claims 19-25, wherein the coded video sequence contains two views.

27. The decoder (50) according to any one of claims 19-26, wherein the coded video sequence comprises at least said NAL unit, and wherein the coded video sequence is comprised in the bitstream when claim 27 depends on at least claim 21, or wherein the coded video sequence is comprised in a bitstream when claim 27 does not depend on claim 21.

28. The decoder (50) according to any one of claims 19-27, wherein the coded video sequence is a High Efficiency Video Coding "HEVC" compliant coded video sequence.

29. An encoder (80) configured to encode multiple views of a video sequence into a coded video sequence, wherein the multiple views comprise a first view and a second view, wherein the encoder (80) comprises a processing circuit (82) configured to:

encode all pictures belonging to the first view into one or more Network Abstraction Layer "NAL" units of the coded video sequence, wherein each NAL unit has a respective temporal identity equal to or less than a first value;

encode all pictures belonging to the second view into one or more NAL units of the coded video sequence, wherein each NAL unit has a respective temporal identity greater than the first value.

30. The encoder (80) according to claim 29, wherein the processing circuit (82) further is configured to set a threshold value for separating the multiple views to the first value increased by one.

31. The encoder (80) according to the preceding claim, wherein the processing circuit (82) further is configured to encode the threshold value into a bitstream comprising the coded video sequence.

32. The encoder (80) according to claim 30, wherein the threshold value is predetermined.

33. The encoder (80) according to any one of claims 29-32, wherein the processing circuit (82) further is configured to encode the coded video sequence according to frame packing arrangement type 5.

34. The encoder (80) according to any one of claims 29-33, wherein the multiple views contain the first and second views.

35. The encoder (80) according to any one of the claims 29-34, wherein the coded video sequence comprises said one or more NAL units, and wherein the coded video sequence is comprised in the bitstream when claim 35 depends on at least claim 31, or wherein the coded video sequence is comprised in a bitstream when claim 35 does not depend on claim 31.

36. The encoder (80) according to any one of the claims 29-35, wherein the coded video sequence is a High Efficiency Video Coding "HEVC" compliant coded video sequence.

37. A computer program (55), comprising computer readable code units which when executed on a decoder (50) causes the decoder (50) to perform the method according to any one of claims 1-10.

38. A computer program product (56), comprising a computer readable medium (57) and a computer program (55) according to claim 37 stored on the computer readable medium (57).

39. A computer program (85), comprising computer readable code units which when executed on an encoder (70, 80) causes the encoder (70, 80) to perform the method according to any one of claims 11-18.

40. A computer program product (86), comprising a computer readable medium (87) and a computer program (85) according to claim 39 stored on the computer readable medium (87).

41. A network node (110) configured to manage multiple views of a coded video sequence, wherein the multiple views are temporally interleaved with respect to output order of the coded video sequence, wherein the network node (110) is configured to:

obtain an identity associated with a Network Abstraction Layer "NAL" unit of the coded video sequence, wherein the identity relates to a temporal layer of the NAL unit;

discard the NAL unit when the identity is above or equal to a threshold value for separating the multiple views of the coded video sequence; and

forward the NAL unit when the identity is below the threshold value.
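
As a non-limiting sketch of the network node (110), the following assumes that the identity is read from the two-byte HEVC NAL unit header, in which the three least significant bits of the second byte carry `nuh_temporal_id_plus1`. That mapping of the claimed identity onto the HEVC header field, and the function names `temporal_id` and `forward_nal_units`, are assumptions made only for this illustration.

```python
def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId from a two-byte HEVC NAL unit header.

    Assumes nal_unit starts at the header (no start code prefix).
    """
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its two-byte header")
    # nuh_temporal_id_plus1 occupies the low 3 bits of the second byte.
    return (nal_unit[1] & 0x07) - 1


def forward_nal_units(nal_units, threshold):
    """Forward units below the threshold; discard the rest."""
    return [nal for nal in nal_units if temporal_id(nal) < threshold]
```

For example, with a threshold of 1, a unit whose header bytes are `0x02 0x01` (TemporalId 0) is forwarded, while a unit whose header bytes are `0x02 0x02` (TemporalId 1) is discarded.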