WO2022069790A1 - A method, an apparatus and a computer program product for video encoding/decoding - Google Patents

A method, an apparatus and a computer program product for video encoding/decoding

Info

Publication number
WO2022069790A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
bitstream
control signal
pictures
subpicture
Prior art date
Application number
PCT/FI2021/050541
Other languages
French (fr)
Inventor
Miska Matias Hannuksela
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP21874641.0A priority Critical patent/EP4222977A1/en
Publication of WO2022069790A1 publication Critical patent/WO2022069790A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/31 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the temporal domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236 Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2365 Multiplexing of several video streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H04N7/152 Multipoint control units therefor

Definitions

  • the present solution generally relates to video encoding and video decoding.
  • a video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
  • a method comprising creating a control signal for controlling encoding of a bitstream; transmitting the control signal to more than one encoder; receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging encoded input bitstreams into one output bitstream; and providing the output bitstream comprising the merged encoded input bitstreams for decoding.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to create a control signal for controlling encoding of a bitstream; to transmit the control signal to more than one encoder; to receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal; to merge encoded input bitstreams into one output bitstream; and to provide the output bitstream comprising the merged encoded input bitstreams for decoding.
  • an apparatus comprising at least means for creating a control signal for controlling encoding of a bitstream; means for transmitting the control signal to more than one encoder; means for receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; means for merging encoded input bitstreams into one output bitstream; and means for providing the output bitstream comprising the merged encoded input bitstreams for decoding.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to create a control signal for controlling encoding of a bitstream; to transmit the control signal to more than one encoder; to receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal; to merge encoded input bitstreams into one output bitstream; and to provide the output bitstream comprising the merged encoded input bitstreams for decoding.
  • an input bitstream corresponds to a subpicture in the output bitstream.
  • control signal comprises one or more of the following:
  • the apparatus is caused to assign picture order count values into an output bitstream by overwriting the picture order count values of the input bitstream.
  • the apparatus is caused to include a not-coded subpicture into the output bitstream, where there is no corresponding picture in the corresponding input bitstream.
  • the not-coded subpicture may be one of the following: an IDR subpicture, or a trailing subpicture that is encoded such that it is reconstructed by copying the collocated subpicture from the previous picture.
  • the apparatus is further caused to signal a picture prediction hierarchy and a sublayer hierarchy to be used as a part of the control signal.
  • the apparatus is further caused to determine that an encoded media stream comprises multiple temporal sublayers, whereupon the apparatus is caused to omit a picture in a higher temporal sublayer, not to include it in the output bitstream, and to request a refresh of the temporal sublayer from which the picture was omitted.
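  • The control-and-merge flow summarized in the preceding bullets can be illustrated with a minimal, non-normative sketch. All names below (ControlSignal, MergeController, the encode() call on an encoder object, and so on) are hypothetical and only show the flow: one control signal is transmitted to several encoders, each returned bitstream is treated as a subpicture of the merged output, picture order count values are rewritten, and a not-coded subpicture stands in where an input has no corresponding picture.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ControlSignal:
    # Hypothetical contents; the embodiments mention e.g. a picture
    # prediction hierarchy and a sublayer hierarchy as possible parts.
    idr_period: int
    sublayer_hierarchy: List[int]

@dataclass
class CodedPicture:
    poc: int            # picture order count in the input bitstream
    payload: bytes      # coded data, treated as opaque here

@dataclass
class InputBitstream:
    source_id: int
    pictures: List[CodedPicture] = field(default_factory=list)

class MergeController:
    """Sketch of the controlling/merging entity (e.g. an MCU or an NBMP task)."""

    def __init__(self, control: ControlSignal):
        self.control = control

    def distribute(self, encoders):
        # Transmit the same control signal to more than one encoder and
        # collect the encoded media streams they return.
        return [encoder.encode(self.control) for encoder in encoders]

    def merge(self, inputs: List[InputBitstream]):
        """Merge the input bitstreams so that each input becomes a subpicture
        of the output, overwriting picture order count values."""
        merged = []
        length = max(len(b.pictures) for b in inputs)
        for out_poc in range(length):
            subpictures = []
            for b in inputs:
                if out_poc < len(b.pictures):
                    subpictures.append((b.source_id, b.pictures[out_poc].payload))
                else:
                    # No corresponding picture in this input: include a
                    # not-coded subpicture that copies the collocated
                    # subpicture of the previous output picture.
                    subpictures.append((b.source_id, b"not-coded"))
            merged.append({"poc": out_poc, "subpictures": subpictures})
        return merged
```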
  • Fig. 1 shows an NBMP architecture for use in accordance with example embodiments of the invention
  • Fig. 2 shows an example of multipoint video conferencing
  • Fig. 3 shows an example of a video decoding engine and interfaces
  • Fig. 4 shows examples of encoding structures
  • Fig. 5 shows an example of incoming bitstreams
  • Fig. 6 shows an example of a destination bitstream
  • Fig. 7 shows an example of multipoint conferencing without an MCU
  • Fig. 8 is a flowchart illustrating a method according to an embodiment
  • Fig. 9 shows an apparatus according to an embodiment
  • Fig. 10 shows an encoding process according to an embodiment
  • Fig. 11 shows a decoding process according to an embodiment.
  • the video coding arrangement can be part of a system enabling streamable and participatable events.
  • An example of such a system is a video conferencing system, but it is to be noted, however, that the present embodiments are not necessarily limited to such systems, but can be applied in other systems where content from multiple video sources is to be transmitted to end users.
  • a sender is a video source, such as a camera, or a system being operatively connected to a video source or comprising means to record video.
  • the sender is configured to record other content as well, such as audio.
  • the purpose of the sender is to stream video being recorded to a receiver(s).
  • the sender may also comprise means for receiving at least audio data from receiver(s) and output the received audio data.
  • the system discussed in the present description may comprise multiple senders and multiple receivers.
  • a device participating in the system may act as a sender or a receiver or as both.
  • the content being streamed or transmitted is delivered between devices through a transmission channel or a communication channel.
  • a channel may be a physical transmission medium, such as a wire, or a logical connection over a multiplexed medium.
  • RTP Real-time Transport Protocol
  • UDP User Datagram Protocol
  • IP Internet Protocol
  • RTP is specified in Internet Engineering Task Force (IETF) Request for Comments (RFC) 3550, available from www.ietf.org/rfc/rfc3550.txt.
  • IETF Internet Engineering Task Force
  • RFC Request for Comments
  • media data is encapsulated into RTP packets.
  • each media type or media coding format has a dedicated RTP payload format.
  • An RTP session is an association among a group of participants communicating with RTP. It is a group communications channel which can potentially carry a number of RTP streams.
  • An RTP stream is a stream of RTP packets comprising media data.
  • An RTP stream is identified by an SSRC belonging to a particular RTP session.
  • SSRC refers to either a synchronization source or a synchronization source identifier that is the 32-bit SSRC field in the RTP packet header.
  • a synchronization source is characterized in that all packets from the synchronization source form part of the same timing and sequence number space, so a receiver device may group packets by synchronization source for playback.
  • synchronization sources include the sender of a stream of packets derived from a signal source such as a microphone or a camera, or an RTP mixer.
  • Each RTP stream is identified by an SSRC that is unique within the RTP session.
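  • As a concrete illustration of the RTP packet header referred to above, the sketch below parses the fixed 12-byte header defined in RFC 3550, including the 32-bit SSRC field, and groups packets by SSRC; this reflects the standard header layout and is not specific to the present embodiments.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("truncated RTP packet")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": (b0 >> 5) & 0x1,
        "extension": (b0 >> 4) & 0x1,
        "csrc_count": b0 & 0x0F,
        "marker": b1 >> 7,
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,   # identifies the synchronization source / RTP stream
    }

def group_by_ssrc(packets):
    """Group received packets by synchronization source for playback."""
    streams = {}
    for pkt in packets:
        header = parse_rtp_header(pkt)
        streams.setdefault(header["ssrc"], []).append(header["sequence_number"])
    return streams
```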
  • RTCP Real-time Control Protocol
  • An RTCP stream accompanies an RTP stream.
  • SR sender report
  • RR RTCP receiver report
  • a point-to-point RTP session consists of two endpoints, communicating using unicast. Both RTP and RTCP traffic are conveyed endpoint to endpoint.
  • An MCU (multipoint control unit) may implement the functionality of an RTP translator or an RTP mixer.
  • An RTP translator may be a media translator that may modify the media inside the RTP stream.
  • a media translator may for example decode and re-encode the media content (i.e. transcode the media content).
  • An RTP mixer is a middlebox that aggregates multiple RTP streams that are part of a session by generating one or more new RTP streams.
  • An RTP mixer may manipulate the media data.
  • One common application for a mixer is to allow a participant to receive a session with a reduced amount of resources compared to receiving individual RTP streams from all endpoints.
  • a mixer can be viewed as a device terminating the RTP streams received from other endpoints in the same RTP session. Using the media data carried in the received RTP streams, a mixer generates derived RTP streams that are sent to the receiving endpoints.
  • the Session Description Protocol (SDP) may be used to convey media details, transport addresses, and other session description metadata, when initiating multimedia teleconferences, voice-over-IP calls, or other multimedia delivery sessions.
  • the offer/answer model specifies a mechanism that endpoints use to achieve a common operating point of media details and other session description metadata when initiating a multimedia delivery session.
  • One endpoint, the offerer, sends a session description (the offer) to the other endpoint, the answerer.
  • the offer contains all the media parameters needed to exchange media with the offerer, including codecs, transport addresses, and protocols to transfer media.
  • when the answerer receives an offer, it elaborates an answer and sends it back to the offerer.
  • the answer contains the media parameters that the answerer is willing to use for that particular session.
  • SDP may be used as the format for the offer and the answer.
  • Zero media streams implies that the offerer wishes to communicate, but that the streams for the session will be added at a later time through a modified offer.
  • the list of media formats for each media stream comprises the set of formats (codecs and any parameters associated with the codec, in the case of RTP) that the offerer is capable of sending and/or receiving (depending on the direction attributes). If multiple formats are listed, it means that the offerer is capable of making use of any of those formats during the session and thus the answerer may change formats in the middle of the session, making use of any of the formats listed, without sending a new offer.
  • for a sendonly stream, the offer indicates those formats the offerer is willing to send for this stream.
  • for a recvonly stream, the offer indicates those formats the offerer is willing to receive for this stream.
  • for a sendrecv stream, the offer indicates those codecs or formats that the offerer is willing to send and receive with.
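  • For illustration only, the sketch below composes a minimal SDP media description with a direction attribute as discussed above; the port, payload type number and codec name are arbitrary example values, and a complete offer would additionally carry session-level lines (v=, o=, s=, c=, t=).

```python
def sdp_video_media_section(port: int, payload_type: int, codec: str,
                            clock_rate: int, direction: str = "sendrecv") -> str:
    """Build a minimal SDP media description (m= line plus attributes)."""
    assert direction in ("sendrecv", "sendonly", "recvonly", "inactive")
    lines = [
        f"m=video {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} {codec}/{clock_rate}",
        f"a={direction}",
    ]
    return "\r\n".join(lines) + "\r\n"

# Example: an offerer willing to both send and receive H.265 video.
print(sdp_video_media_section(49170, 96, "H265", 90000))
```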
  • SDP may be used for declarative purposes, e.g. for describing a stream available to be received over a streaming session.
  • SDP may be included in Real Time Streaming Protocol (RTSP).
  • RTSP Real Time Streaming Protocol
  • Multipurpose Internet Mail Extensions (MIME) is an extension to an email protocol which makes it possible to transmit and receive different kinds of data files on the Internet, for example video, audio, images, and software.
  • An internet media type is an identifier used on the Internet to indicate the type of data that a file contains. Such internet media types may also be called content types.
  • MIME type/subtype combinations exist that can contain different media formats.
  • Content type information may be included by a transmitting entity in a MIME header at the beginning of a media transmission. A receiving entity thus may need to examine the details of such media content to determine if the specific elements can be rendered given an available set of codecs. Especially when the end system has limited resources, or the connection to the end system has limited bandwidth, it may be helpful to know from the content type alone if the content can be rendered.
  • One of the original motivations for MIME is the ability to identify the specific media type of a message part. However, due to various factors, it is not always possible from looking at the MIME type and subtype to know which specific media formats are contained in the body part or which codecs are indicated in order to render the content. Optional media parameters may be provided in addition to the MIME type and subtype to provide further details of the media content.
  • Optional media parameters may be specified to apply for certain direction attribute(s) with an SDP offer/answer and/or for declarative purposes.
  • Optional media parameters may be specified not to apply for certain direction attribute(s) with an SDP offer/answer and/or for declarative purposes. Semantics of optional media parameters may depend on and may differ based on which direction attribute(s) of an SDP offer/answer they are used with and/or whether they are used for declarative purposes.
  • NBMP Network Based Media Processing
  • ISO/IEC 23090-8
  • MPEG Moving Picture Experts Group
  • NBMP defines a framework that enables initializing and controlling media processing in the network.
  • the NBMP framework defines the interfaces including both data formats and application programming interfaces (APIs) among the entities connected through the digital networks for media processing. Users can access and configure their operations remotely for efficient processing.
  • the framework describes and manages workflows to be applied to the media data. This process includes uploading of media data to the network, instantiation of the media processing tasks, and configuration of the tasks.
  • the framework enables dynamic creation of media processing pipelines, access of processed media data and metadata in real-time or in a deferred way.
  • FIG. 1 shows an NBMP architecture for use in accordance with example embodiments of the invention.
  • an NBMP source 101 providing an NBMP workflow API with a workflow description 102 to an NBMP workflow manager 105.
  • the NBMP workflow manager 105 is processing the NBMP workflow API with a function repository 110, and the NBMP source 101 is also exchanging a function discovery API and function description 104 with the Function repository 110.
  • the NBMP workflow manager 105 provides to the media processing entity (MPE) 112 the NBMP task API 108 including task configuration and reporting the current task status.
  • MPE media processing entity
  • the media processing entity (MPE) 112 processes the media flow 117 from the media source 115 using a task 1 and task 2 configuration 111. Then as shown in Figure 1 a media flow 119 is output towards the media sink 120.
  • the operations at 102, 104, 106, and 108 include control flow operations, and the operations 117, 111 , and 119 include data flow operations.
  • NBMP Network-Based Media Processing
  • NBMP processing relies on a Workflow Manager, that can be virtualized, to start and control media processing.
  • the Workflow Manager receives a Workflow Description from the NBMP Source, which instructs the Workflow Manager about the desired processing and the input and output formats to be taken and produced, respectively.
  • the workflow manager (the Manager) creates a workflow based on the workflow description document (WDD) that it receives from the NBMP Source.
  • the workflow manager selects and deploys the NBMP Functions into selected Media Processing Entities and then performs the configuration of the tasks.
  • the WDD can include a number of logic descriptors
  • the NBMP can define APIs and formats such as Function templates and Workflow Description Document (WDD) consisting of a number of logic descriptors.
  • NBMP uses the so-called descriptors as the basic elements for all its resource documents such as the workflow documents, task documents, and function documents.
  • Descriptors are a group of NBMP parameters which describe a set of related characteristics of Workflow, Function or Task. Some key descriptors are General, Input, Output, Processing, Requirements, Configuration etc.
  • All updates to the workflow are performed through Workflow Manager.
  • the manager is the single point of access for the creation or change of any workflows.
  • Workflows represent the processing flows defined in the WDD provided by the NBMP Source (a.k.a. the client).
  • a workflow can be defined as a chain of tasks, specified by the “connection-map” Object in the Processing Descriptor of the WDD.
  • the Workflow Manager may use pre-determined implementations of media processing functions and use them together to create the media processing workflow.
  • NBMP defines a Function Discovery API that it uses with a Function Repository to discover and load the desired Functions.
  • a Function, once loaded, becomes a Task, which is then configured by the Workflow Manager through the Task API and can start processing incoming media. It is noted that cloud and/or network service providers can define their own APIs to assign computing resources to their customers.
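  • As an illustration of the descriptor-based structure discussed above, the sketch below assembles a highly simplified workflow description as a plain dictionary. The descriptor names (General, Input, Output, Processing with its connection-map, Requirements) come from the text above, but the remaining keys and values are hypothetical and do not follow the exact ISO/IEC 23090-8 JSON schema.

```python
import json

def simple_wdd(workflow_id: str) -> dict:
    """A toy Workflow Description with two chained tasks (decode -> mix)."""
    return {
        "General": {"id": workflow_id, "name": "conference-mixing"},
        "Input": {"media-parameters": [{"stream-id": "in-1", "codec": "video"}]},
        "Output": {"media-parameters": [{"stream-id": "out-1", "codec": "video"}]},
        "Processing": {
            # The connection-map object in the Processing descriptor
            # specifies the workflow as a chain of tasks.
            "connection-map": [
                {"from": {"id": "decode-task", "port": "out"},
                 "to": {"id": "mix-task", "port": "in"}}
            ],
        },
        "Requirements": {"delay": "low"},
    }

print(json.dumps(simple_wdd("wdd-001"), indent=2))
```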
  • the Advanced Video Coding standard (which may be abbreviated AVC or H.264/AVC) was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC).
  • JVT Joint Video Team
  • VCEG Video Coding Experts Group
  • MPEG Moving Picture Experts Group
  • ISO International Organization for Standardization
  • IEC International Electrotechnical Commission
  • the H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 also known as MPEG-4 Part 10 Advanced Video Coding (AVC).
  • High Efficiency Video Coding standard (which may be abbreviated HEVC or H.265/HEVC) was developed by the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and MPEG.
  • JCT-VC Joint Collaborative Team on Video Coding
  • the standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC).
  • Extensions to H.265/HEVC include scalable, multiview, three-dimensional, and fidelity range extensions, which may be referred to as SHVC, MV-HEVC, 3D-HEVC, and REXT respectively.
  • the Versatile Video Coding standard (which may be abbreviated VVC, H.266, or H.266/VVC) was developed by the Joint Video Experts Team (JVET), which is a collaboration between the ISO/IEC MPEG and ITU-T VCEG.
  • JVET Joint Video Experts Team
  • bitstream and coding structures, and concepts of H.264/AVC, HEVC, and VVC, and some of their extensions are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented.
  • Some of the key definitions, bitstream and coding structures and concepts may be common among some video coding standards - hence, they may be described below jointly.
  • the aspects of various embodiments are not limited to H.264/AVC, HEVC, VVC, or their extensions, but rather the description is given for one possible basis on top of which the present embodiments may be partly or fully realized.
  • Video codec may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the compressed representation may be referred to as a bitstream or a video bitstream.
  • a video encoder and/or a video decoder may also be separate from each other, i.e. need not form a codec.
  • the encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • a bitstream may be defined as a sequence of bits, which may in some coding formats or standards be in the form of a Network Abstraction Layer (NAL) unit stream or a byte stream, that forms the representation of coded pictures and associated data forming one or more coded video sequences.
  • NAL Network Abstraction Layer
  • a video decoding process may be specified for a bitstream, conforming to certain bitstream syntax and constraints, which may be specified e.g. in a coding standard. Consequently, the decoding of two bitstreams may be completely independent of each other.
  • a first bitstream may be followed by a second bitstream in the same logical channel, such as in the same file or in the same connection of a communication protocol.
  • An elementary stream in the context of video coding may be defined as a sequence of one or more bitstreams.
  • the end of the first bitstream may be indicated by a specific NAL unit, which may be referred to as the end of bitstream (EOB) NAL unit and which is the last NAL unit of the bitstream.
  • EOB end of bitstream
  • a coded video sequence may be defined as such a sequence of coded pictures in decoding order that is independently decodable and is followed by another coded video sequence or the end of the bitstream.
  • a coded picture is a coded representation of a picture.
  • Hybrid video codecs may encode the video information in two phases.
  • pixel values in a certain picture area are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner).
  • the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This may be done by transforming the difference in pixel values using a specified transform, e.g. the Discrete Cosine Transform (DCT), quantizing the coefficients, and entropy coding the quantized coefficients.
  • In temporal prediction (a.k.a. inter prediction), the sources of prediction are previously decoded pictures (a.k.a. reference pictures).
  • In intra block copy (IBC), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process.
  • Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively.
  • inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or similar process as temporal prediction.
  • Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
  • Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction may be exploited in intra coding, where no inter prediction is applied.
  • One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients.
  • Many parameters can be entropy- coded more efficiently if they are predicted first from spatially or temporally neighboring parameters.
  • a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
  • Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters.
  • Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-based variable length coding (CAVLC) or any similar entropy coding.
  • Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
  • Video coding standards may specify the bitstream syntax and semantics as well as the decoding process for error-free bitstreams, whereas the encoding process might not be specified, but encoders may just be required to generate conforming bitstreams. Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD).
  • HRD Hypothetical Reference Decoder
  • the standards may contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding may be optional and decoding process for erroneous bitstreams might not have been specified.
  • a syntax element may be defined as an element of data represented in the bitstream.
  • a syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
  • An elementary unit for the input to an encoder and the output of a decoder, respectively, in most cases is a picture.
  • a picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture or a reconstructed picture.
  • the source and decoded pictures each comprise one or more sample arrays, such as one of the following sets of sample arrays:
  • Green, Blue and Red (GBR, also known as RGB)
  • these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr; regardless of the actual color representation method in use.
  • the actual color representation method in use can be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of HEVC or alike.
  • VUI Video Usability Information
  • a component may be defined as an array or single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that compose a picture in monochrome format.
  • a picture may be defined to be either a frame or a field.
  • a frame comprises a matrix of luma samples and possibly the corresponding chroma samples.
  • a field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or chroma sample arrays may be subsampled when compared to luma sample arrays.
  • In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
  • In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
  • In 4:4:4 sampling, each of the two chroma arrays has the same height and width as the luma array.
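  • The relationship between luma and chroma array sizes in the sampling formats above can be summarized with a small helper; monochrome is included for completeness.

```python
def chroma_array_size(luma_width: int, luma_height: int, chroma_format: str):
    """Return the (width, height) of each chroma array for a given sampling format."""
    if chroma_format == "4:2:0":       # half width, half height
        return luma_width // 2, luma_height // 2
    if chroma_format == "4:2:2":       # half width, same height
        return luma_width // 2, luma_height
    if chroma_format == "4:4:4":       # same width and height as luma
        return luma_width, luma_height
    if chroma_format == "monochrome":  # chroma arrays absent
        return 0, 0
    raise ValueError(f"unknown chroma format: {chroma_format}")

print(chroma_array_size(1920, 1080, "4:2:0"))   # (960, 540)
```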
  • Coding formats or standards may allow to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream.
  • each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
  • the location of chroma samples with respect to luma samples may be determined in the encoder side (e.g. as pre-processing step or as part of encoding).
  • the chroma sample positions with respect to luma sample positions may be pre-defined for example in a coding standard, such as H.264/AVC or HEVC, or may be indicated in the bitstream for example as part of the VUI of H.264/AVC or HEVC.
  • the source video sequence(s) provided as input for encoding may either represent interlaced source content or progressive source content. Fields of opposite parity have been captured at different times for interlaced source content. Progressive source content contains captured frames.
  • An encoder may encode fields of interlaced source content in two ways: a pair of interlaced fields may be coded into a coded frame or a field may be coded as a coded field.
  • an encoder may encode frames of progressive source content in two ways: a frame of progressive source content may be coded into a coded frame or a pair of coded fields.
  • a field pair or a complementary field pair may be defined as two fields next to each other in decoding and/or output order, having opposite parity (i.e. one being a top field and the other being a bottom field).
  • Some video coding standards or schemes allow mixing of coded frames and coded fields in the same coded video sequence.
  • predicting a coded field from a field in a coded frame and/or predicting a coded frame for a complementary field pair may be enabled in encoding and/or decoding. Partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
  • CTUs coding tree units
  • a coding tree unit (CTU) is first partitioned by a quaternary tree (a.k.a. quadtree) structure. Then the quaternary tree leaf nodes can be further partitioned by a multi-type tree structure. There are four splitting types in multi-type tree structure, vertical binary splitting, horizontal binary splitting, vertical ternary splitting, and horizontal ternary splitting.
  • the multi-type tree leaf nodes are called coding units (CUs).
  • CU, PU and TU have the same block size, unless the CU is too large for the maximum transform length.
  • a segmentation structure for a CTU is a quadtree with nested multi-type tree using binary and ternary splits, i.e. no separate CU, PU and TU concepts are in use, except when needed for CUs that have a size too large for the maximum transform length.
  • a CU can have either a square or rectangular shape.
  • the decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame.
  • the decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • the filtering may for example include one or more of the following: deblocking, sample adaptive offset (SAO), adaptive loop filtering (ALF), and/or luma mapping with chroma scaling (LMCS).
  • deblocking sample adaptive offset (SAO)
  • ALF adaptive loop filtering
  • LMCS luma mapping with chroma scaling
  • a deblocking filter (DBF) may be configured to reduce blocking artefacts due to block-based coding.
  • DBF may be applied (e.g., applied only) to samples located at prediction unit and/or transform block boundaries, and may be constrained not to apply at the picture boundaries or when disabled at image segment boundaries, such as slice, tile, or subpicture boundaries.
  • Horizontal filtering may be applied (e.g., applied first) for vertical boundaries, and vertical filtering may be applied for horizontal boundaries.
  • the deblocking loop filter may include multiple filtering modes or strengths, which may be adaptively selected based on the features of the blocks adjacent to the boundary, such as the quantization parameter value, and/or signaling included by the encoder in the bitstream.
  • the deblocking loop filter may comprise a normal filtering mode and a strong filtering mode, which may differ in terms of the number of filter taps (i.e. number of samples being filtered on both sides of the boundary) and/or the filter tap values. For example, filtering of two samples along both sides of the boundary may be performed with a filter having the impulse response of (3 7 9 -3)/16, when omitting the potential impact of a clipping operation.
  • Sample adaptive offset may be another in-loop filtering process that modifies decoded samples by conditionally adding an offset value to a sample (e.g., each sample), based on values in look-up tables transmitted by the encoder.
  • SAO may have one or more (e.g., two) operation modes; such as band offset and edge offset modes.
  • band offset mode an offset may be added to the sample value depending on the sample amplitude.
  • the full sample amplitude range may be divided into a fixed number of bands, such as 32 bands, and sample values belonging to four of these bands may be modified by adding a positive or negative offset, which may be signalled for each coding tree unit (CTU).
  • CTU coding tree unit
  • the horizontal, vertical, and two diagonal gradients may be used for classification.
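  • A minimal sketch of the band offset mode described above: the sample amplitude range is divided into 32 equal bands, and samples falling into four bands (here assumed to be four consecutive bands starting at a signalled band position, as in HEVC) receive a signalled offset. Clipping is included, but the CTU-level signalling itself is omitted.

```python
def sao_band_offset(samples, band_position, offsets, bit_depth=8):
    """Apply a SAO-style band offset to a list of sample values."""
    assert len(offsets) == 4
    shift = bit_depth - 5                 # 32 bands: band index = sample >> shift
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        delta = (s >> shift) - band_position
        if 0 <= delta < 4:                # sample belongs to one of the four bands
            s = min(max(s + offsets[delta], 0), max_val)
        out.append(s)
    return out

# Example: adjust samples that fall into bands 10..13 of an 8-bit picture.
print(sao_band_offset([70, 80, 90, 200], band_position=10, offsets=[2, 2, 1, 1]))
```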
  • an Adaptive Loop Filter with block-based filter adaption may be applied.
  • ALF may be applied for luma, chroma, and/or cross component (CC ALF or CC-ALF).
  • CC ALF cross component
  • CC-ALF uses luma sample values to refine each chroma component by applying an adaptive, linear filter to the luma channel and then using the output of this filtering operation for chroma refinement.
  • Each ALF APS can have up to 25 luma filters, up to 8 chroma filters, up to 4 CC-ALF filters for the Cb component, and up to 4 CC-ALF filters for the Cr component.
  • ALF filter parameters may be signalled in Adaptation Parameter Set (APS). In one ALF APS, up to 25 (i.e. NumAlfFilters) sets of luma filter coefficients and clipping value indexes, and up to eight sets of chroma filter coefficients and clipping value indexes could be signalled.
  • in a VVC slice header, the number and identifiers of the ALF APSs used for the current slice are signaled.
  • ALF APS identifiers can be signaled to specify the luma filter sets that are used for the current slice.
  • the filtering process can further be controlled at CTU level.
  • a flag is signalled to indicate whether ALF is applied to luma blocks in a CTU.
  • an encoder may choose and indicate in a bitstream and a decoder may decode from a bitstream, on CTU basis, a filter set among 16 fixed filter sets (pre-defined in the VVC standard) and the filter sets from ALF APSs.
  • a filter set index is signaled for a CTU to indicate which luma filter set is applied.
  • the 16 fixed filter sets are pre-defined and hard-coded in both the encoder and the decoder.
  • For chroma ALF in VVC, an APS index is signaled in the slice header to indicate the chroma filter sets being used for the current slice. At CTU level, a filter index is signaled for each CTU if there is more than one chroma filter set in the ALF APS.
  • VVC encoders may generate the content of ALF APSs adaptively based on the content and may select the applied ALF APSs in a spatially adaptive manner both on slice basis and on CTU basis.
  • LMCS is a process that is applied as part of the decoding process that maps luma samples to particular values and may apply a scaling operation to the values of chroma samples.
  • An encoder may choose the LMCS parameters adaptively on picture basis based on the content of the picture(s).
  • the motion information may be indicated with motion vectors associated with each motion compensated image block in video codecs.
  • Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures.
  • to represent motion vectors efficiently, they may be coded differentially with respect to block-specific predicted motion vectors.
  • the predicted motion vectors may be created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
  • Another way to create motion vector predictors is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
  • the reference index of previously coded/decoded picture can be predicted.
  • the reference index may be predicted from adjacent blocks and/or co-located blocks in temporal reference picture.
  • high efficiency video codecs may employ an additional motion information coding/decoding mechanism often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction.
  • predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures and the used motion field information is signaled among a list of motion field candidate list filled with motion field information of available adjacent/co-located blocks.
  • Video codecs may support motion compensated prediction from one source image (uni-prediction) and two sources (bi-prediction).
  • In uni-prediction, a single motion vector is applied, whereas in the case of bi-prediction, two motion vectors are signaled and the motion compensated predictions from two sources are averaged to create the final sample prediction.
  • In the case of weighted prediction, the relative weights of the two predictions can be adjusted, or a signaled offset can be added to the prediction signal.
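  • The averaging and weighting described above can be sketched per sample as follows; the clipping and rounding are simplified and do not reproduce the normative formulas of any particular codec.

```python
def bi_prediction(pred0, pred1, w0=0.5, w1=0.5, offset=0, max_val=255):
    """Combine two motion-compensated prediction blocks sample by sample.

    With w0 == w1 == 0.5 and offset == 0 this is plain bi-prediction
    averaging; other values illustrate weighted prediction.
    """
    out = []
    for p0, p1 in zip(pred0, pred1):
        value = int(round(w0 * p0 + w1 * p1 + offset))
        out.append(min(max(value, 0), max_val))
    return out

print(bi_prediction([100, 120], [110, 130]))                 # averaging: [105, 125]
print(bi_prediction([100, 120], [110, 130], 0.75, 0.25, 4))  # weighted with offset
```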
  • the (de)coding order of pictures need not be the same as the output order of pictures.
  • any previous pictures in (de)coding order may generally be used as reference pictures for motion compensation.
  • the reference pictures for bi-prediction may not be required to comprise both subsequent picture(s) and previous picture(s) in output order.
  • temporally forward prediction may refer to uni-prediction or bi-prediction where all the reference pictures are preceding the current picture in output order.
  • temporally backward prediction may refer to uni-prediction or bi-prediction where all the reference pictures are succeeding the current picture in output order.
  • temporally bi-directional prediction may refer to uni-prediction or bi-prediction where reference pictures are or may be both preceding and succeeding the current picture in output order.
  • the prediction residual after motion compensation or intra prediction may be first transformed with a transform kernel (like DCT) and then coded.
  • JCCR joint coding of chroma residuals
  • the residual samples for both Cb and Cr are coded as a single transform block.
  • one of the chroma channels is assigned the decoded transform block, while the other channel is assigned either a negative or a positive version of the block, or a version with the sample values halved.
  • Signalling on transform block level is used to indicate if halved or full sample values are used, while there is a picture header flag (the ph_joint_cbcr_sign_flag syntax element) determining the sign for the other residual.
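  • A sketch of the joint chroma residual reconstruction described above: one chroma channel is assigned the decoded residual block as such, and the other channel receives a sign-controlled and optionally halved version. The mapping of channels and the exact scaling are simplified relative to the normative VVC mode table.

```python
def reconstruct_joint_chroma(joint_residual, sign_flag, halved):
    """Derive Cb and Cr residual blocks from one jointly coded residual block.

    joint_residual : list of decoded residual samples for the block
    sign_flag      : +1 or -1, a picture-level sign for the derived channel
    halved         : True if the derived channel uses halved sample values
    """
    primary = list(joint_residual)           # one chroma channel gets the block as-is
    if halved:
        derived = [sign_flag * int(r / 2) for r in joint_residual]
    else:
        derived = [sign_flag * r for r in joint_residual]
    return primary, derived

# Example: Cb carries the joint residual, Cr gets a sign-flipped, halved copy.
cb, cr = reconstruct_joint_chroma([8, -4, 2], sign_flag=-1, halved=True)
print(cb, cr)
```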
  • Video encoders may utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired Macroblock mode and associated motion vectors.
  • This kind of cost function uses a weighting factor λ (lambda) to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
  • C = D + λR (Eq. 1)
  • where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
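  • The mode decision implied by Eq. 1 can be sketched as follows; the candidate distortions and rates are placeholder numbers, and in practice the lambda value would be derived e.g. from the quantization parameter.

```python
def rd_cost(distortion: float, rate_bits: float, lmbda: float) -> float:
    """Lagrangian cost C = D + lambda * R (Eq. 1)."""
    return distortion + lmbda * rate_bits

def choose_mode(candidates, lmbda):
    """Pick the candidate (mode, D, R) with the lowest Lagrangian cost."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lmbda))

# Example candidates: (mode name, distortion as MSE, rate in bits).
candidates = [("intra", 40.0, 900), ("inter_uni", 55.0, 300), ("merge", 70.0, 60)]
print(choose_mode(candidates, lmbda=0.1))
```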
  • Some codecs use a concept of picture order count (POC).
  • a value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures.
  • a POC value of a picture may be relative to POC values of other pictures in the same coded video sequence, whereas POC values of pictures in different coded video sequences might not be indicative of their respective output order.
  • POC may be used in the decoding process for example for implicit scaling of motion vectors and for reference picture list initialization. Furthermore, POC may be used in the verification of output order conformance.
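  • Since POC is non-decreasing with output order, the output order of the decoded pictures of one coded video sequence can be recovered by sorting on POC, as in the small sketch below; the sketch ignores the modulo arithmetic used when POC is signalled in a bitstream, and in the merging embodiments above the merger would analogously assign fresh POC values to the output bitstream.

```python
def output_order(decoded_pictures):
    """Order decoded pictures for output by their picture order count (POC).

    decoded_pictures: list of (poc, picture) tuples in decoding order.
    """
    return [pic for poc, pic in sorted(decoded_pictures, key=lambda t: t[0])]

# Decoding order differs from output order when pictures are reordered
# (e.g. hierarchical prediction): POC 0, 4, 2, 1, 3 decoded in that order.
decoded = [(0, "I0"), (4, "P4"), (2, "B2"), (1, "B1"), (3, "B3")]
print(output_order(decoded))   # ['I0', 'B1', 'B2', 'B3', 'P4']
```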
  • a compliant bit stream must be able to be decoded by a hypothetical reference decoder that may be conceptually connected to the output of an encoder and consists of at least a pre-decoder buffer, a decoder and an output/display unit.
  • This virtual decoder may be known as the hypothetical reference decoder (HRD) or the video buffering verifier (VBV).
  • HRD hypothetical reference decoder
  • VBV video buffering verifier
  • a stream is compliant if it can be decoded by the HRD without buffer overflow or, in some cases, underflow. Buffer overflow happens if more bits are to be placed into the buffer when it is full. Buffer underflow happens if some bits are not in the buffer when said bits are to be fetched from the buffer for decoding/playback.
  • One of the motivations for the HRD is to avoid so-called evil bitstreams, which would consume such a large quantity of resources that practical decoder implementations would not be able to handle.
  • HRD models may include instantaneous decoding, while the input bitrate to the coded picture buffer (CPB) of the HRD may be regarded as a constraint for the encoder and the bitstream on the decoding rate of coded data and a requirement for decoders for the processing rate.
  • An encoder may include a CPB as specified in the HRD for verifying and controlling that buffering constraints are obeyed in the encoding.
  • a decoder implementation may also have a CPB that may, but does not necessarily, operate similarly or identically to the CPB specified for HRD.
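  • In the spirit of the CPB discussion above, the sketch below simulates a toy constant-bitrate coded picture buffer in which bits arrive at a fixed rate and each access unit is removed instantaneously at its removal time; it only checks for overflow and underflow and ignores initial buffering delays, low-delay modes and other normative details.

```python
def check_cpb(access_units, bitrate, cpb_size):
    """Simulate a simple constant-bitrate coded picture buffer.

    access_units: list of (removal_time_seconds, size_bits) in decoding order.
    Returns a violation description, or None if the sequence fits the buffer.
    """
    fullness = 0.0
    previous_time = 0.0
    for removal_time, size_bits in access_units:
        # Bits flow into the buffer at the given rate until the removal time.
        fullness += bitrate * (removal_time - previous_time)
        if fullness > cpb_size:
            return f"overflow before t={removal_time}s"
        if fullness < size_bits:
            return f"underflow at t={removal_time}s"
        fullness -= size_bits          # instantaneous decoding removes the AU
        previous_time = removal_time
    return None

access_units = [(0.04, 150_000), (0.08, 120_000), (0.12, 110_000)]
print(check_cpb(access_units, bitrate=5_000_000, cpb_size=1_000_000))  # None -> fits
```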
  • a Decoded Picture Buffer (DPB) may be used in the encoder and/or in the decoder. There may be two reasons to buffer decoded pictures: for reference in inter prediction and for reordering decoded pictures into output order. Since some coding formats, such as HEVC, provide a great deal of flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Hence, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is not needed for output.
  • An HRD may also include a DPB. DPBs of an HRD and a decoder implementation may but do not need to operate identically.
  • Output order may be defined as the order in which the decoded pictures are output from the decoded picture buffer (for the decoded pictures that are to be output from the decoded picture buffer).
  • a decoder and/or an HRD may comprise a picture output process.
  • the output process may be considered to be a process in which the decoder provides a decoded and cropped picture as the output of the decoding process.
  • the output process may be a part of video coding standards, e.g. as a part of the hypothetical reference decoder specification.
  • lines and/or columns of samples may be removed from decoded pictures according to a cropping rectangle to form output pictures.
  • a cropped decoded picture may be defined as the result of cropping a decoded picture based on the conformance cropping window specified e.g. in the sequence parameter set that is referred to by the corresponding coded picture.
  • One or more syntax structures for (decoded) reference picture marking may exist in a video coding system.
  • An encoder generates an instance of a syntax structure e.g. in each coded picture, and a decoder decodes an instance of the syntax structure e.g. from each coded picture.
  • the decoding of the syntax structure may cause pictures to be adaptively marked as "used for reference” or "unused for reference”.
  • a reference picture set (RPS) syntax structure of HEVC is an example of a syntax structure for reference picture marking.
  • a reference picture set valid or active for a picture includes all the reference pictures that may be used as reference for the picture and all the reference pictures that are kept marked as "used for reference” for any subsequent pictures in decoding order.
  • the reference pictures that are kept marked as "used for reference” for any subsequent pictures in decoding order but that are not used as reference picture for the current picture or image segment may be considered inactive. For example, they might not be included in the initial reference picture list(s).
  • reference picture for inter prediction may be indicated with an index to a reference picture list.
  • two reference picture lists (reference picture list 0 and reference picture list 1) are generated for each bi-predictive (B) slice, and one reference picture list (reference picture list 0) is formed for each inter-coded (P) slice.
  • reference picture lists are indicated directly in a reference picture list syntax structure.
  • RPL may be used to refer to the reference picture list syntax structure and/or to one or more reference picture lists.
  • the number of active entries in a reference picture list may be indicated by an encoder and/or decoded by a decoder, and may indicate the number of pictures starting from the first list entry that may be used as reference for prediction of the current picture.
  • the entries in a reference picture list that are not among the active entries may be defined to be inactive entries; they are not used as reference for prediction of the current picture, and may be used as reference for prediction of subsequent pictures in decoding order.
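  • A sketch of the active/inactive entry distinction described above: a reference picture list is an ordered list of reference pictures (identified here simply by their POC values), and only the first num_active entries may be referenced by the current picture, while the remaining entries are retained for subsequent pictures. The structure and names are illustrative and do not follow the RPL syntax of any standard.

```python
from typing import List, NamedTuple

class RefPicList(NamedTuple):
    entries: List[int]      # POCs of reference pictures, in list order
    num_active: int         # number of active entries

    def active(self) -> List[int]:
        """Pictures that may be used as reference for the current picture."""
        return self.entries[: self.num_active]

    def ref_poc(self, ref_idx: int) -> int:
        """Resolve a coded reference index against the active entries."""
        if ref_idx >= self.num_active:
            raise ValueError("reference index points to an inactive entry")
        return self.entries[ref_idx]

# Reference picture list 0 for a B slice: two active entries, one entry
# kept only for subsequent pictures in decoding order.
rpl0 = RefPicList(entries=[8, 4, 0], num_active=2)
print(rpl0.active(), rpl0.ref_poc(1))   # [8, 4] 4
```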
  • a Decoded Picture Buffer may be used in the encoder and/or in the decoder. There may be two reasons to buffer decoded pictures, for references in inter prediction and for reordering decoded pictures into output order. Since some codecs, such as VVC, provide flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Hence, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is not needed for output.
  • Scalable video coding refers to coding structure where one bitstream can contain multiple representations of the content at different bitrates, resolutions or frame rates.
  • the receiver can extract the desired representation depending on its characteristics (e.g. resolution that matches best the display device).
  • a server or a network element can extract the portions of the bitstream to be transmitted to the receiver depending on e.g. the network characteristics or processing capabilities of the receiver.
  • a scalable bitstream may include a "base layer" providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers.
  • the coded representation of that layer may depend on the lower layers.
  • the motion and mode information of the enhancement layer can be predicted from lower layers.
  • the pixel data of the lower layers can be used to create prediction for the enhancement layer.
  • a scalable video codec for quality scalability (also known as Signal-to-Noise or SNR scalability) and/or spatial scalability may be implemented as follows.
  • For a base layer, a conventional non-scalable video encoder and decoder are used.
  • the reconstructed/decoded pictures of the base layer are included in the reference picture buffer for an enhancement layer.
  • the base layer decoded pictures may be inserted into a reference picture list(s) for coding/decoding of an enhancement layer picture similarly to the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as inter prediction reference and indicate its use e.g. with a reference picture index in the coded bitstream.
  • the decoder decodes from the bitstream, for example from a reference picture index, that a base-layer picture is used as inter prediction reference for the enhancement layer.
  • a base-layer picture is used as inter prediction reference for an enhancement layer, it is referred to as an inter-layer reference picture.
  • Scalability modes or scalability dimensions may include but are not limited to the following: • Quality scalability: Base layer pictures are coded at a lower quality than enhancement layer pictures, which may be achieved for example using a greater quantization parameter value (i.e., a greater quantization step size for transform coefficient quantization) in the base layer than in the enhancement layer.
  • Spatial scalability: Base layer pictures are coded at a lower resolution (i.e. have fewer samples) than enhancement layer pictures. Spatial scalability and quality scalability may sometimes be considered the same type of scalability.
  • Bit-depth scalability: Base layer pictures are coded at lower bit-depth (e.g. 8 bits) than enhancement layer pictures (e.g. 10 or 12 bits).
  • Dynamic range scalability: Scalable layers represent a different dynamic range and/or images obtained using a different tone mapping function and/or a different optical transfer function.
  • Chroma format scalability: Base layer pictures provide lower spatial resolution in chroma sample arrays (e.g. coded in 4:2:0 chroma format) than enhancement layer pictures (e.g. 4:4:4 format).
  • Color gamut scalability: Enhancement layer pictures have a richer/broader color representation range than that of the base layer pictures - for example the enhancement layer may have UHDTV (ITU-R BT.2020) color gamut and the base layer may have the ITU-R BT.709 color gamut.
  • UHDTV ITU-R BT.2020
  • ROI scalability: An enhancement layer represents a spatial subset of the base layer. ROI scalability may be used together with other types of scalability, e.g. quality or spatial scalability, so that the enhancement layer provides higher subjective quality for the spatial subset.
  • View scalability (which may also be referred to as multiview coding): the base layer represents a first view, whereas an enhancement layer represents a second view.
  • Depth scalability which may also be referred to as depth-enhanced coding.
  • a layer or some layers of a bitstream may represent texture view(s), while other layer or layers may represent depth view(s).
  • base layer information could be used to code enhancement layer to minimize the additional bitrate overhead.
  • Scalability can be enabled in two basic ways. Either by introducing new coding modes for performing prediction of pixel values or syntax from lower layers of the scalable representation or by placing the lower layer pictures to the reference picture buffer (decoded picture buffer, DPB) of the higher layer.
  • the first approach is more flexible and thus can provide better coding efficiency in most cases.
  • the second, reference frame based scalability, approach can be implemented very efficiently with minimal changes to single layer codecs while still achieving the majority of the coding efficiency gains available.
  • a reference frame based scalability codec can be implemented by utilizing the same hardware or software implementation for all the layers, just taking care of the DPB management.
  • NAL Network Abstraction Layer
  • a byte stream format may be specified for NAL unit streams for transmission or storage environments that do not provide framing structures.
  • the byte stream format separates NAL units from each other by attaching a start code in front of each NAL unit.
  • encoders may run a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise.
  • start code emulation prevention may always be performed regardless of whether the byte stream format is in use or not.
  • a NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload (RBSP) interspersed as necessary with emulation prevention bytes.
  • RBSP may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit.
  • An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
  • NAL units consist of a header and payload.
  • a two-byte NAL unit header is used for all specified NAL unit types, while in other codecs NAL unit header may be similar to that in WC.
  • the NAL unit header comprises a five-bit NAL unit type indication (nal_unit_type), a three-bit nuh_temporal_id_plus1 indication for temporal level or sub-layer (may be required to be greater than or equal to 1 ) and a six-bit nuh_layer_id syntax element.
  • the abbreviation TID may be used interchangeably with the Temporalld variable.
  • Temporalld 0 corresponds to the lowest temporal level.
  • the value of nuh_temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes.
  • a sub-layer or a temporal sub-layer may be defined to be a temporal scalable layer (or a temporal layer, TL) of a temporal scalable bitstream.
  • Such temporal scalable layer may comprise VCL NAL units with a particular value of the Temporalld variable and the associated non-VCL NAL units.
  • nuh_layer_id can be understood as a scalability layer identifier.
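A minimal sketch of parsing such a two-byte NAL unit header, assuming the VVC-style field layout (forbidden_zero_bit, nuh_reserved_zero_bit, nuh_layer_id, nal_unit_type, nuh_temporal_id_plus1); the function and class names are illustrative, not part of any specification.

```python
from dataclasses import dataclass

@dataclass
class NalUnitHeader:
    nuh_layer_id: int        # scalability layer identifier (6 bits)
    nal_unit_type: int       # NAL unit type (5 bits)
    temporal_id: int         # Temporalld = nuh_temporal_id_plus1 - 1

def parse_nal_unit_header(two_bytes: bytes) -> NalUnitHeader:
    """Parse a two-byte NAL unit header, assuming the bit layout
    forbidden_zero_bit(1) | nuh_reserved_zero_bit(1) | nuh_layer_id(6) |
    nal_unit_type(5) | nuh_temporal_id_plus1(3)."""
    b0, b1 = two_bytes[0], two_bytes[1]
    if b0 & 0x80:
        raise ValueError("forbidden_zero_bit must be 0")
    nuh_layer_id = b0 & 0x3F
    nal_unit_type = (b1 >> 3) & 0x1F
    nuh_temporal_id_plus1 = b1 & 0x07
    if nuh_temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must be non-zero")
    return NalUnitHeader(nuh_layer_id, nal_unit_type, nuh_temporal_id_plus1 - 1)

# Example: nuh_layer_id 0, nal_unit_type 0, Temporalld 0.
print(parse_nal_unit_header(bytes([0x00, 0x01])))
```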
  • NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units.
  • VCL NAL units may be coded slice NAL units.
  • VCL NAL units contain syntax elements representing one or more CUs.
  • the NAL unit type value within a certain range indicates a VCL NAL unit, and the VCL NAL unit type may indicate a picture type.
  • a Random Access Point (RAP) picture which may also be referred to as an intra random access point (IRAP) picture, when contained in an independent layer, may comprise only intra-coded image segments. Furthermore, a RAP picture may constrain subsequent pictures (within the same layer) in output order to be such that they can be correctly decoded without performing the decoding process of any pictures that precede the RAP picture in decoding order. There may be pictures in a bitstream that contain only intra-coded slices that are not IRAP pictures.
  • random-access picture may be used interchangeably with the term RAP picture or the term IRAP picture.
  • a RAP picture or an IRAP picture may be defined as one category of random-access pictures, characterized in that they contain only intra-coded image segments, whereas other category or categories of random-access pictures may allow in-picture prediction, such as intra block copy.
  • VCL NAL unit types may include the following: trailing (TRAIL), Temporal Sub-layer Access (TSA), Step-wise Temporal Sub-layer Access (STSA), Random Access Decodable Leading (RADL), Random Access Skipped Leading (RASL), Instantaneous Decoding Refresh (IDR), Clean Random Access (CRA).
  • trailing TRAIL
  • TSA Temporal Sub-layer Access
  • STSA Step-wise Temporal Sub-layer Access
  • RADL Random Access Decodable Leading
  • RASL Random Access Skipped Leading
  • IDR Instantaneous Decoding Refresh
  • CRA Clean Random Access
  • VCL NAL unit types may be more fine-grained than indicated in the paragraph above.
  • two types of IDR pictures may be specified: IDR without leading pictures and IDR with random access decodable leading pictures (i.e. without RASL pictures).
  • an IRAP picture may be a CRA picture or an IDR picture.
  • an IRAP picture at an independent layer and all subsequent non-RASL pictures at the independent layer in decoding order can be correctly decoded without performing the decoding process of any pictures that precede the IRAP picture in decoding order.
  • a CRA picture may be the first picture in the bitstream in decoding order, or may appear later in the bitstream.
  • CRA pictures allow so-called leading pictures that follow the CRA picture in decoding order but precede it in output order.
  • Some of the leading pictures, so-called RASL pictures may use pictures decoded before the CRA picture (in decoding order) as a reference.
  • Pictures that follow a CRA picture in both decoding and output order are decodable if random access is performed at the CRA picture, and hence clean random access is achieved similarly to the clean random access functionality of an IDR picture.
  • a CRA picture may have associated RADL or RASL pictures.
  • when the CRA picture is the first picture in the bitstream in decoding order, or the first picture of a coded video sequence in decoding order, any associated RASL pictures are not output by the decoder and may not be decodable, as they may contain references to pictures that are not present in the bitstream.
  • a leading picture is a picture that precedes the associated RAP picture in output order and follows the associated RAP picture in decoding order.
  • the associated RAP picture is the previous RAP picture in decoding order (if present).
  • a leading picture is either a RADL picture or a RASL picture.
  • All RASL pictures are leading pictures of an associated IRAP picture (e.g. CRA picture).
  • when the associated RAP picture is the first coded picture in the coded video sequence or in the bitstream, the RASL picture is not output and may not be correctly decodable, as the RASL picture may contain references to pictures that are not present in the bitstream.
  • a RASL picture can be correctly decoded if the decoding had started from a RAP picture before the associated RAP picture of the RASL picture.
  • RASL pictures are not used as reference pictures for the decoding process of non-RASL pictures. When present, all RASL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture.
  • All RADL pictures are leading pictures. RADL pictures are not used as reference pictures for the decoding process of trailing pictures of the same associated RAP picture. When present, all RADL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture. RADL pictures do not refer to any picture preceding the associated RAP picture in decoding order and can therefore be correctly decoded when the decoding starts from the associated RAP picture.
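The decodability rules above can be summarized as a small decision sketch (a simplified policy under the definitions given here; it ignores e.g. GDR and sub-layer aspects):

```python
def decodable_at_random_access(picture_type: str, decoding_starts_at_rap: bool) -> bool:
    """Decide whether a picture associated with a RAP can be decoded when
    decoding starts at that RAP (simplified policy sketch)."""
    if picture_type == "RASL":
        # RASL pictures may reference pictures before the RAP and are skipped
        # when decoding starts at the associated RAP.
        return not decoding_starts_at_rap
    # RADL and trailing pictures only reference the RAP or later pictures.
    return True

# Starting decoding at a CRA: its RASL leading pictures are skipped,
# RADL leading pictures and trailing pictures are decoded.
for pt in ("RASL", "RADL", "TRAIL"):
    print(pt, decodable_at_random_access(pt, decoding_starts_at_rap=True))
```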
  • IDR picture types: Two IDR picture types may be defined and indicated: IDR pictures without leading pictures and IDR pictures that may have associated decodable leading pictures (i.e. RADL pictures).
  • a trailing picture may be defined as a picture that follows the associated RAP picture in output order (and also in decoding order). Additionally, a trailing picture may be required not to be classified as any other picture type, such as STSA picture.
  • In HEVC, there are two picture types, the TSA and STSA picture types, that can be used to indicate temporal sub-layer switching points.
  • VVC specifies the STSA picture type similarly to the STSA picture type in HEVC. If temporal sub-layers with Temporalld up to N had been decoded until the TSA or STSA picture (exclusive) and the TSA or STSA picture has Temporalld equal to N+1, the TSA or STSA picture enables decoding of all subsequent pictures (in decoding order) having Temporalld equal to N+1.
  • the TSA picture type may impose restrictions on the TSA picture itself and all pictures in the same sub-layer that follow the TSA picture in decoding order.
  • TSA pictures have Temporalld greater than 0.
  • the STSA picture is similar to the TSA picture but does not impose restrictions on the pictures in higher sub-layers that follow the STSA picture in decoding order and hence enables up-switching only onto the sub-layer where the STSA picture resides. In nested temporal scalability, all (trailing) pictures with Temporalld greater than 0 could be labeled as TSA pictures.
  • VVC includes a gradual decoding refresh (GDR) picture type, which is similar to a trailing picture, but also guarantees that decoding can be started from a GDR picture and that decoded pictures are correct in content at and after an indicated position (a.k.a. recovery point) in output order.
  • GDR gradual decoding refresh
  • Images can be split into independently codable and decodable image segments (e.g. slices or tiles). Such image segments may enable parallel processing.
  • Image segments may be coded as separate units in the bitstream, such as VCL NAL units in HEVC and VVC.
  • Coded image segments may comprise a header and a payload, wherein the header contains parameter values needed for decoding the payload.
  • coded slices may comprise a slice header and slice data, where the slice header contains parameter values for decoding the coded image data contained in the slice data.
  • Tiles may refer to image segments that have been defined as rectangular image regions along a tile grid.
  • a picture can be partitioned in tiles, which are rectangular and contain an integer number of CTUs.
  • a picture is divided into one or more tile rows and one or more tile columns and the partitioning of a picture to tiles forms a grid that may be characterized by a list of tile column widths (in CTUs) and a list of tile row heights (in CTUs).
  • a tile is a sequence of coding tree units (CTUs) that covers one “cell” in the tile grid, i.e., a rectangular region of a picture.
  • tiles are ordered in the bitstream consecutively in the raster scan order of the tile grid.
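A small sketch of deriving the tile rectangles, in CTU units and in tile-grid raster-scan order, from the signalled column widths and row heights; the function name and representation are illustrative only.

```python
from typing import List, Tuple

def tile_grid(col_widths_ctu: List[int], row_heights_ctu: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) in CTUs for each tile,
    listed in raster-scan order of the tile grid."""
    tiles = []
    y = 0
    for h in row_heights_ctu:
        x = 0
        for w in col_widths_ctu:
            tiles.append((x, y, w, h))
            x += w
        y += h
    return tiles

# A 3x2 tile grid: columns of 4, 4 and 2 CTUs, rows of 3 and 3 CTUs.
for t in tile_grid([4, 4, 2], [3, 3]):
    print(t)
```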
  • a tile may contain an integer number of slices.
  • a slice consists of an integer number of CTUs.
  • the CTUs are scanned in the raster scan order of CTUs within tiles or within a picture, if tiles are not in use.
  • a slice may contain an integer number of tiles or a slice can be contained in a tile.
  • the CUs have a specific scan order.
  • a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the coded picture.
  • a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL (Network Abstraction Layer) unit. The division of each picture into slice segments is a partitioning.
  • an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment
  • a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order.
  • a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment
  • a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment.
  • the LCUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
  • a slice may be defined as an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile of a picture that are exclusively contained in a single NAL unit.
  • two modes of slices are supported, namely the raster-scan slice mode and the rectangular slice mode.
  • in the raster-scan slice mode, a slice contains a sequence of tiles in a tile raster scan of a picture.
  • in the rectangular slice mode, a slice contains a number of complete tiles or a number of contiguous and complete CTU rows within a single tile.
  • a slice is a video coding layer (VCL) NAL unit.
  • an isolated region may be defined as a picture region that is allowed to depend only on the corresponding isolated region in reference pictures and does not depend on any other picture regions in the current picture or in the reference pictures.
  • the corresponding isolated region in reference pictures may be for example the picture region that collocates with the isolated region in a current picture.
  • a coded isolated region may be decoded without the presence of any picture regions of the same coded picture.
  • Pictures whose isolated regions are predicted from each other, may be grouped into an isolated-region picture group.
  • An isolated region can be inter-predicted from the corresponding isolated region in other pictures within the same isolated-region picture group, whereas inter prediction from other isolated regions or outside the isolated-region picture group may be disallowed.
  • a leftover region (a.k.a. non-isolated region) may be defined as a picture region that is not constrained like an isolated region and thus may be predicted from picture regions that do not correspond to the leftover region itself in the current picture or reference pictures.
  • a subpicture may be defined as a rectangular region of one or more slices within a picture, wherein the one or more slices are complete. Consequently, a subpicture consists of one or more slices that collectively cover a rectangular region of a picture.
  • the slices of a subpicture may be required to be rectangular slices.
  • partitioning of a picture to subpictures may be indicated in and/or decoded from an SPS (Sequence Parameter Set); in other words, the subpicture layout may be indicated in and/or decoded from an SPS.
  • the SPS syntax may indicate the partitioning of a picture to subpictures e.g. by providing for each subpicture syntax elements indicative of: the x and y coordinates of the top-left corner of the subpicture, the width of the subpicture, and the height of the subpicture, in coding tree units (CTU).
  • a subpicture layout indicates the positions, widths, and heights of subpictures within a picture but does not assign subpictures or subpicture sequences with any particular identifiers to the subpicture layout.
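As an illustration, a simplified sketch of representing such a subpicture layout and checking basic consistency (subpictures inside the picture, non-overlapping, covering all CTUs); this is not the full set of conformance constraints of any codec, and the names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subpic:
    ctu_x: int        # x of the top-left corner, in CTUs
    ctu_y: int        # y of the top-left corner, in CTUs
    width_ctu: int
    height_ctu: int

def overlaps(a: Subpic, b: Subpic) -> bool:
    return not (a.ctu_x + a.width_ctu <= b.ctu_x or b.ctu_x + b.width_ctu <= a.ctu_x or
                a.ctu_y + a.height_ctu <= b.ctu_y or b.ctu_y + b.height_ctu <= a.ctu_y)

def validate_layout(layout: List[Subpic], pic_w_ctu: int, pic_h_ctu: int) -> bool:
    """Check that subpictures are inside the picture, do not overlap and
    together cover every CTU (a simplified consistency check)."""
    covered = 0
    for i, s in enumerate(layout):
        if s.ctu_x + s.width_ctu > pic_w_ctu or s.ctu_y + s.height_ctu > pic_h_ctu:
            return False
        if any(overlaps(s, o) for o in layout[i + 1:]):
            return False
        covered += s.width_ctu * s.height_ctu
    return covered == pic_w_ctu * pic_h_ctu

# Two side-by-side subpictures covering a 10x6 CTU picture.
print(validate_layout([Subpic(0, 0, 5, 6), Subpic(5, 0, 5, 6)], 10, 6))
```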
  • one or more of the following properties may be indicated (e.g. by an encoder) or decoded (e.g. by a decoder) or inferred (e.g. by an encoder and/or a decoder) for the subpictures collectively or per each subpicture individually: i) whether or not a subpicture is treated as a picture in the decoding process; in some cases, this property excludes in-loop filtering operations, which may be separately indicated/decoded/inferred; ii) whether or not in-loop filtering operations are performed across the subpicture boundaries.
  • a VVC subpicture with boundaries treated like picture boundaries is an example of an isolated region.
  • A motion-constrained tile set (MCTS) is such that the inter prediction process is constrained in encoding such that no sample value outside the MCTS, and no sample value at a fractional sample position that is derived using one or more sample values outside the MCTS, is used for inter prediction of any sample within the MCTS. Additionally, the encoding of an MCTS is constrained in a manner that motion vector candidates are not derived from blocks outside the MCTS.
  • an MCTS may be defined to be a tile set that is independent of any sample values and coded data, such as motion vectors, that are outside the MCTS.
  • An MCTS sequence may be defined as a sequence of respective MCTSs in one or more coded video sequences or alike. In some cases, an MCTS may be required to form a rectangular area.
  • an MCTS may refer to the tile set within a picture or to the respective tile set in a sequence of pictures.
  • the respective tile set may be, but in general need not be, collocated in the sequence of pictures.
  • a motion-constrained tile set may be regarded as an independently coded tile set, since it may be decoded without the other tile sets.
  • An MCTS is an example of an isolated region.
  • a non-VCL NAL unit may be for example one of the following types: a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), an adaptation parameter set (APS), a supplemental enhancement information (SEI) NAL unit, a picture header (PH) NAL unit, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit.
  • VPS video parameter set
  • SPS sequence parameter set
  • PPS picture parameter set
  • APS adaptation parameter set
  • SEI Supplemental Enhancement Information
  • PH picture header
  • Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units might not be necessary for the reconstruction of decoded sample values.
  • the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation. It may be possible for multiple layers to share an SPS. PPS includes the parameters that are common and remain unchanged for all slices of a coded picture and are likely to be shared by many coded pictures.
  • VUI video usability information
  • Each slice header (in HEVC) or each picture header (in VVC) includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice or the picture, respectively, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices.
  • parameter sets can be included as a media parameter in the session description for Real-time Transport Protocol (RTP) sessions. If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
  • RTP Real-time Transport Protocol
  • Out-of-band transmission, signaling or storage can additionally or alternatively be used for other purposes than tolerance against transmission errors, such as ease of access or session negotiation.
  • a sample entry of a track in a file conforming to the ISO Base Media File Format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file.
  • the phrase along the bitstream (e.g. indicating along the bitstream) may be used in claims and described embodiments to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
  • a parameter set may be activated by a reference from a slice or from another active parameter set or in some cases from another syntax structure.
  • a parameter set may be activated when it is referenced e.g. through its identifier.
  • a header of an image segment such as a slice header, may contain an identifier of the PPS (a.k.a. PPS ID) that is activated for decoding the coded picture containing the image segment.
  • PPS may contain an identifier of the SPS that is activated, when the PPS is activated.
  • An activation of a parameter set of a particular type may cause the deactivation of the previously active parameter set of the same type.
  • the parameters of an activated parameter set may be used or referenced in the decoding process.
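A minimal sketch of this activation chain (an image segment header referencing a PPS ID, the PPS referencing an SPS ID, activation replacing the previously active set of the same type); the data structures and names are illustrative assumptions.

```python
class ParameterSetContext:
    """Minimal sketch of parameter set activation (PPS -> SPS chain)."""

    def __init__(self):
        self.sps_store = {}     # sps_id -> parsed SPS (a dict in this sketch)
        self.pps_store = {}     # pps_id -> parsed PPS (a dict in this sketch)
        self.active = {"SPS": None, "PPS": None}

    def add_sps(self, sps_id, sps):
        self.sps_store[sps_id] = sps

    def add_pps(self, pps_id, pps):
        self.pps_store[pps_id] = pps

    def activate_for_picture(self, pps_id):
        # Referencing a PPS from a picture/slice header activates it and,
        # transitively, the SPS it refers to; the previously active sets of
        # the same types are deactivated.
        pps = self.pps_store[pps_id]
        sps = self.sps_store[pps["sps_id"]]
        self.active["PPS"] = pps
        self.active["SPS"] = sps
        return sps, pps

ctx = ParameterSetContext()
ctx.add_sps(0, {"sps_id": 0, "ctu_size": 128})
ctx.add_pps(3, {"pps_id": 3, "sps_id": 0, "mixed_nalu_types": False})
print(ctx.activate_for_picture(3))
```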
  • An adaptation parameter set may be defined as a syntax structure that applies to zero or more slices. There may be different types of adaptation parameter sets.
  • An adaptation parameter set may for example contain filtering parameters for a particular type of a filter.
  • three types of APSs are specified carrying parameters for one of: adaptive loop filter (ALF), luma mapping with chroma scaling (LMCS), and scaling lists.
  • a scaling list may be defined as a list that associates each frequency index with a scale factor for the scaling process, which multiplies transform coefficient levels by a scaling factor, resulting in transform coefficients.
  • an APS is referenced through its type (e.g. ALF, LMCS, or scaling list) and identifier. In other words, different types of APSs have their own identifier value ranges.
  • in a parameter set, such as an SPS, constraints are included in the general constraints information syntax structure, which comprises e.g. the following syntax elements (where the OlsInScope variable specifies the layers which are in the scope of the syntax structure):
  • gci_no_gdr_constraint_flag equal to 1 specifies that sps_gdr_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_gdr_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_aps_constraint_flag equal to 1 specifies that there shall be no NAL unit with nal_unit_type equal to PREFIX_APS_NUT or SUFFIX_APS_NUT present in OlsInScope, that sps_ccalf_enabled_flag, sps_lmcs_enabled_flag, sps_scaling_list_enabled_flag, ph_num_alf_aps_ids_luma, ph_alf_cb_enabled_flag, and ph_alf_cr_enabled_flag for all pictures in OlsInScope shall all be equal to 0, and that sh_num_alf_aps_ids_luma, sh_alf_cb_enabled_flag, and sh_alf_cr_enabled_flag for all slices in OlsInScope shall be equal to 0.
  • gci_no_aps_constraint_flag equal to 0 does not impose such a constraint.
  • sps_alf_enabled_flag equal to 1 enables the use of ALF.
  • gci_one_tile_per_pic_constraint_flag equal to 1 specifies that each picture in OlsInScope shall contain only one tile, i.e., the value of NumTilesInPic for each picture shall be equal to 1.
  • gci_one_tile_per_pic_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_joint_cbcr_constraint_flag equal to 1 specifies that sps_joint_cbcr_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_joint_cbcr_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_sao_constraint_flag equal to 1 specifies that sps_sao_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_sao_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_alf_constraint_flag equal to 1 specifies that sps_alf_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_alf_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_ccalf_constraint_flag equal to 1 specifies that sps_ccalf_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_ccalf_constraint_flag equal to 0 does not impose such a constraint.
  • gci_no_lmcs_constraint_flag equal to 1 specifies that sps_lmcs_enabled_flag for all pictures in OlsInScope shall be equal to 0.
  • gci_no_lmcs_constraint_flag equal to 0 does not impose such a constraint.
  • video coding formats may include header syntax structures, such as a sequence header or a picture header.
  • a sequence header may precede any other data of the coded video sequence in the bitstream order.
  • a picture header may precede any coded video data for the picture in the bitstream order.
  • a picture header may be defined as a syntax structure containing syntax elements that apply to all slices of a coded picture. In other words, it contains information that is common for all slices of the coded picture associated with the PH.
  • a picture header syntax structure is specified as an RBSP and is contained in a NAL unit.
  • the presence of a subpicture identifier (a.k.a. subpicture ID) in slice headers is controlled by the SPS syntax element sps_subpic_info_present_flag.
  • when sps_subpic_info_present_flag is equal to 1 in the referenced SPS, a subpicture identifier, i.e. the syntax element sh_subpic_id, is present in the slice header.
  • sh_subpic_id specifies the subpicture identifier of the subpicture that contains the slice.
  • sh_subpic_id is coded as a fixed-length unsigned integer, u(v), whose length is indicated in the referenced SPS.
  • slice headers include the sh_slice_address syntax element, which is the slice index of the slice within the subpicture identified by sh_subpic_id.
  • an SPS or PPS contains a list of subpicture identifier values, i.e. sps_subpic_id[ i ] or pps_subpic_id[ i ], respectively, for i in the range of 0 to the number of subpictures in the subpicture layout minus 1, inclusive. It may be indicated in the SPS that the list of subpicture identifier values is unchanged for the entire coded video sequence. If the SPS allows changes of the list of subpicture identifier values within a coded video sequence, pps_subpic_id[ i ], when present, applies to the picture and specifies the subpicture ID of the i-th subpicture. Decoders may use the list of subpicture identifier values to locate decoded subpictures in the correct position according to the subpicture layout.
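A small sketch of how a decoder or media mixer could map the sh_subpic_id carried in a slice header to a position in the subpicture layout using the SPS- or PPS-level identifier list, with the PPS list overriding the SPS list when present; the function name is illustrative.

```python
from typing import List, Optional

def subpic_index_for_slice(sh_subpic_id: int,
                           sps_subpic_id: List[int],
                           pps_subpic_id: Optional[List[int]] = None) -> int:
    """Return the index (position in the subpicture layout) of the subpicture
    whose identifier matches the slice header's sh_subpic_id."""
    id_list = pps_subpic_id if pps_subpic_id is not None else sps_subpic_id
    return id_list.index(sh_subpic_id)

# A layout with three subpictures whose IDs were remapped in the PPS, e.g. after merging.
print(subpic_index_for_slice(sh_subpic_id=7,
                             sps_subpic_id=[0, 1, 2],
                             pps_subpic_id=[5, 7, 9]))   # -> 1
```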
  • the PPS syntax includes a 1-bit syntax element (i.e. a flag) named pps_mixed_nalu_types_in_pic_flag.
  • pps_mixed_nalu_types_in_pic_flag equal to 1 specifies that each picture referring to the PPS has more than one VCL NAL unit and that the VCL NAL units do not have the same value of nal_unit_type.
  • pps_mixed_nalu_types_in_pic_flag equal to 1 also imposes the following additional constraints: i) there shall be no VCL NAL unit of the picture that has nal_unit_type equal to GDR_NUT;
  • ii) when any VCL NAL unit of the picture has nal_unit_type equal to a value nalUnitTypeA that is equal to IDR_W_RADL, IDR_N_LP, or CRA_NUT, the other VCL NAL units of the picture shall all have nal_unit_type equal to nalUnitTypeA or TRAIL_NUT.
  • pps_mixed_nalu_types_in_pic_flag equal to 0 specifies that each picture referring to the PPS has one or more VCL NAL units and the VCL NAL units of each picture referring to the PPS have the same value of nal_unit_type. All VCL NAL units of a subpicture have the same nal_unit_type value.
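The constraints above can be expressed as a simplified picture-level check; this sketch only covers the rules listed in this description and is not a complete conformance check.

```python
IRAP_TYPES = {"IDR_W_RADL", "IDR_N_LP", "CRA_NUT"}

def check_picture_nalu_types(vcl_types: list, mixed_flag: int) -> bool:
    """Validate the VCL NAL unit types of one coded picture against
    pps_mixed_nalu_types_in_pic_flag (simplified sketch of the constraints)."""
    distinct = set(vcl_types)
    if mixed_flag == 0:
        return len(distinct) == 1
    # mixed_flag == 1: more than one type, no GDR, IRAP only mixed with TRAIL.
    if len(distinct) < 2 or "GDR_NUT" in distinct:
        return False
    irap_in_pic = distinct & IRAP_TYPES
    if irap_in_pic:
        return len(irap_in_pic) == 1 and distinct <= irap_in_pic | {"TRAIL_NUT"}
    return True

print(check_picture_nalu_types(["CRA_NUT", "TRAIL_NUT", "TRAIL_NUT"], mixed_flag=1))  # True
print(check_picture_nalu_types(["CRA_NUT", "RASL_NUT"], mixed_flag=1))                # False
```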
  • picture types and their abbreviations (e.g. IRAP, RADL, RASL, trailing) may act as qualifiers for subpictures, e.g. IRAP subpicture or trailing subpicture.
  • the present embodiments enable merging of multiple video streams in a coded domain into a single video bitstream.
  • the encoded input bitstreams are called “incoming bitstreams” and the merged (output) bitstream that is to be decoded is called “destination bitstream”.
  • the merging may be needed in use cases such as, for example: 1) a Multipoint Control Unit that mixes/merges media content coming from several endpoints; 2) multiple cameras that capture visual content, after which the individual captures are spatially merged into a single picture stream; 3) game streaming; 4) multiple incoming video elementary streams that are merged as subpictures or alike into an output video bitstream.
  • Case 4 in the previous paragraph may apply for example when an overlay video bitstream or alike is merged into the same output bitstream with a "background" video bitstream.
  • An overlay may for example represent an advertisement, and the background video bitstream may represent the actual video content.
  • the merging is done by replacing one or more subpictures of the background video bitstream with one or more subpictures of the overlay video bitstream.
  • the merging is done by including both the background video bitstream and the overlay video bitstream as non-overlapping sets of subpictures in the output video bitstream, and including rendering instructions in or along the output video bitstream to display the overlay on a desired position on top of the background video.
  • FIG. 2 illustrates an example of multipoint video conferencing, comprising the Multipoint Control Unit (MCU) merging the content from different endpoints A, B, C, representing the use case example 1 , above.
  • endpoints A, B, C are user devices, each of which is configured to run an RTP point-to-point session between itself and the MCU.
  • the content that the MCU provides to each participant i.e. endpoints A, B, C
  • Use case example 2 above may relate to e.g. a surveillance system having multiple cameras, where the multiple surveillance cameras connect to a transmission hub that spatially merges the feeds provided by the surveillance cameras.
  • a video stream may be a spatial mix of one or more viewpoints of the game, optionally also supplemented with a camera feed of the player.
  • the merging of multiple encoded video sources may be performed e.g. in a transmitting device or in a server.
  • the merging as discussed in use case example 4, above, can be done e.g. in a client device, e.g. in a video decoding interface implementation that formats the input bitstreams for a single video decoder instance for decoding.
  • This may be a function of “input formatting” in the architecture for MPEG Video Decoding Interface specification, ISO/IEC 23090-13, currently having working draft status.
  • Figure 3 illustrates video decoding engine and interfaces, according to ISO/IEC 23090-13.
  • multiple elementary streams and metadata streams are received as an input by the video decoding engine.
  • the input streams are decoded and formatted into multiple output streams relating to the metadata and the decoded output sequence.
  • the above-described use cases may be alternatively or additionally realized using the NBMP architecture.
  • the use cases above include several input video bitstreams or several video sources, each of which may be regarded as a Media Source in the NBMP architecture. Merging of the input bitstreams spatially into an output bitstream may be regarded as one or more Tasks performed by the Media Processing Entity, and the output bitstream may be considered to be the Media flow to the Media Sink in the NBMP architecture.
  • even when the same device encodes multiple video sources that are intended to be merged into the same destination bitstream, it may be chosen to encode the video sources into separate bitstreams, e.g. due to the following reasons:
  • video encoder implementations that do not provide support for isolated regions.
  • video encoder implementations that do not provide support for multiple subpictures at all or do not provide support for multiple subpictures whose boundaries are treated like picture boundaries.
  • the decoded videos are spatially arranged into an uncompressed video sequence, and the uncompressed video sequence is then encoded.
  • this kind of approach may increase end-to-end delay, and may cause picture quality degradation, since the content is encoded twice (first at the source, and then when spatial composition is encoded).
  • - Encoding needs to be coordinated. For example, it may be necessary to apply the same inter prediction hierarchies and the same reference picture sets when the same picture rate is used in different encoders. Furthermore, it may be required to have the same assignment of picture types in time- aligned pictures between different encodings. Moreover, encoding may need to be constrained so that merging as isolated regions into the same bitstream is possible, which may require e.g. constraining the motion vectors and other constraints similarly to what was described earlier in relation to HEVC MCTS.
  • H.266/VVC relaxes some encoding constraints, which enables merging pictures from different bitstreams as subpictures in a destination bitstream. For example:
    o Subpicture boundaries can be treated like picture boundaries, hence no motion constraints are needed in encoding.
    o Subpictures of different VCL NAL unit types may be mixed (in a constrained manner).
  • H.266/VVC still has constraints that hinder merging pictures from different bitstreams as subpictures in a destination bitstream. For example:
    o All slices of a picture must have the same Temporalld (i.e. represent the same temporal sublayer).
    o All slices of a picture must indicate the same set of reference pictures.
    o Coded pictures must be complete, i.e. coded data for all CTUs of a coded picture must be present, or equivalently each subpicture must be present in the coded picture.
    o A picture has a single picture order count (POC).
  • POC picture order count
  • Sources might not be synchronized, e.g. they might not capture pictures exactly at identical times, might not capture at the same picture rate, might have encoding delay variation, and/or may experience different delay or delay variations in their uplink channel to the media mixer.
  • the present embodiments comprise two main aspects: 1) signaling for coordinating encoders; and 2) operation of a media mixer. These are discussed separately in the following.
  • an entity creates and transmits a control signal to control more than one video encoder.
  • the control signal enables controlling one video encoder providing several separate media streams.
  • the entity that creates and transmits the control signal may for example be the entity that also merges bitstreams created by video encoder(s) in coded domain to another video bitstream.
  • a term “media mixer” is used for describing such entity.
  • the control signal is created so as to enable merging of bitstreams in the coded domain. This requires the encoders to use e.g. the same inter prediction hierarchies and reference picture sets at the same picture rate, the same assignment of picture types, and motion constraints (e.g. HEVC MCTS), to mention a few. Therefore, the control signal may include, but is not limited to, one or more of the following:
  • Reference picture set to be used for all pictures e.g. specified as the number of reference pictures and their differential picture order count values relative to a current picture
  • Signal to use temporally forward prediction only i.e. no temporally bidirectional prediction
  • VCL NAL unit types that are allowed
    o E.g. signal that only IDR (Instantaneous Decoding Refresh), STSA (Stepwise Temporal Sublayer Access) and TRAIL VCL NAL unit types are allowed
  • the PPS (Picture Parameter Set) ID (Identifier) referenced by coded video data.
  • the PPS ID is included in picture headers (as ph_pic_parameter_set_id syntax element);
  • Constraints on which coding tools are disabled and/or which signaling options are used.
  • all coding tools that are controlled in the picture header in a manner that cannot be overridden in the slice header or slice data may be controlled with constraints for encoding.
  • all coding tools that could be adaptively turned on or off and/or adaptively parameterized, e.g. based on the picture content may be controlled with constraints for encoding.
  • constraints for encoding may include, but are not limited to, one or more of the following:
  • LMCS chroma scaling
  • Constraints to use a particular value of joint chroma coding residual sign flag (i.e. ph_joint_cbcr_sign_flag in VVC).
  • Constraints to turn off sample adaptive offset (SAO) filtering, or to use the same indicated SAO parameters.
  • SAO sample adaptive offset
  • Constraint to use a single subpicture in a coded picture and to indicate a subpicture ID value in each slice header, and the subpicture ID value to be included in the slice headers.
  • Constraint to include syntax structures in VCL NAL units where they would otherwise be optional, such as in the VCL NAL unit types of IDR (both with and without RADL pictures) in VVC.
  • control signal is formatted for example as one or more optional media parameters (a.k.a. MIME parameters).
  • an offerer includes the control signal within one or more optional media parameters in SDP offer/answer negotiation.
  • the offerer may comprise or be operationally connected with a media mixer and may provide an initial offer to the answerer.
  • the initial offer may comprise, e.g. for the recvonly direction one or more media formats, which may be constrained by the optional media parameters conveying the control signal.
  • the answerer may comprise or be operationally connected with a source video encoder. The answerer may respond to the initial offer by selecting one of the offered media formats in the initial offer.
  • an answerer includes the control signal within one or more optional media parameters in SDP offer/answer negotiation.
  • the offerer may comprise or be operationally connected with a source video encoder and may provide an initial offer to the answerer.
  • the initial offer may comprise one or more media formats e.g. for the sendonly direction.
  • the answerer may comprise or be operationally connected with a media mixer.
  • the answerer may respond to the initial offer by including, e.g. for the recvonly direction, one or more media formats of the initial offer constrained by the optional media parameters conveying the control signal.
  • the media parameters carrying the control signal may comprise one or more of the following. It needs to be understood that embodiments similarly apply to any variations of the described media parameters, such as using whitespace-separated lists instead of comma-separated lists, or base16 (hexadecimal) representation instead of base10 or base64 representation.
  • ref-pics: a comma-separated list of differential picture order count values relative to a current picture, each list entry identifying a reference picture that may be used as reference for predicting the current picture. Pictures that are not identified by the ref-pics list are not used as reference for the current picture.
  • the list entries are signed integers in base10 (decimal) format.
  • temporally-forward-prediction: when present, indicates that the decoding order is the same as output order and only temporally forward prediction is applied.
  • poc-tick-time: when present, indicates that the picture order count value for each picture in a coded video sequence (CVS) that is not the first picture in the CVS, in decoding order, is proportional to the output time of the picture relative to the output time of the first picture in the CVS.
  • the parameter has one unsigned integer value in base10 (decimal) format, which specifies the number of clock ticks of the 90 kHz clock that each picture order count difference of 1 corresponds to.
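A small sketch of deriving POC values consistent with such a poc-tick-time parameter, where a POC difference of 1 corresponds to the indicated number of 90 kHz clock ticks; the function name is illustrative only.

```python
def poc_from_output_time(output_time_s: float, first_output_time_s: float,
                         poc_tick_time: int) -> int:
    """POC of a picture so that a POC difference of 1 corresponds to
    poc_tick_time ticks of a 90 kHz clock (the first picture gets POC 0)."""
    ticks = round((output_time_s - first_output_time_s) * 90000)
    return ticks // poc_tick_time

# 30 fps content: one picture interval equals 3000 ticks of the 90 kHz clock.
for n in range(4):
    print(poc_from_output_time(n / 30.0, 0.0, poc_tick_time=3000))  # 0 1 2 3
```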
  • pps-id: a comma-separated list of PPS ID values that may be referenced by coded video data. PPS ID values that are not among the pps-id list are not present in the stream.
  • the list entries are unsigned integers in base10 (decimal) format.
  • subpicture-id: when present, indicates that each picture contains a single subpicture only.
  • the parameter has one unsigned integer value in base10 (decimal) format, which specifies the subpicture identifier value in all slices of the stream.
  • picture-header: a base64 representation of the picture header syntax structure; the indicated picture header is used in all coded pictures with the exception of certain syntax elements, such as syntax element(s) indicating the POC value for a picture (e.g. ph_pic_order_cnt_lsb in H.266/VVC).
  • vcl-nal-unit-types: a comma-separated list of NAL unit type values, as specified in the coding format (e.g. nal_unit_type values of VVC), that may be present in the stream.
  • VCL NAL units with a NAL unit type that is not included among the vcl-nal-unit-types list are not present in the stream.
  • the list entries are unsigned integers in base10 (decimal) format.
  • alf: may have one of the following values: "off" indicating that ALF is not in use; "default" indicating that only the default filters pre-defined in a coding standard (e.g. VVC) are in use, i.e. the coded video data does not reference any ALF APS; or a comma-separated list of unsigned integer values in base10 (decimal) format, each providing an ALF APS ID value that may be referenced by the coded video data, and no ALF APS with an ID not among the list is referenced by the coded video data.
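A sketch of serializing such a control signal into an SDP a=fmtp line carrying the optional media parameters described above; the payload type and the example parameter values are assumptions for illustration, and a real implementation would follow the registered payload format of the codec in use.

```python
def control_signal_to_fmtp(payload_type: int, params: dict) -> str:
    """Format the control-signal media parameters as an SDP a=fmtp line.
    List-valued parameters become comma-separated lists; None marks a
    flag-like parameter that is present without a value."""
    parts = []
    for name, value in params.items():
        if value is None:
            parts.append(name)
        elif isinstance(value, (list, tuple)):
            parts.append(f"{name}={','.join(str(v) for v in value)}")
        else:
            parts.append(f"{name}={value}")
    return f"a=fmtp:{payload_type} " + ";".join(parts)

print(control_signal_to_fmtp(96, {
    "ref-pics": [-1, -2, -3, -4],
    "temporally-forward-prediction": None,
    "poc-tick-time": 1500,
    "pps-id": [0],
    "vcl-nal-unit-types": [0, 7, 8],   # example values; actual type codes depend on the codec
    "alf": "off",
}))
```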
  • control signals to the encoders can comprise, for example:
  • Figure 4 illustrates examples of encoding structures.
  • the examples of the encoding structure in Figure 4 comprise 60-Hz hierarchical encoding and 60-Hz “linear” encoding, where only the delta POC -1 is an active reference picture, but delta POC -2, -3 and -4 are included as inactive entries in RPL(s).
  • the examples of the encoding structure in Figure 4 also comprise 30-Hz hierarchical encoding and 30-Hz “linear” encoding, where only the delta POC -2 is an active reference picture, but delta POC -1, -3 and -4 are included as inactive entries in RPL(s).

2. Operation of a media mixer
  • the term “media mixer” is used for any entity that merges bitstreams created by video encoders in the coded domain into another video bitstream. It is, however, appreciated that the term “media mixer” is selected for its descriptive nature.
  • a media mixer may operate in various entities, including but not limited to the following:
  • the media mixer can be included in the originating endpoint device that generates and/or captures the bitstreams, but also encodes them. This could be beneficial e.g. in the game streaming use case.
  • the media mixer can be included in a network entity separate from any of the originating or destination endpoint devices.
  • the media mixer may run in a multipoint control unit.
  • the media mixer may be included in a Media Processing Entity (MPE) and/or may comprise one or more Tasks in the NBMP architecture.
  • MPE Media Processing Entity
  • the media mixer can be included in a destination endpoint device that - after having received the bitstream - also decodes the bitstream and may render or display the decoded video resulting from the bitstream.
  • a media mixer may support any transport protocols, such as RTP and HTTP, and any session description and/or initialization protocols, such as SDP.
  • a media mixer may be an RTP mixer.
  • the media mixer merges input bitstreams (a.k.a. incoming bitstreams) into one output bitstream (a.k.a. destination bitstream).
  • Each input bitstream corresponds to a certain subpicture in the subpicture layout of the output bitstream, or in other words, to a sequence of collocated subpictures within the coded pictures of the output bitstream.
  • the operation of a media mixer includes the following embodiments: POC adjustment; and Handling of “missing” pictures in incoming bitstreams.
  • POC adjustment may be needed by the media mixer, since POC values of the incoming bitstreams may not be aligned.
  • the reason(s) for not having aligned POC values may include, but is not necessarily limited to, one or more of the following:
  • IDR (or other IRAP) pictures may occur at different positions, and an IDR (or other IRAP) picture may reset the POC values;
  • the media mixer assigns POC values into the destination bitstream. In other words, it overwrites the POC values of the incoming bitstreams.
  • control signals enable the POC value overwriting, since the reference picture set and the decoding order of pictures with respect to output order are invariant in all pictures of the incoming bitstreams.
  • the incoming bitstreams can be merged as subpictures in a destination bitstream.
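A simplified sketch of the POC overwriting performed by the media mixer, assuming the incoming bitstreams have already been aligned picture-by-picture (which the control signal is intended to make possible); all names are illustrative.

```python
def assign_destination_pocs(incoming_pocs_per_stream: dict) -> list:
    """Merge time-aligned pictures from several incoming bitstreams into
    destination pictures, overwriting POC values with a single running
    destination POC (sketch; assumes the streams are already aligned
    picture-by-picture)."""
    streams = list(incoming_pocs_per_stream)
    n_pictures = min(len(v) for v in incoming_pocs_per_stream.values())
    merged = []
    for dest_poc in range(n_pictures):
        subpics = {s: {"incoming_poc": incoming_pocs_per_stream[s][dest_poc],
                       "destination_poc": dest_poc} for s in streams}
        merged.append(subpics)
    return merged

# Stream A restarted its POC numbering at an IDR; stream B did not.
print(assign_destination_pocs({"A": [10, 11, 0, 1], "B": [4, 5, 6, 7]}))
```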
  • a picture may appear missing from an incoming bitstream from media mixer’s perspective e.g. due to the following reasons:
  • a picture from an incoming bitstream may be delayed e.g. due to transmission delay variation
  • a media mixer is configured to include a “not-coded” subpicture into the destination bitstream when there is no corresponding picture available in the corresponding incoming bitstream.
  • the “not-coded” subpicture may for example be
  • an IDR subpicture (e.g. with a constant color such as mid-gray). This may be used particularly if the destination picture is the first picture of the destination bitstream.
  • Figure 5 illustrates an example where the source encoders are controlled like in the first aspect, but start encoding at different times. Both encoders use only one sublayer, different prediction hierarchies, and indicate reference picture lists containing POCs -1 , -2, -3 and -4, but different pictures being the active entries.
  • the media mixer may handle this by creating “not-coded” subpictures as described above and as illustrated in Figure 6.
  • “i” refers to a not-coded IDR subpicture
  • “t” refers to not-coded trailing subpicture.
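A sketch of the selection logic for a subpicture slot of a destination picture when the corresponding incoming picture is missing; the strings stand in for actual coded data and are illustrative only.

```python
def subpicture_for_slot(incoming_picture, is_first_destination_picture: bool) -> str:
    """Choose what to place in one subpicture slot of the destination picture."""
    if incoming_picture is not None:
        return "coded subpicture from the incoming bitstream"
    if is_first_destination_picture:
        # Nothing to copy from yet: emit a not-coded IDR subpicture,
        # e.g. a constant mid-gray area.
        return "not-coded IDR subpicture (i)"
    # Otherwise copy the collocated subpicture of the previous destination picture.
    return "not-coded trailing subpicture (t)"

# Incoming bitstream B starts two pictures later than bitstream A.
arrivals = {"A": ["A0", "A1", "A2"], "B": [None, None, "B0"]}
for idx in range(3):
    for stream, pics in arrivals.items():
        print(idx, stream, subpicture_for_slot(pics[idx], is_first_destination_picture=(idx == 0)))
```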
  • when a coded picture in a destination bitstream has mixed VCL NAL unit types and the previous PPS in the destination bitstream with the referenced PPS ID indicates no mixing, or vice versa, the PPS is rewritten (with pps_mixed_nalu_types_in_pic_flag set to the opposite value compared to that in the previous PPS with that PPS ID) and included in the destination bitstream.
  • the respective pictures in incoming bitstreams may be labeled with a different Temporalld (TID).
  • TID Temporalld
  • all slices of a picture in the destination bitstream must have the same TID. This issue could be partly avoided by signalling the picture prediction hierarchy and the sublayer hierarchy to be used as a part of the control signal.
  • since encoders might start encoding at different times or might reset the use of the hierarchies when there is a scene cut (which may appear at different times in different source video clips), this embodiment may be helpful.
  • the media mixer is configured to assign all VCL NAL units to sublayer 0, and rewrites VPS(s) and SPS(s) accordingly.
  • two pictures of an incoming encoded bitstream may arrive in the media mixer within an interval where only one subpicture is created to a destination bitstream.
  • the picture in the higher temporal sublayer is omitted. Subsequent pictures of that higher temporal sublayer are not included in the destination bitstream until there is an IRAP picture or an STSA picture in that higher temporal sublayer in the incoming bitstream.
  • the media mixer requests a refresh of the temporal sublayer from which the picture has been omitted.
  • temporal sublayer refresh request can be added among the Codec Control Messages.
  • an encoder receives a temporal sublayer refresh request and, in response to the request, encodes a sublayer access picture (e.g. STSA in VVC).
  • a sublayer access picture, e.g. STSA in VVC.
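A sketch of the mixer-side bookkeeping for this behavior: once a picture of a sublayer has been dropped, that sublayer stays blocked until a suitable refresh picture (e.g. an STSA or IRAP picture) arrives, optionally after the mixer has requested a sublayer refresh. The class name and the exact set of refresh types are illustrative assumptions.

```python
class SublayerGate:
    """Tracks, per incoming bitstream, whether a higher temporal sublayer is
    currently usable in the destination bitstream (sketch)."""

    def __init__(self):
        self.blocked_tids = set()

    def on_picture_dropped(self, tid: int):
        # A picture of sublayer `tid` was omitted; block the sublayer and
        # (in a real system) send a temporal sublayer refresh request.
        self.blocked_tids.add(tid)

    def can_forward(self, tid: int, nal_type: str) -> bool:
        if tid in self.blocked_tids and nal_type in ("STSA", "IDR", "CRA"):
            self.blocked_tids.discard(tid)   # refresh point reached
            return True
        return tid not in self.blocked_tids

gate = SublayerGate()
gate.on_picture_dropped(tid=1)
print(gate.can_forward(1, "TRAIL"))  # False: sublayer 1 still blocked
print(gate.can_forward(1, "STSA"))   # True: the STSA re-enables the sublayer
print(gate.can_forward(1, "TRAIL"))  # True afterwards
```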
  • the media mixer may additionally perform one or more of the following:
  • the method generally comprises creating 710 a control signal for controlling encoding of a bitstream; transmitting 720 the control signal to more than one encoders; receiving 730 a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging 740 encoded input bitstreams into one output bitstream; and providing 750 the output bitstream comprising the merged encoded input bitstreams for decoding.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for creating a control signal for controlling encoding of a bitstream; means for transmitting the control signal to more than one encoders; means for receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; means for merging encoded input bitstreams into one output bitstream; and means for providing the output bitstream comprising the merged encoded input bitstreams for decoding.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
  • the data processing system comprises a main processing unit 100, a memory 102, a storage device 104, an input device 106, an output device 108, and a graphics subsystem 110, which are all connected to each other via a data bus 112.
  • the main processing unit 100 is a conventional processing unit arranged to process data within the data processing system.
  • the main processing unit 100 may comprise or be implemented as one or more processors or processor circuitry.
  • the memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art.
  • the memory 102 and storage device 104 store data in the data processing system 100.
  • Computer program code resides in the memory 102 for implementing, for example a method as illustrated in a flowchart of Figure 8 according to various embodiments.
  • the input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display.
  • the data bus 112 is a conventional data bus and while shown as a single line it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example Internet tablet computer.
  • Figure 10 illustrates an example of a video encoder, where In: Image to be encoded; P'n: Predicted representation of an image block; Dn: Prediction error signal; D'n: Reconstructed prediction error signal; I'n: Preliminary reconstructed image; R'n: Final reconstructed image; T, T⁻¹: Transform and inverse transform; Q, Q⁻¹: Quantization and inverse quantization; E: Entropy encoding; RFM: Reference frame memory; Pinter: Inter prediction; Pintra: Intra prediction; MS: Mode selection; F: Filtering.
  • Figure 11 illustrates a block diagram of a video decoder
  • P'n Predicted representation of an image block
  • D'n Reconstructed prediction error signal
  • I'n Preliminary reconstructed image
  • R'n Final reconstructed image
  • T⁻¹ Inverse transform
  • Q⁻¹ Inverse quantization
  • E⁻¹ Entropy decoding
  • RFM Reference frame memory
  • P Prediction (either inter or intra)
  • F Filtering.
  • An apparatus according to an embodiment may comprise only an encoder or a decoder, or both.
  • the various embodiments may provide advantages. For example, compared to decoding and re-encoding in a media mixer, the present embodiments provide lower delay, since they avoid decoding and re-encoding delays, and require less computational and memory resources in the media mixer. Consequently, the media mixer can support a higher number of endpoints. The media mixer only needs to rewrite parameter sets (SPS, PPS), and no rewriting of VCL NAL units is needed. In addition, the present embodiments provide better picture quality, since there is no picture quality degradation caused by re-encoding.
  • the present embodiments require only one decoder instance as opposed to a decoder instance per each incoming bitstream.
  • the embodiments require less total bandwidth in the network and fewer connections/sessions compared to transmitting each source bitstream to each endpoint (illustrated in Figure 7).
  • the source bitstream is sent only as one copy to the MCU and the MCU forwards only one bitstream to each endpoint. It is to be noticed that the present embodiments could be combined with this architecture without an MCU by performing the media mixing locally in each endpoint.
  • Some embodiments were described with reference to an entity that both creates a control signal for controlling encoding of a bitstream and merges encoded input bitstreams into one output bitstream. It needs to be understood that embodiments similarly apply when a first entity creates a control signal for controlling encoding of a bitstream and a second entity merges encoded input bitstreams into one output bitstream, wherein the first entity and the second entity are coordinated by a third entity.
  • the third entity may be an NBMP Source, which controls a Media Processing Entity (MPE) through an NBMP Workflow Manager, wherein the MPE contains one or more Task(s) that merge encoded input bitstreams into one output bitstream and create a control signal for controlling a Media Source that encodes a bitstream that is provided as input to the MPE.
  • MPE Media Processing Entity
  • NBMP Workflow Manager
  • Some embodiments were described in relation to SDP and/or SDP offer/answer negotiation. It needs to be understood that embodiments apply similarly to other media description formats or protocols and/or other session negotiation models or protocols. Some embodiments were described in relation to H.266/VVC and its coding tools or features. It needs to be understood that embodiments apply similarly to other video codecs and formats, and/or other coding tools and features. For example, some embodiments described with reference to H.266/VVC subpictures apply to image segments constrained similarly to H.266/VVC subpictures.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
  • the computer program code comprises one or more operational characteristics.
  • Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises creating a control signal for controlling encoding of a bitstream; transmitting the control signal to more than one encoders; receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging encoded input bitstreams into one output bitstream; and providing the output bitstream comprising the merged encoded input bitstreams for decoding.
  • a computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments relate to a method and technical equipment, wherein the method comprises creating a control signal for controlling encoding of a bitstream (710); transmitting the control signal to more than one encoders (720); receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal (730); merging encoded input bitstreams into one output bitstream (740); and providing the output bitstream comprising the merged encoded input bitstreams for decoding (750).

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING/DECODING
Technical Field
The present solution generally relates to video encoding and video decoding.
Background
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
A video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
Now there has been invented an improved method and technical equipment implementing the method. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided a method comprising creating a control signal for controlling encoding of a bitstream; transmitting the control signal to more than one encoders; receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging encoded input bitstreams into one output bitstream; and providing the output bitstream comprising the merged encoded input bitstreams for decoding.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to create a control signal for controlling encoding of a bitstream; to transmit the control signal to more than one encoders; to receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal; to merge encoded input bitstreams into one output bitstream; and to provide the output bitstream comprising the merged encoded input bitstreams for decoding.
According to a third aspect, there is provided an apparatus comprising at least means for creating a control signal for controlling encoding of a bitstream; means for transmitting the control signal to more than one encoder; means for receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; means for merging encoded input bitstreams into one output bitstream; and means for providing the output bitstream comprising the merged encoded input bitstreams for decoding.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to create a control signal for controlling encoding of a bitstream; to transmit the control signal to more than one encoder; to receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal; to merge encoded input bitstreams into one output bitstream; and to provide the output bitstream comprising the merged encoded input bitstreams for decoding.

According to an embodiment, an input bitstream corresponds to a subpicture in the output bitstream.
According to an embodiment, the control signal comprises one or more of the following (see the sketch after this list):
- reference picture set for all pictures;
- signal to use temporally forward prediction only;
- signal the picture prediction hierarchy;
- relation of picture order differences to picture time interval;
- allowed picture types or VCL NAL unit types;
- picture width and height;
- PPS identifier value referenced by coded video data;
- subpicture identifier value to be included in slice headers; and
- constraints on which coding tools are disabled.
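By way of a non-normative illustration, the control signal may be collected into a simple data structure whose fields mirror the items in the list above; the field names and types below are hypothetical and do not correspond to any particular signalling syntax.

```python
# Hypothetical container for the control signal items listed above (illustrative only).
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EncodingControlSignal:
    reference_picture_set: Optional[List[int]] = None   # reference picture set used for all pictures
    forward_prediction_only: bool = False                # use temporally forward prediction only
    prediction_hierarchy: Optional[List[int]] = None     # picture prediction hierarchy, e.g. sublayer per picture
    poc_diff_per_time_interval: Optional[int] = None     # relation of picture order differences to picture time interval
    allowed_nal_unit_types: Optional[List[str]] = None   # allowed picture types or VCL NAL unit types
    picture_width: Optional[int] = None
    picture_height: Optional[int] = None
    pps_id: Optional[int] = None                         # PPS identifier value referenced by coded video data
    subpicture_id: Optional[int] = None                  # subpicture identifier value to include in slice headers
    disabled_coding_tools: Tuple[str, ...] = ()          # constraints on which coding tools are disabled
```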
According to an embodiment, the apparatus is caused to assign picture order count values in the output bitstream by overwriting the picture order count values of the input bitstream.
According to an embodiment, the apparatus is caused to include a not-coded subpicture in the output bitstream when there is no corresponding picture in the corresponding input bitstream.
According to an embodiment, the not-coded subpicture may be one of the following: an IDR subpicture, or a trailing subpicture encoded such that it is reconstructed by copying the collocated subpicture from the previous picture.
According to an embodiment, the apparatus is further caused to signal a picture prediction hierarchy and a sublayer hierarchy to be used as a part of the control signal.
According to an embodiment, the apparatus is further caused to determine that an encoded media stream comprises multiple temporal sublayers, whereupon the apparatus is caused to omit a picture of a higher temporal sublayer from the output bitstream and to request a refresh of the temporal sublayer from which the picture was omitted.
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an NBMP architecture for use in accordance with example embodiments of the invention;
Fig. 2 shows an example of multipoint video conferencing;
Fig. 3 shows an example of video decoding Engine and interfaces;
Fig. 4 shows examples of encoding structures;
Fig. 5 shows an example of incoming bitstreams;
Fig. 6 shows an example of a destination bitstream;
Fig. 7 shows an example of multipoint conferencing without an MCU;
Fig. 8 is a flowchart illustrating a method according to an embodiment;
Fig. 9 shows an apparatus according to an embodiment;
Fig. 10 shows an encoding process according to an embodiment; and
Fig. 11 shows a decoding process according to an embodiment.
Embodiments
In the following, several embodiments will be described in the context of one video coding arrangement. The video coding arrangement can be part of a system enabling streamable and participatable events. An example of such a system is a video conferencing system, but it is to be noted, however, that the present embodiments are not necessarily limited to such systems, but can be applied in other systems where content from multiple video sources is to be transmitted to end users. In the present embodiments a sender is a video source, such as a camera, or a system being operatively connected to a video source or comprising means to record video. In addition to video, the sender is configured to record other content as well, such as audio. The purpose of the sender is to stream the video being recorded to the receiver(s). The sender may also comprise means for receiving at least audio data from the receiver(s) and for outputting the received audio data. The system discussed in the present description may comprise multiple senders and multiple receivers. A device participating in the system may act as a sender or a receiver or as both.
The content being streamed or transmitted is delivered between devices through a transmission channel or a communication channel. Such a channel may be a physical transmission medium, such as a wire, or a logical connection over a multiplexed medium.
Real-time Transport Protocol (RTP) is widely used for real-time transport of timed media such as audio and video. RTP may operate on top of the User Datagram Protocol (UDP), which in turn may operate on top of the Internet Protocol (IP). RTP is specified in Internet Engineering Task Force (IETF) Request for Comments (RFC) 3550, available from www.ietf.org/rfc/rfc3550.txt. In RTP transport, media data is encapsulated into RTP packets. Typically, each media type or media coding format has a dedicated RTP payload format.
An RTP session is an association among a group of participants communicating with RTP. It is a group communications channel which can potentially carry a number of RTP streams. An RTP stream is a stream of RTP packets comprising media data. An RTP stream is identified by an SSRC belonging to a particular RTP session. SSRC refers to either a synchronization source or a synchronization source identifier that is the 32-bit SSRC field in the RTP packet header. A synchronization source is characterized in that all packets from the synchronization source form part of the same timing and sequence number space, so a receiver device may group packets by synchronization source for playback. Examples of synchronization sources include the sender of a stream of packets derived from a signal source such as a microphone or a camera, or an RTP mixer. Each RTP stream is identified by an SSRC that is unique within the RTP session. Real-time control protocol (RTCP) enables monitoring of the data delivery in a manner scalable to large multicast networks and provides minimal control and identification functionality. An RTCP stream accompanies an RTP stream. RTCP sender report (SR) packets are sent from the sender to the receiver (i.e., in the same direction as the media in the respective RTP stream). RTCP receiver report (RR) packets are sent from the receiver to the sender.
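As a minimal sketch of how a receiver may separate the RTP streams of a session, the following reads the 32-bit SSRC from the RTP fixed header defined in RFC 3550; the function name is an illustrative choice, not part of any API.

```python
import struct

def rtp_ssrc(packet: bytes) -> int:
    """Return the 32-bit SSRC carried in bytes 8..11 of the RTP fixed header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte RTP fixed header")
    if packet[0] >> 6 != 2:                 # version field must be 2 for RTP
        raise ValueError("not an RTP version 2 packet")
    (ssrc,) = struct.unpack_from("!I", packet, 8)
    return ssrc

# A receiver could keep one packet queue per SSRC, i.e. one queue per RTP stream:
# queues.setdefault(rtp_ssrc(packet), []).append(packet)
```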
A point-to-point RTP session consists of two endpoints, communicating using unicast. Both RTP and RTCP traffic are conveyed endpoint to endpoint.
Many multipoint audio-visual conferences operate utilizing a centralized unit called Multipoint Control Unit (MCU). An MCU may implement the functionality of an RTP translator or an RTP mixer. An RTP translator may be a media translator that may modify the media inside the RTP stream. A media translator may for example decode and re-encode the media content (i.e. transcode the media content). An RTP mixer is a middlebox that aggregates multiple RTP streams that are part of a session by generating one or more new RTP streams. An RTP mixer may manipulate the media data. One common application for a mixer is to allow a participant to receive a session with a reduced amount of resources compared to receiving individual RTP streams from all endpoints. A mixer can be viewed as a device terminating the RTP streams received from other endpoints in the same RTP session. Using the media data carried in the received RTP streams, a mixer generates derived RTP streams that are sent to the receiving endpoints.
The Session Description Protocol (SDP) may be used to convey media details, transport addresses, and other session description metadata, when initiating multimedia teleconferences, voice-over-IP calls, or other multimedia delivery sessions.
The offer/answer model specifies a mechanism with which endpoints achieve a common operating point of media details and other session description metadata when initiating the multimedia delivery session. One endpoint, the offerer, sends a session description (the offer) to the other endpoint, the answerer. The offer contains all the media parameters needed to exchange media with the offerer, including codecs, transport addresses, and protocols to transfer media. When the answerer receives an offer, it elaborates an answer and sends it back to the offerer. The answer contains the media parameters that the answerer is willing to use for that particular session. SDP may be used as the format for the offer and the answer.
An initial SDP offer includes zero or more media streams, wherein each media stream is described by an "m=" line and its associated attributes. Zero media streams implies that the offerer wishes to communicate, but that the streams for the session will be added at a later time through a modified offer.
A direction attribute may be used in the SDP offer/answer model as follows. If the offerer wishes to only send media on a stream to its peer, it marks the stream as sendonly with the "a=sendonly" attribute. If the offerer wishes to only receive media from its peer, it marks the stream as recvonly. If the offerer wishes to both send and receive media with its peer, it may include an "a=sendrecv" attribute in the offer, or it may omit it, since sendrecv is the default.
In the SDP offer/answer model, the list of media formats for each media stream comprises the set of formats (codecs and any parameters associated with the codec, in the case of RTP) that the offerer is capable of sending and/or receiving (depending on the direction attributes). If multiple formats are listed, it means that the offerer is capable of making use of any of those formats during the session and thus the answerer may change formats in the middle of the session, making use of any of the formats listed, without sending a new offer. For a sendonly stream, the offer indicates those formats the offerer is willing to send for this stream. For a recvonly stream, the offer indicates those formats the offerer is willing to receive for this stream. For a sendrecv stream, the offer indicates those codecs or formats that the offerer is willing to send and receive with. The list of media formats in the "m=" line is listed in order of preference, the first entry in the list being the most preferred.
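By way of a hypothetical example, the following sketch assembles a minimal SDP offer for one video stream, showing the "m=" line with payload formats in order of preference, a direction attribute, and an "a=fmtp" line with optional media parameters; the addresses, payload type numbers, payload names and parameters are illustrative only.

```python
# Illustrative SDP offer; all values are examples, not a recommended configuration.
offer = "\r\n".join([
    "v=0",
    "o=- 0 0 IN IP4 192.0.2.1",        # example originator address (documentation range)
    "s=-",
    "t=0 0",
    "m=video 49170 RTP/AVP 96 97",      # format 96 listed first, i.e. most preferred
    "c=IN IP4 192.0.2.1",
    "a=rtpmap:96 H266/90000",           # hypothetical payload names and clock rate
    "a=rtpmap:97 H265/90000",
    "a=fmtp:96 profile-id=1",           # hypothetical optional media parameters
    "a=sendrecv",                       # offerer wishes to both send and receive
]) + "\r\n"
```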
SDP may be used for declarative purposes, e.g. for describing a stream available to be received over a streaming session. For example, SDP may be included in Real Time Streaming Protocol (RTSP).
A Multipurpose Internet Mail Extension (MIME) is an extension to an email protocol which makes it possible to transmit and receive different kinds of data files on the Internet, for example video, audio, images, and software. An internet media type is an identifier used on the Internet to indicate the type of data that a file contains. Such internet media types may also be called content types. Several MIME type/subtype combinations exist that can contain different media formats. Content type information may be included by a transmitting entity in a MIME header at the beginning of a media transmission. A receiving entity thus may need to examine the details of such media content to determine if the specific elements can be rendered given an available set of codecs. Especially when the end system has limited resources, or the connection to the end system has limited bandwidth, it may be helpful to know from the content type alone if the content can be rendered.
One of the original motivations for MIME is the ability to identify the specific media type of a message part. However, due to various factors, it is not always possible from looking at the MIME type and subtype to know which specific media formats are contained in the body part or which codecs are indicated in order to render the content. Optional media parameters may be provided in addition to the MIME type and subtype to provide further details of the media content.
Optional media parameters may be conveyed in SDP, e.g. using the "a=fmtp" line of SDP. Optional media parameters may be specified to apply for certain direction attribute(s) with an SDP offer/answer and/or for declarative purposes. Optional media parameters may be specified not to apply for certain direction attribute(s) with an SDP offer/answer and/or for declarative purposes. Semantics of optional media parameters may depend on and may differ based on which direction attribute(s) of an SDP offer/answer they are used with and/or whether they are used for declarative purposes.
The Network Based Media Processing (NBMP), ISO/IEC 23090-8, is a standard, which was developed by the Moving Picture Experts Group (MPEG). NBMP defines a framework that enables initializing and controlling media processing in the network. The NBMP framework defines the interfaces including both data formats and application programming interfaces (APIs) among the entities connected through the digital networks for media processing. Users can access and configure their operations remotely for efficient processing. The framework describes and manages workflows to be applied to the media data. This process includes uploading of media data to the network, instantiation of the media processing tasks, and configuration of the tasks. The framework enables dynamic creation of media processing pipelines, access of processed media data and metadata in real-time or in a deferred way. Figure 1 shows an NBMP architecture for use in accordance with example embodiments of the invention. As shown in Figure 1 there is an NBMP source 101 providing an NBMP workflow API with a workflow description 102 to an NBMP workflow manager 105. As shown in Figure 1 the NBMP workflow manager 105 is processing the NBMP workflow API with a function repository 110, and the NBMP source 101 is also exchanging a function discovery API and function description 104 with the Function repository 110. Then as shown in Figure 1 the NBMP workflow manager 105 provides to the media processing entity (MPE) 112 the NBMP task API 108 including task configuration and reporting the current task status. As shown in Figure 1 the media processing entity (MPE) 112 processes the media flow 117 from the media source 115 using a task 1 and task 2 configuration 111. Then as shown in Figure 1 a media flow 119 is output towards the media sink 120. As shown in Figure 1 the operations at 102, 104, 106, and 108 include control flow operations, and the operations 117, 111, and 119 include data flow operations.
Network-Based Media Processing (NBMP) enables offloading media processing tasks to a network-based environment, such as cloud computing environments. NBMP processing relies on a Workflow Manager, which can be virtualized, to start and control media processing. The Workflow Manager receives a Workflow Description from the NBMP Source, which instructs the Workflow Manager about the desired processing and the input and output formats to be taken and produced, respectively.
The workflow manager (the Manager) creates a workflow based on the workflow description document (WDD) that it receives from the NBMP Source. The workflow manager selects and deploys the NBMP Functions into selected Media Processing Entities and then performs the configuration of the tasks. The WDD can include a number of logic descriptors.
The NBMP can define APIs and formats such as Function templates and Workflow Description Document (WDD) consisting of a number of logic descriptors. NBMP uses the so-called descriptors as the basic elements for all its resource documents such as the workflow documents, task documents, and function documents. Descriptors are a group of NBMP parameters which describe a set of related characteristics of Workflow, Function or Task. Some key descriptors are General, Input, Output, Processing, Requirements, Configuration etc. In order to hide workflow internal details from the NBMP Source, all updates to the workflow are performed through the Workflow Manager. The manager is the single point of access for the creation or change of any workflows. Workflows represent the processing flows defined in the WDD provided by the NBMP Source (a.k.a. the client). A workflow can be defined as a chain of tasks, specified by the “connection-map” Object in the Processing Descriptor of the WDD.
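As a simplified, non-normative sketch, a workflow description may be viewed as a set of descriptors of the kind named above; the key names and values below are abbreviated and hypothetical rather than the exact NBMP JSON schema.

```python
# Simplified, hypothetical workflow description document (WDD) sketch.
wdd = {
    "general":    {"id": "merge-workflow", "name": "merge incoming video bitstreams"},
    "input":      {"media-parameters": [{"stream-id": "in1"}, {"stream-id": "in2"}]},
    "output":     {"media-parameters": [{"stream-id": "merged"}]},
    "processing": {
        # the connection-map chains the tasks of the workflow
        "connection-map": [
            {"from": {"id": "repackage-task"}, "to": {"id": "merge-task"}},
        ],
    },
    "requirements": {"delay": "low"},
}
# The NBMP Source would pass such a document to the Workflow Manager through the
# Workflow API; the Manager then deploys Functions as Tasks and configures them.
```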
The Workflow Manager may use pre-determined implementations of media processing functions and use them together to create the media processing workflow. NBMP defines a Function Discovery API that it uses with a Function Repository to discover and load the desired Functions.
A Function, once loaded, becomes a Task, which is then configured by the Workflow Manager through the Task API and can start processing incoming media. It is noted that a cloud and/or network service providers can define their own APIs to assign computing resources to their customers.
The Advanced Video Coding standard (which may be abbreviated AVC or H.264/AVC) was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 also known as MPEG-4 Part 10 Advanced Video Coding (AVC). There have been multiple versions of the H.264/AVC standard, each integrating new extensions or features to the specification. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
The High Efficiency Video Coding standard (which may be abbreviated HEVC or H.265/HEVC) was developed by the Joint Collaborative Team - Video Coding (JCT- VC) of VCEG and MPEG. The standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). Extensions to H.265/HEVC include scalable, multiview, three-dimensional, and fidelity range extensions, which may be referred to as SHVC, MV-HEVC, 3D-HEVC, and REXT respectively. The references in this description to H.265/HEVC, SHVC, MV-HEVC, 3D-HEVC and REXT that have been made for the purpose of understanding definitions, structures or concepts of these standard specifications are to be understood to be references to the latest versions of these standards that were available before the date of this application, unless otherwise indicated.
The Versatile Video Coding standard (VVC, H.266 or H.266/VVC) is developed by the Joint Video Experts Team (JVET), which is a collaboration between the ISO/IEC MPEG and ITU-T VCEG.
Some key definitions, bitstream and coding structures, and concepts of H.264/AVC, HEVC, and VVC, and some of their extensions are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented. Some of the key definitions, bitstream and coding structures and concepts may be common among some video coding standards - hence, they may be described below jointly. The aspects of various embodiments are not limited to H.264/AVC, HEVC, VVC, or their extensions, but rather the description is given for one possible basis on top of which the present embodiments may be partly or fully realized.
A video codec may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. The compressed representation may be referred to as a bitstream or a video bitstream. A video encoder and/or a video decoder may also be separate from each other, i.e. need not form a codec. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
A bitstream may be defined as a sequence of bits, which may in some coding formats or standards be in the form of a Network Abstraction Layer (NAL) unit stream or a byte stream, that forms the representation of coded pictures and associated data forming one or more coded video sequences.
A video decoding process may be specified for a bitstream, conforming to certain bitstream syntax and constraints, which may be specified e.g. in a coding standard. Consequently, the decoding of two bitstreams may be completely independent of each other.
A first bitstream may be followed by a second bitstream in the same logical channel, such as in the same file or in the same connection of a communication protocol. An elementary stream (in the context of video coding) may be defined as a sequence of one or more bitstreams. In some coding formats or standards, the end of the first bitstream may be indicated by a specific NAL unit, which may be referred to as the end of bitstream (EOB) NAL unit and which is the last NAL unit of the bitstream.
A coded video sequence (CVS) may be defined as such a sequence of coded pictures in decoding order that is independently decodable and is followed by another coded video sequence or the end of the bitstream.
A coded picture is a coded representation of a picture.
Hybrid video codecs, for example ITU-T H.264, may encode the video information in two phases. At first, pixel values in a certain picture area (or "block") are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Then, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This may be done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
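The two-phase principle may be illustrated by the following toy sketch for a single block, in which a generic DCT and a single uniform quantization step stand in for the codec-specific transform and rate control; it is not the process of any particular standard.

```python
# Toy illustration of hybrid coding for one block: predict, transform and quantize
# the prediction error, then reconstruct. Larger q_step -> fewer bits, coarser result.
import numpy as np
from scipy.fft import dctn, idctn

def code_block(original: np.ndarray, prediction: np.ndarray, q_step: float):
    residual = original.astype(np.float64) - prediction        # prediction error
    coeffs = dctn(residual, norm="ortho")                      # transform
    levels = np.round(coeffs / q_step)                         # quantization discards information
    recon_residual = idctn(levels * q_step, norm="ortho")      # decoder-side inverse operations
    reconstructed = prediction + recon_residual
    return levels, reconstructed                               # levels would be entropy coded
```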
In temporal prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures). In intra block copy (IBC; a.k.a. intra-block copy prediction or current picture referencing), prediction is applied similarly to temporal prediction, but the reference picture is the current picture and only previously decoded samples can be referred in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or similar process than temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction may be exploited in intra coding, where no inter prediction is applied.
One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. This kind of prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters. Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-based variable length coding (CAVLC) or any similar entropy coding. Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
Video coding standards may specify the bitstream syntax and semantics as well as the decoding process for error-free bitstreams, whereas the encoding process might not be specified, but encoders may just be required to generate conforming bitstreams. Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD). The standards may contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding may be optional and decoding process for erroneous bitstreams might not have been specified. A syntax element may be defined as an element of data represented in the bitstream. A syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.
An elementary unit for the input to an encoder and the output of a decoder, respectively, in most cases is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture or a reconstructed picture.
The source and decoded pictures each comprise one or more sample arrays, such as one of the following sets of sample arrays:
- Luma (Y) only (monochrome),
- Luma and two chroma (YCbCr or YCgCo),
- Green, Blue and Red (GBR, also known as RGB),
- Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
In the following, these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr, regardless of the actual color representation method in use. The actual color representation method in use can be indicated in a coded bitstream, e.g. using the Video Usability Information (VUI) syntax of HEVC or alike. A component may be defined as an array or single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that compose a picture in monochrome format.
A picture may be defined to be either a frame or a field. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or chroma sample arrays may be subsampled when compared to luma sample arrays.
Some chroma formats may be summarized as follows (see the sketch after this list):
- In monochrome sampling there is only one sample array, which may be nominally considered the luma array.
- In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
- In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
- In 4:4:4 sampling when no separate color planes are in use, each of the two chroma arrays has the same height and width as the luma array.
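The following sketch gives the chroma array dimensions implied by these sampling formats for a hypothetical 1920 x 1080 luma array.

```python
def chroma_size(luma_w: int, luma_h: int, fmt: str):
    """Chroma array dimensions implied by the chroma format (None for monochrome)."""
    if fmt == "monochrome":
        return None                        # only the luma array exists
    if fmt == "4:2:0":
        return luma_w // 2, luma_h // 2    # half width, half height
    if fmt == "4:2:2":
        return luma_w // 2, luma_h         # half width, same height
    if fmt == "4:4:4":
        return luma_w, luma_h              # same width and height
    raise ValueError(fmt)

print(chroma_size(1920, 1080, "4:2:0"))    # (960, 540)
print(chroma_size(1920, 1080, "4:2:2"))    # (960, 1080)
```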
Coding formats or standards may allow to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream. When separate color planes are in use, each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.
When chroma subsampling is in use (e.g. 4:2:0 or 4:2:2 chroma sampling), the location of chroma samples with respect to luma samples may be determined in the encoder side (e.g. as a pre-processing step or as part of encoding). The chroma sample positions with respect to luma sample positions may be pre-defined for example in a coding standard, such as H.264/AVC or HEVC, or may be indicated in the bitstream for example as part of VUI of H.264/AVC or HEVC.
Generally, the source video sequence(s) provided as input for encoding may either represent interlaced source content or progressive source content. Fields of opposite parity have been captured at different times for interlaced source content. Progressive source content contains captured frames. An encoder may encode fields of interlaced source content in two ways: a pair of interlaced fields may be coded into a coded frame or a field may be coded as a coded field. Likewise, an encoder may encode frames of progressive source content in two ways: a frame of progressive source content may be coded into a coded frame or a pair of coded fields. A field pair or a complementary field pair may be defined as two fields next to each other in decoding and/or output order, having opposite parity (i.e. one being a top field and another being a bottom field) and neither belonging to any other complementary field pair. Some video coding standards or schemes allow mixing of coded frames and coded fields in the same coded video sequence. Moreover, predicting a coded field from a field in a coded frame and/or predicting a coded frame for a complementary field pair (coded as fields) may be enabled in encoding and/or decoding.

Partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.
In a draft version of H.266/VVC, the following partitioning applies. It is noted that what is described here might still evolve in later draft versions of H.266/VVC until the standard is finalized. Pictures are partitioned into coding tree units (CTUs) with a maximum size of 128 x 128, although encoders may choose to use a smaller size, such as 64 x 64. A coding tree unit (CTU) is first partitioned by a quaternary tree (a.k.a. quadtree) structure. Then the quaternary tree leaf nodes can be further partitioned by a multi-type tree structure. There are four splitting types in multi-type tree structure, vertical binary splitting, horizontal binary splitting, vertical ternary splitting, and horizontal ternary splitting. The multi-type tree leaf nodes are called coding units (CUs). CU, PU and TU have the same block size, unless the CU is too large for the maximum transform length. A segmentation structure for a CTU is a quadtree with nested multi-type tree using binary and ternary splits, i.e. no separate CU, PU and TU concepts are in use, except when needed for CUs that have a size too large for the maximum transform length. A CU can have either a square or rectangular shape.
The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial domain). After applying prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
The filtering may for example include one or more of the following: deblocking, sample adaptive offset (SAO), adaptive loop filtering (ALF), and/or luma mapping with chroma scaling (LMCS).
A deblocking filter (DBF) may be configured to reduce blocking artefacts due to block-based coding. DBF may be applied (e.g., applied only) to samples located at prediction unit and/or transform block boundaries, and may be constrained not to apply at the picture boundaries or when disabled at image segment boundaries, such as slice, tile, or subpicture boundaries. Horizontal filtering may be applied (e.g., applied first) for vertical boundaries, and vertical filtering may be applied for horizontal boundaries.
The deblocking loop filter may include multiple filtering modes or strengths, which may be adaptively selected based on the features of the blocks adjacent to the boundary, such as the quantization parameter value, and/or signaling included by the encoder in the bitstream. For example, the deblocking loop filter may comprise a normal filtering mode and a strong filtering mode, which may differ in terms of the number of filter taps (i.e. number of samples being filtered on both sides of the boundary) and/or the filter tap values. For example, filtering of two samples along both sides of the boundary may be performed with a filter having the impulse response of (3 7 9 -3)/16, when omitting the potential impact of a clipping operation.
Sample adaptive offset (SAO) may be another in-loop filtering process that modifies decoded samples by conditionally adding an offset value to a sample (e.g., each sample), based on values in look-up tables transmitted by the encoder. SAO may have one or more (e.g., two) operation modes, such as band offset and edge offset modes. In the band offset mode, an offset may be added to the sample value depending on the sample amplitude. The full sample amplitude range may be divided into a fixed number of bands, such as 32 bands, and sample values belonging to four of these bands may be modified by adding a positive or negative offset, which may be signalled for each coding tree unit (CTU). In the edge offset mode, the horizontal, vertical, and two diagonal gradients may be used for classification.
In VVC, an Adaptive Loop Filter (ALF) with block-based filter adaptation may be applied. In VVC, ALF may be applied for luma, chroma, and/or cross component (CC ALF or CC-ALF). When luma ALF is enabled, one among 25 filters is selected for each 4x4 block for the luma component, based on the direction and activity of local gradients. CC-ALF uses luma sample values to refine each chroma component by applying an adaptive, linear filter to the luma channel and then using the output of this filtering operation for chroma refinement.
In VVC, there can be up to 8 ALF APSs. It is required that the content of ALF APSs must remain unchanged for an entire picture, i.e. ALF APSs cannot be updated in the middle of a picture. Each ALF APS can have up to 25 luma filters, up to 8 chroma filters, up to 4 CC-ALF filters for the Cb component, and up to 4 CC-ALF filters for the Cr component. ALF filter parameters may be signalled in Adaptation Parameter Set (APS). In one ALF APS, up to 25 (i.e. NumAlfFilters) sets of luma filter coefficients and clipping value indexes, and up to eight sets of chroma filter coefficients and clipping value indexes could be signalled.
In a VVC slice header, the number and identifiers of the ALF APSs used for the current slice are signaled. In a slice header, up to 7 ALF APS identifiers can be signaled to specify the luma filter sets that are used for the current slice. The filtering process can further be controlled at CTU level. A flag is signalled to indicate whether ALF is applied to luma blocks in a CTU. For the luma blocks of each CTU, an encoder may choose and indicate in a bitstream and a decoder may decode from a bitstream, on CTU basis, a filter set among 16 fixed filter sets (pre-defined in the VVC standard) and the filter sets from ALF APSs. A filter set index is signaled for a CTU to indicate which luma filter set is applied. The 16 fixed filter sets are pre-defined and hard-coded in both the encoder and the decoder.
The following applies for chroma ALF in VVC: For the chroma components, an APS index is signaled in the slice header to indicate the chroma filter sets being used for the current slice. At CTU level, a filter index is signaled for each CTU if there is more than one chroma filter set in the ALF APS.
VVC encoders may generate the content of ALF APSs adaptively based on the content and may select the applied ALF APSs in a spatially adaptive manner both on slice basis and on CTU basis.
In VVC, LMCS is a process that is applied as part of the decoding process that maps luma samples to particular values and may apply a scaling operation to the values of chroma samples. An encoder may choose the LMCS parameters adaptively on picture basis based on the content of the picture(s).
The motion information may be indicated with motion vectors associated with each motion compensated image block in video codecs. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, those may be coded differentially with respect to block specific predicted motion vectors. The predicted motion vectors may be created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictors is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index may be predicted from adjacent blocks and/or co-located blocks in a temporal reference picture. Moreover, high efficiency video codecs may employ an additional motion information coding/decoding mechanism often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures and the used motion field information is signaled within a motion field candidate list filled with motion field information of available adjacent/co-located blocks.
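Differential motion vector coding with a median predictor may be sketched as follows; actual codecs define the exact neighbour set and candidate list construction, so the functions below are illustrative assumptions only.

```python
from statistics import median

def median_mv_predictor(neighbour_mvs):
    """neighbour_mvs: list of (mvx, mvy) tuples from spatially adjacent blocks."""
    return (median(mv[0] for mv in neighbour_mvs),
            median(mv[1] for mv in neighbour_mvs))

def encode_mvd(mv, neighbour_mvs):
    px, py = median_mv_predictor(neighbour_mvs)
    return mv[0] - px, mv[1] - py          # only the difference (MVD) is coded

def decode_mv(mvd, neighbour_mvs):
    px, py = median_mv_predictor(neighbour_mvs)
    return mvd[0] + px, mvd[1] + py
```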
Video codecs may support motion compensated prediction from one source image (uni-prediction) and two sources (bi-prediction). In the case of uni-prediction, a single motion vector is applied whereas in the case of bi-prediction, two motion vectors are signaled and the motion compensated predictions from two sources are averaged to create the final sample prediction. In the case of weighted prediction, the relative weights of the two predictions can be adjusted, or a signaled offset can be added to the prediction signal.
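The sample-level combination of motion-compensated predictions may be sketched as follows; the rounding, weight precision and clipping shown are illustrative assumptions rather than the exact operations of any particular codec.

```python
import numpy as np

def bi_predict(p0: np.ndarray, p1: np.ndarray) -> np.ndarray:
    # plain bi-prediction: rounded average of two motion-compensated blocks
    return (p0.astype(np.int32) + p1.astype(np.int32) + 1) >> 1

def weighted_bi_predict(p0, p1, w0, w1, offset, shift=6, max_val=255):
    # weighted prediction with a signalled offset; weights in fixed-point with `shift`
    p = (w0 * p0.astype(np.int64) + w1 * p1.astype(np.int64) + (1 << (shift - 1))) >> shift
    return np.clip(p + offset, 0, max_val)    # assuming 8-bit samples by default
```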
In general, the (de)coding order of pictures need not be the same as the output order of pictures. Moreover, any previous pictures in (de)coding order may generally be used as reference pictures for motion compensation. Thus, generally, the reference pictures for bi-prediction may not be required to comprise both subsequent picture(s) and previous picture(s) in output order. The term temporally forward prediction may refer to uni-prediction or bi-prediction where all the reference pictures are preceding the current picture in output order. The term temporally backward prediction may refer to uni-prediction or bi-prediction where all the reference pictures are succeeding the current picture in output order. The term temporally bi-directional prediction may refer to uni-prediction or bi-prediction where reference pictures are or may be both preceding and succeeding the current picture in output order.
In addition to applying motion compensation for inter picture prediction, a similar approach can be applied to intra picture prediction. In this case the displacement vector indicates from where within the same picture a block of samples can be copied to form a prediction of the block to be coded or decoded. This kind of intra block copying method can improve the coding efficiency substantially in the presence of repeating structures within the frame - such as text or other graphics.
The prediction residual after motion compensation or intra prediction may be first transformed with a transform kernel (like DCT) and then coded. The reason for this is that often there still exists some correlation among the residual and transform can in many cases help reduce this correlation and provide more efficient coding.
In VVC, joint coding of chroma residuals (JCCR) may improve coding efficiency when there is correlation between Cb and Cr residuals. In JCCR mode the residual samples for both Cb and Cr are coded as a single transform block. When the mode is active, one of the chroma channels is assigned the decoded transform block while the other channel is assigned either a negative or a positive version of the block, or a version of the block with the sample values halved. Signalling on transform block level is used to indicate if halved or full sample values are used, while there is a picture header flag (the ph_joint_cbcr_sign_flag syntax element) determining the sign for the other residual.
Video encoders may utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:

C = D + λR (Eq. 1)

where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
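A mode decision based on Eq. 1 may be sketched as follows, assuming mean squared error as the distortion measure and a known rate for each candidate; the function names are hypothetical.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def choose_mode(original, candidates, lam):
    """candidates: iterable of (mode_name, reconstruction, rate_in_bits);
    returns the mode minimizing C = D + lam * R."""
    best_mode, best_cost = None, float("inf")
    for mode, recon, rate in candidates:
        cost = mse(original, recon) + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```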
Some codecs use a concept of picture order count (POC). A value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures. A POC value of a picture may be relative to POC values of other pictures in the same coded video sequence, whereas POC values of pictures in different coded video sequences might not be indicative of their respective output order. POC may be used in the decoding process for example for implicit scaling of motion vectors and for reference picture list initialization. Furthermore, POC may be used in the verification of output order conformance.
In video coding standards, a compliant bitstream must be able to be decoded by a hypothetical reference decoder that may be conceptually connected to the output of an encoder and consists of at least a pre-decoder buffer, a decoder and an output/display unit. This virtual decoder may be known as the hypothetical reference decoder (HRD) or the video buffering verifier (VBV). A stream is compliant if it can be decoded by the HRD without buffer overflow or, in some cases, underflow. Buffer overflow happens if more bits are to be placed into the buffer when it is full. Buffer underflow happens if some bits are not in the buffer when said bits are to be fetched from the buffer for decoding/playback. One of the motivations for the HRD is to avoid so-called evil bitstreams, which would consume such a large quantity of resources that practical decoder implementations would not be able to handle.
HRD models may include instantaneous decoding, while the input bitrate to the coded picture buffer (CPB) of HRD may be regarded as a constraint for the encoder and the bitstream on decoding rate of coded data and a requirement for decoders for the processing rate. An encoder may include a CPB as specified in the HRD for verifying and controlling that buffering constraints are obeyed in the encoding. A decoder implementation may also have a CPB that may, but does not necessarily, operate similarly or identically to the CPB specified for HRD.
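A much simplified constant-bitrate CPB check in the spirit of a leaky-bucket model may be sketched as follows; actual HRD timing, parameters and conformance points are considerably richer than shown.

```python
def check_cpb(picture_sizes_bits, removal_times_s, bitrate_bps, cpb_size_bits):
    """Simplified CBR model: bits arrive at a constant rate and each picture's bits
    are removed instantaneously at its removal time."""
    fullness, prev_time = 0.0, 0.0
    for size, t in zip(picture_sizes_bits, removal_times_s):
        fullness += bitrate_bps * (t - prev_time)   # bits delivered since the last removal
        if fullness > cpb_size_bits:
            return f"overflow before t={t} s"       # more bits arrive than the buffer can hold
        if fullness < size:
            return f"underflow at t={t} s"          # picture bits not yet in the buffer
        fullness -= size                            # instantaneous removal for decoding
        prev_time = t
    return "conforming under this simplified model"
```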
A Decoded Picture Buffer (DPB) may be used in the encoder and/or in the decoder. There may be two reasons to buffer decoded pictures, for reference in inter prediction and for reordering decoded pictures into output order. Since some coding formats, such as HEVC, provide a great deal of flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Hence, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is not needed for output. An HRD may also include a DPB. DPBs of an HRD and a decoder implementation may but do not need to operate identically.
Output order may be defined as the order in which the decoded pictures are output from the decoded picture buffer (for the decoded pictures that are to be output from the decoded picture buffer).
A decoder and/or an HRD may comprise a picture output process. The output process may be considered to be a process in which the decoder provides decoded and cropped picture as the output of the decoding process. The output process may be a part of video coding standards, e.g. as a part of the hypothetical reference decoder specification. In output cropping, lines and/or columns of samples may be removed from decoded pictures according to a cropping rectangle to form output pictures. A cropped decoded picture may be defined as the result of cropping a decoded picture based on the conformance cropping window specified e.g. in the sequence parameter set that is referred to by the corresponding coded picture.
One or more syntax structures for (decoded) reference picture marking may exist in a video coding system. An encoder generates an instance of a syntax structure e.g. in each coded picture, and a decoder decodes an instance of the syntax structure e.g. from each coded picture. For example, the decoding of the syntax structure may cause pictures to be adaptively marked as "used for reference" or "unused for reference".
A reference picture set (RPS) syntax structure of HEVC is an example of a syntax structure for reference picture marking. A reference picture set valid or active for a picture includes all the reference pictures that may be used as reference for the picture and all the reference pictures that are kept marked as "used for reference" for any subsequent pictures in decoding order. The reference pictures that are kept marked as "used for reference" for any subsequent pictures in decoding order but that are not used as reference picture for the current picture or image segment may be considered inactive. For example, they might not be included in the initial reference picture list(s).
In some coding formats and codecs, a distinction is made between so-called short-term and long-term reference pictures. This distinction may affect some decoding processes such as motion vector scaling. Syntax structure(s) for marking reference pictures may be indicative of marking a picture as "used for long-term reference" or "used for short-term reference".
In some coding formats, a reference picture for inter prediction may be indicated with an index to a reference picture list. In some codecs, two reference picture lists (reference picture list 0 and reference picture list 1) are generated for each bi-predictive (B) slice, and one reference picture list (reference picture list 0) is formed for each inter-coded (P) slice.
In VVC, reference picture lists are indicated directly in a reference picture list syntax structure. When a picture is present in any reference picture list of the current picture (within active or inactive entries of any reference picture list), it is marked as "used for long-term reference" or "used for short-term reference". When a picture is present in no reference picture list of the current picture, it is marked as "unused for reference". The abbreviation RPL may be used to refer to the reference picture list syntax structure and/or to one or more reference picture lists. The number of active entries in a reference picture list may be indicated by an encoder and/or decoded by a decoder, and may indicate the number of pictures starting from the first list entry that may be used as reference for prediction of the current picture. The entries in a reference picture list that are not among the active entries may be defined to be inactive entries, are not used as reference for prediction of the current picture, and may be used as reference for prediction of subsequent pictures in decoding order.
A Decoded Picture Buffer (DPB) may be used in the encoder and/or in the decoder. There may be two reasons to buffer decoded pictures, for reference in inter prediction and for reordering decoded pictures into output order. Since some codecs, such as VVC, provide flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering may waste memory resources. Hence, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is not needed for output.
Scalable video coding refers to coding structure where one bitstream can contain multiple representations of the content at different bitrates, resolutions or frame rates. In these cases, the receiver can extract the desired representation depending on its characteristics (e.g. resolution that matches best the display device). Alternatively, a server or a network element can extract the portions of the bitstream to be transmitted to the receiver depending on e.g. the network characteristics or processing capabilities of the receiver. A scalable bitstream may include a "base layer" providing the lowest quality video available and one or more enhancement layers that enhance the video quality when received and decoded together with the lower layers. In order to improve coding efficiency for the enhancement layers, the coded representation of that layer may depend on the lower layers. E.g. the motion and mode information of the enhancement layer can be predicted from lower layers. Similarly, the pixel data of the lower layers can be used to create prediction for the enhancement layer.
A scalable video codec for quality scalability (also known as Signal-to-Noise or SNR) and/or spatial scalability may be implemented as follows. For a base layer, a conventional non-scalable video encoder and decoder is used. The reconstructed/decoded pictures of the base layer are included in the reference picture buffer for an enhancement layer. In H.264/AVC, HEVC, and similar codecs using reference picture list(s) for inter prediction, the base layer decoded pictures may be inserted into a reference picture list(s) for coding/decoding of an enhancement layer picture similarly to the decoded reference pictures of the enhancement layer. Consequently, the encoder may choose a base-layer reference picture as inter prediction reference and indicate its use e.g. with a reference picture index in the coded bitstream. The decoder decodes from the bitstream, for example from a reference picture index, that a base-layer picture is used as inter prediction reference for the enhancement layer. When a decoded base-layer picture is used as prediction reference for an enhancement layer, it is referred to as an inter-layer reference picture.
Scalability modes or scalability dimensions may include but are not limited to the following:
• Quality scalability: Base layer pictures are coded at a lower quality than enhancement layer pictures, which may be achieved for example using a greater quantization parameter value (i.e., a greater quantization step size for transform coefficient quantization) in the base layer than in the enhancement layer.
• Spatial scalability: Base layer pictures are coded at a lower resolution (i.e. have fewer samples) than enhancement layer pictures. Spatial scalability and quality scalability may sometimes be considered the same type of scalability.
• Bit-depth scalability: Base layer pictures are coded at lower bit-depth (e.g. 8 bits) than enhancement layer pictures (e.g. 10 or 12 bits).
• Dynamic range scalability: Scalable layers represent a different dynamic range and/or images obtained using a different tone mapping function and/or a different optical transfer function.
• Chroma format scalability: Base layer pictures provide lower spatial resolution in chroma sample arrays (e.g. coded in 4:2:0 chroma format) than enhancement layer pictures (e.g. 4:4:4 format).
• Color gamut scalability: enhancement layer pictures have a richer/broader color representation range than that of the base layer pictures - for example the enhancement layer may have UHDTV (ITU-R BT.2020) color gamut and the base layer may have the ITU-R BT.709 color gamut.
• Region-of-interest (ROI) scalability: An enhancement layer represents a spatial subset of the base layer. ROI scalability may be used together with other types of scalability, e.g. quality or spatial scalability so that the enhancement layer provides higher subjective quality for the spatial subset.
• View scalability, which may also be referred to as multiview coding. The base layer represents a first view, whereas an enhancement layer represents a second view.
• Depth scalability, which may also be referred to as depth-enhanced coding. A layer or some layers of a bitstream may represent texture view(s), while other layer or layers may represent depth view(s).
In all of the above scalability cases, base layer information could be used to code enhancement layer to minimize the additional bitrate overhead.
Scalability can be enabled in two basic ways: either by introducing new coding modes for performing prediction of pixel values or syntax from lower layers of the scalable representation, or by placing the lower layer pictures into the reference picture buffer (decoded picture buffer, DPB) of the higher layer. The first approach is more flexible and thus can provide better coding efficiency in most cases. However, the second approach, reference frame based scalability, can be implemented very efficiently with minimal changes to single layer codecs while still achieving the majority of the coding efficiency gains available. Essentially a reference frame based scalability codec can be implemented by utilizing the same hardware or software implementation for all the layers, just taking care of the DPB management.
An elementary unit for the output of encoders of some coding formats, such as HEVC and VVC, and the input of decoders of some coding formats, such as HEVC and VVC, is a Network Abstraction Layer (NAL) unit. For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures.
A byte stream format may be specified for NAL unit streams for transmission or storage environments that do not provide framing structures. The byte stream format separates NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders may run a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to enable straightforward gateway operation between packet- and stream-oriented systems, start code emulation prevention may always be performed regardless of whether the byte stream format is in use or not.
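The byte-oriented emulation prevention algorithm may be sketched as follows: whenever two zero bytes inside the payload would be followed by a byte in the range 0x00..0x03, an emulation prevention byte 0x03 is inserted so that a start code prefix cannot appear within the NAL unit.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    out = bytearray()
    zero_run = 0
    for b in rbsp:
        if zero_run >= 2 and b <= 0x03:
            out.append(0x03)            # emulation prevention byte
            zero_run = 0
        out.append(b)
        zero_run = zero_run + 1 if b == 0x00 else 0
    return bytes(out)

# Example: a payload containing 00 00 01 is carried as 00 00 03 01.
print(add_emulation_prevention(bytes([0x00, 0x00, 0x01])).hex())   # '00000301'
```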
A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload (RBSP) interspersed as necessary with emulation prevention bytes. A RBSP may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.
NAL units consist of a header and payload. In VVC, a two-byte NAL unit header is used for all specified NAL unit types, while in other codecs the NAL unit header may be similar to that in VVC. In VVC, the NAL unit header comprises a five-bit NAL unit type indication (nal_unit_type), a three-bit nuh_temporal_id_plus1 indication for the temporal level or sub-layer (which may be required to be greater than or equal to 1) and a six-bit nuh_layer_id syntax element. The nuh_temporal_id_plus1 syntax element may be regarded as a temporal identifier for the NAL unit, and a zero-based Temporalld variable may be derived as follows: Temporalld = nuh_temporal_id_plus1 - 1. The abbreviation TID may be used interchangeably with the Temporalld variable. Temporalld equal to 0 corresponds to the lowest temporal level. The value of nuh_temporal_id_plus1 is required to be non-zero in order to avoid start code emulation involving the two NAL unit header bytes. The bitstream created by excluding all VCL NAL units having a Temporalld greater than or equal to a selected value and including all other VCL NAL units remains conforming. Consequently, a picture having Temporalld equal to tid_value does not use any picture having a Temporalld greater than tid_value as inter prediction reference. A sub-layer or a temporal sub-layer may be defined to be a temporal scalable layer (or a temporal layer, TL) of a temporal scalable bitstream. Such a temporal scalable layer may comprise VCL NAL units with a particular value of the Temporalld variable and the associated non-VCL NAL units. nuh_layer_id can be understood as a scalability layer identifier.
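As an illustration of the NAL unit header structure described above, the following sketch parses the two-byte VVC NAL unit header and derives the Temporalld variable. The function name and the returned tuple layout are hypothetical and chosen only for this example.

def parse_vvc_nal_unit_header(header: bytes):
    """Parse the two-byte VVC NAL unit header and derive Temporalld.

    Bit layout (16 bits): forbidden_zero_bit (1), nuh_reserved_zero_bit (1),
    nuh_layer_id (6), nal_unit_type (5), nuh_temporal_id_plus1 (3)."""
    assert len(header) >= 2
    b0, b1 = header[0], header[1]
    nuh_layer_id = b0 & 0x3F                 # low six bits of the first byte
    nal_unit_type = (b1 >> 3) & 0x1F         # high five bits of the second byte
    nuh_temporal_id_plus1 = b1 & 0x07        # low three bits, required to be non-zero
    temporal_id = nuh_temporal_id_plus1 - 1  # Temporalld = nuh_temporal_id_plus1 - 1
    return nal_unit_type, nuh_layer_id, temporal_id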
NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL NAL units may be coded slice NAL units. In HEVC and VVC, VCL NAL units contain syntax elements representing one or more CUs. In HEVC and VVC, a NAL unit type value within a certain range indicates a VCL NAL unit, and the VCL NAL unit type may indicate a picture type.
A Random Access Point (RAP) picture, which may also be referred to as an intra random access point (IRAP) picture, when contained in an independent layer, may comprise only intra-coded image segments. Furthermore, a RAP picture may constrain subsequent pictures (within the same layer) in output order to be such that they can be correctly decoded without performing the decoding process of any pictures that precede the RAP picture in decoding order. There may be pictures in a bitstream that contain only intra-coded slices that are not IRAP pictures.
In some contexts, the term random-access picture may be used interchangeably with the term RAP picture or the term IRAP picture. In some contexts, a RAP picture or an IRAP picture may be defined as one category of random-access pictures, characterized in that they contain only intra-coded image segments, whereas other category or categories of random-access pictures may allow in-picture prediction, such as intra block copy.
Types and abbreviations for VCL NAL unit types may include the following: trailing (TRAIL), Temporal Sub-layer Access (TSA), Step-wise Temporal Sub-layer Access (STSA), Random Access Decodable Leading (RADL), Random Access Skipped Leading (RASL), Instantaneous Decoding Refresh (IDR), Clean Random Access (CRA). When all VCL NAL units of a picture have the same NAL unit type, the types and abbreviations may be used as picture types, e.g. trailing picture (a.k.a. TRAIL picture).
Some VCL NAL unit types may be more fine-grained than indicated in the paragraph above. For example, two types of IDR pictures may be specified: IDR without leading pictures and IDR with random access decodable leading pictures (i.e. without RASL pictures).
In VVC, an IRAP picture may be a CRA picture or an IDR picture.
In HEVC and VVC, provided the necessary parameter sets are available when they are activated or referenced, an IRAP picture at an independent layer and all subsequent non-RASL pictures at the independent layer in decoding order can be correctly decoded without performing the decoding process of any pictures that precede the IRAP picture in decoding order.
In HEVC and VVC, a CRA picture may be the first picture in the bitstream in decoding order, or may appear later in the bitstream. CRA pictures allow so-called leading pictures that follow the CRA picture in decoding order but precede it in output order. Some of the leading pictures, so-called RASL pictures, may use pictures decoded before the CRA picture (in decoding order) as a reference. Pictures that follow a CRA picture in both decoding and output order are decodable if random access is performed at the CRA picture, and hence clean random access is achieved similarly to the clean random access functionality of an IDR picture.
A CRA picture may have associated RADL or RASL pictures. When a CRA picture is the first picture in the bitstream in decoding order, the CRA picture is the first picture of a coded video sequence in decoding order, and any associated RASL pictures are not output by the decoder and may not be decodable, as they may contain references to pictures that are not present in the bitstream.
A leading picture is a picture that precedes the associated RAP picture in output order and follows the associated RAP picture in decoding order. The associated RAP picture is the previous RAP picture in decoding order (if present). In some coding specifications, such as HEVC and VVC, a leading picture is either a RADL picture or a RASL picture.
All RASL pictures are leading pictures of an associated IRAP picture (e.g. CRA picture). When the associated RAP picture is the first coded picture in the coded video sequence or in the bitstream, the RASL picture is not output and may not be correctly decodable, as the RASL picture may contain references to pictures that are not present in the bitstream. However, a RASL picture can be correctly decoded if the decoding had started from a RAP picture before the associated RAP picture of the RASL picture. RASL pictures are not used as reference pictures for the decoding process of non-RASL pictures. When present, all RASL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture.
All RADL pictures are leading pictures. RADL pictures are not used as reference pictures for the decoding process of trailing pictures of the same associated RAP picture. When present, all RADL pictures precede, in decoding order, all trailing pictures of the same associated RAP picture. RADL pictures do not refer to any picture preceding the associated RAP picture in decoding order and can therefore be correctly decoded when the decoding starts from the associated RAP picture.
Two IDR picture types may be defined and indicated: IDR pictures without leading pictures and IDR pictures that may have associated decodable leading pictures (i.e. RADL pictures).
A trailing picture may be defined as a picture that follows the associated RAP picture in output order (and also in decoding order). Additionally, a trailing picture may be required not to be classified as any other picture type, such as STSA picture.
In HEVC there are two picture types, the TSA and STSA picture types, that can be used to indicate temporal sub-layer switching points. VVC specifies the STSA picture type similarly to the STSA picture type in HEVC. If temporal sub-layers with Temporalld up to N had been decoded until the TSA or STSA picture (exclusive) and the TSA or STSA picture has Temporalld equal to N+1, the TSA or STSA picture enables decoding of all subsequent pictures (in decoding order) having Temporalld equal to N+1. The TSA picture type may impose restrictions on the TSA picture itself and all pictures in the same sub-layer that follow the TSA picture in decoding order. None of these pictures is allowed to use inter prediction from any picture in the same sub-layer that precedes the TSA picture in decoding order. The TSA definition may further impose restrictions on the pictures in higher sub-layers that follow the TSA picture in decoding order. None of these pictures is allowed to refer to a picture that precedes the TSA picture in decoding order if that picture belongs to the same or a higher sub-layer as the TSA picture. TSA pictures have Temporalld greater than 0. The STSA picture is similar to the TSA picture but does not impose restrictions on the pictures in higher sub-layers that follow the STSA picture in decoding order and hence enables up-switching only onto the sub-layer where the STSA picture resides. In nested temporal scalability, all (trailing) pictures with Temporalld greater than 0 could be labeled as TSA pictures.
VVC includes a gradual decoding refresh (GDR) picture type, which is similar to a trailing picture, but also guarantees that decoding can be started from a GDR picture and decoded pictures are correct in content at and after an indicated position (a.k.a. recovery point) in output order.
Images can be split into independently codable and decodable image segments (e.g. slices or tiles). Such image segments may enable parallel processing. Image segments may be coded as separate units in the bitstream, such as VCL NAL units in HEVC and VVC. Coded image segments may comprise a header and a payload, wherein the header contains parameter values needed for decoding the payload. For example, coded slices may comprise a slice header and slice data, where the slice header contains parameter values for decoding the coded image data contained in the slice data.
"Tiles" may refer to image segments that have been defined as rectangular image regions along a tile grid.
In the HEVC and VVC standards, a picture can be partitioned into tiles, which are rectangular and contain an integer number of CTUs. In the HEVC and VVC standards, a picture is divided into one or more tile rows and one or more tile columns, and the partitioning of a picture to tiles forms a grid that may be characterized by a list of tile column widths (in CTUs) and a list of tile row heights (in CTUs). A tile is a sequence of coding tree units (CTUs) that covers one “cell” in the tile grid, i.e., a rectangular region of a picture. In HEVC, tiles are ordered in the bitstream consecutively in the raster scan order of the tile grid. In HEVC, a tile may contain an integer number of slices.
In HEVC, a slice consists of an integer number of CTUs. The CTUs are scanned in the raster scan order of CTUs within tiles or within a picture, if tiles are not in use. A slice may contain an integer number of tiles or a slice can be contained in a tile. Within a CTU, the CUs have a specific scan order.
In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the coded picture. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL (Network Abstraction Layer) unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
In VVC, a slice may be defined as an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile of a picture that are exclusively contained in a single NAL unit. In VVC, two modes of slices are supported, namely the raster-scan slice mode and the rectangular slice mode. In the raster-scan slice mode, a slice contains a sequence of tiles in a tile raster scan of a picture. In the rectangular slice mode, a slice contains a number of complete tiles or a number of contiguous and complete CTU rows within a single tile.
In VVC, a coded slice is carried in a single video coding layer (VCL) NAL unit.
In video coding, an isolated region may be defined as a picture region that is allowed to depend only on the corresponding isolated region in reference pictures and does not depend on any other picture regions in the current picture or in the reference pictures. The corresponding isolated region in reference pictures may be for example the picture region that collocates with the isolated region in a current picture. A coded isolated region may be decoded without the presence of any picture regions of the same coded picture.
Pictures, whose isolated regions are predicted from each other, may be grouped into an isolated-region picture group. An isolated region can be inter-predicted from the corresponding isolated region in other pictures within the same isolated-region picture group, whereas inter prediction from other isolated regions or outside the isolated-region picture group may be disallowed.
A leftover region (a.k.a. non-isolated region) may be defined as a picture region that is not constrained like an isolated region and thus may be predicted from picture regions that do not correspond to the leftover region itself in the current picture or reference pictures.
VVC supports subpictures (a.k.a. sub-pictures). A subpicture may be defined as a rectangular region of one or more slices within a picture, wherein the one or more slices are complete. Consequently, a subpicture consists of one or more slices that collectively cover a rectangular region of a picture. The slices of a subpicture may be required to be rectangular slices.
In VVC, partitioning of a picture to subpictures may be indicated in and/or decoded from an SPS (Sequence Parameter Set); in other words, the subpicture layout may be indicated in and/or decoded from an SPS. The SPS syntax may indicate the partitioning of a picture to subpictures e.g. by providing for each subpicture syntax elements indicative of: the x and y coordinates of the top-left corner of the subpicture, the width of the subpicture, and the height of the subpicture, in coding tree units (CTU). Thus, a subpicture layout indicates the positions, widths, and heights of subpictures within a picture but does not assign any particular identifiers to the subpictures or subpicture sequences of the subpicture layout.
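The following sketch illustrates, under the SPS-based layout signalling described above, how a subpicture layout entry given in CTU units could be represented and converted to luma sample coordinates. The class and function names, the CTU size, and the example layout values are assumptions made only for illustration.

from dataclasses import dataclass

@dataclass
class SubpictureLayoutEntry:
    """One entry of a subpicture layout, in CTU units as in the SPS."""
    top_left_x_ctu: int
    top_left_y_ctu: int
    width_ctu: int
    height_ctu: int

def subpicture_rect_in_luma_samples(entry: SubpictureLayoutEntry, ctu_size: int):
    """Return (x, y, width, height) of the subpicture in luma samples."""
    return (entry.top_left_x_ctu * ctu_size,
            entry.top_left_y_ctu * ctu_size,
            entry.width_ctu * ctu_size,
            entry.height_ctu * ctu_size)

# Example: a two-subpicture layout for a 3840x1920 picture with 128x128 CTUs,
# i.e. two side-by-side 1920x1920 subpictures (assumed values).
layout = [SubpictureLayoutEntry(0, 0, 15, 15), SubpictureLayoutEntry(15, 0, 15, 15)]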
In VVC, one or more of the following properties may be indicated (e.g. by an encoder) or decoded (e.g. by a decoder) or inferred (e.g. by an encoder and/or a decoder) for the subpictures collectively or per each subpicture individually: i) whether or not a subpicture is treated as a picture in the decoding process; in some cases, this property excludes in-loop filtering operations, which may be separately indicated/decoded/inferred; ii) whether or not in-loop filtering operations are performed across the subpicture boundaries.
A VVC subpicture with boundaries treated like picture boundaries is an example of an isolated region.
A motion-constrained tile set (MCTS) is such that the inter prediction process is constrained in encoding such that no sample value outside the MCTS, and no sample value at a fractional sample position that is derived using one or more sample values outside the motion-constrained tile set, is used for inter prediction of any sample within the motion-constrained tile set. Additionally, the encoding of an MCTS is constrained in a manner that motion vector candidates are not derived from blocks outside the MCTS. In HEVC, this may be enforced by turning off temporal motion vector prediction of HEVC, or by disallowing the encoder to use the temporal motion vector prediction (TMVP) candidate or any motion vector prediction candidate following the TMVP candidate in a motion vector candidate list for prediction units located directly left of the right tile boundary of the MCTS except the last one at the bottom right of the MCTS. In general, an MCTS may be defined to be a tile set that is independent of any sample values and coded data, such as motion vectors, that are outside the MCTS. An MCTS sequence may be defined as a sequence of respective MCTSs in one or more coded video sequences or alike. In some cases, an MCTS may be required to form a rectangular area. It should be understood that depending on the context, an MCTS may refer to the tile set within a picture or to the respective tile set in a sequence of pictures. The respective tile set may be, but in general need not be, collocated in the sequence of pictures. A motion-constrained tile set may be regarded as an independently coded tile set, since it may be decoded without the other tile sets. An MCTS is an example of an isolated region.
A non-VCL NAL unit may be for example one of the following types: a video parameter set (VPS), a sequence parameter set (SPS), a picture parameter set (PPS), an adaptation parameter set (APS), a supplemental enhancement information (SEI) NAL unit, a picture header (PH) NAL unit, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units might not be necessary for the reconstruction of decoded sample values.
Some coding formats specify parameter sets that may carry parameter values needed for the decoding or reconstruction of decoded pictures. An example of parameter sets is described in this paragraph, but it needs to be understood that embodiments apply to any other parameter set definitions and relations too. The relationship and hierarchy between video parameter set (VPS), sequence parameter set (SPS), and picture parameter set (PPS) may be described as follows. VPS resides one level above SPS in the parameter set hierarchy. VPS may include parameters that are common across all layers in the entire coded video sequence or describe relations between layers. SPS includes the parameters that are common and remain unchanged for all slices in a particular layer in the entire coded video sequence. In addition to the SPS parameters that may be needed by the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that may be important for buffering, picture output timing, rendering, and resource reservation. It may be possible to share an SPS by multiple layers. PPS includes the parameters that are common and remain unchanged for all slices of a coded picture and are likely to be shared by many coded pictures.
Many instances of parameter sets may be allowed in a bitstream, and each instance may be identified with a unique identifier. In order to limit the memory usage needed for parameter sets, the value range for parameter set identifiers has been limited. Each slice header (in HEVC) or each picture header (in VVC) includes the identifier of the picture parameter set that is active for the decoding of the picture that contains the slice or the picture, respectively, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices. Instead, it is sufficient that the active sequence and picture parameter sets are received at any moment before they are referenced, which allows transmission of parameter sets “out-of-band” using a more reliable transmission mechanism compared to the protocols used for the slice data. For example, parameter sets can be included as a media parameter in the session description for Real-time Transport Protocol (RTP) sessions. If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
Out-of-band transmission, signaling or storage can additionally or alternatively be used for other purposes than tolerance against transmission errors, such as ease of access or session negotiation. For example, a sample entry of a track in a file conforming to the ISO Base Media File Format may comprise parameter sets, while the coded data in the bitstream is stored elsewhere in the file or in another file. The phrase along the bitstream (e.g. indicating along the bitstream) may be used in claims and described embodiments to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
A parameter set may be activated by a reference from a slice or from another active parameter set or in some cases from another syntax structure. A parameter set may be activated when it is referenced e.g. through its identifier. For example, a header of an image segment, such as a slice header, may contain an identifier of the PPS (a.k.a. PPS ID) that is activated for decoding the coded picture containing the image segment. A PPS may contain an identifier of the SPS that is activated, when the PPS is activated. An activation of a parameter set of a particular type may cause the deactivation of the previously active parameter set of the same type. The parameters of an activated parameter set may be used or referenced in the decoding process.
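The activation mechanism described above can be illustrated with the following sketch of a parameter set registry, where parameter sets are stored by identifier when received and looked up when referenced. The class, the dictionary-based parameter set representation, and the "sps_id" key are assumptions made for this example only.

class ParameterSetStore:
    """A minimal registry illustrating activation of parameter sets by identifier."""

    def __init__(self):
        self.pps = {}   # pps_id -> parsed PPS (represented here as a dict)
        self.sps = {}   # sps_id -> parsed SPS

    def add_pps(self, pps_id: int, pps: dict):
        self.pps[pps_id] = pps

    def add_sps(self, sps_id: int, sps: dict):
        self.sps[sps_id] = sps

    def activate_for_picture(self, pps_id: int):
        """Activate the PPS referenced e.g. by a picture header and, through it,
        the SPS referenced by that PPS."""
        active_pps = self.pps[pps_id]                # raises KeyError if not yet received
        active_sps = self.sps[active_pps["sps_id"]]  # the PPS carries the SPS identifier
        return active_pps, active_sps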
An adaptation parameter set (APS) may be defined as a syntax structure that applies to zero or more slices. There may be different types of adaptation parameter sets. An adaptation parameter set may for example contain filtering parameters for a particular type of a filter. In VVC, three types of APSs are specified carrying parameters for one of: adaptive loop filter (ALF), luma mapping with chroma scaling (LMCS), and scaling lists. A scaling list may be defined as a list that associates each frequency index with a scale factor for the scaling process, which multiplies transform coefficient levels by a scaling factor, resulting in transform coefficients. In VVC, an APS is referenced through its type (e.g. ALF, LMCS, or scaling list) and identifier. In other words, different types of APSs have their own identifier value ranges.
A parameter set, such as an SPS, may include constraints that the video data in the scope of the parameter set (e.g. the coded video sequence where the SPS is referenced) complies with. In VVC, such constraints are included in the general constraints information syntax structure, which comprises e.g. the following syntax elements (where the OlsInScope variable specifies the layers which are in the scope of the syntax structure):
- gci_no_trail_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to TRAIL_NUT present in OlsInScope. gci_no_trail_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_stsa_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to STSA_NUT present in OlsInScope. gci_no_stsa_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_rasl_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to RASL_NUT present in OlsInScope. gci_no_rasl_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_radl_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to RADL_NUT present in OlsInScope. gci_no_radl_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_idr_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to IDR_W_RADL or IDR_N_LP present in OlsInScope. gci_no_idr_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_cra_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to CRA_NUT present in OlsInScope. gci_no_cra_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_gdr_constraint_flag equal to 1 specifies that sps_gdr_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_gdr_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_aps_constraint_flag equal to 1 specifies that there shall be no NAL unit with nuh_unit_type equal to PREFIX_APS_NUT or SUFFIX_APS_NUT present in OlsInScope, sps_ccalf_enabled_flag, sps_lmcs_enabled_flag, sps_scaling_list_enabled_flag, ph_num_alf_aps_ids_luma, ph_alf_cb_enabled_flag, and ph_alf_cr_enabled_flag for all pictures in OlsInScope shall all be equal to 0, and sh_num_alf_aps_ids_luma, sh_alf_cb_enabled_flag, sh_alf_cr_enabled_flag for all slices in OlsInScope shall be equal to 0. gci_no_aps_constraint_flag equal to 0 does not impose such a constraint. When no APS is referenced, it is still possible to set sps_alf_enabled_flag equal to 1 and use ALF.
- gci_one_tile_per_pic_constraint_flag equal to 1 specifies that each picture in OlsInScope shall contain only one tile, i.e., the value of NumTilesInPic for each picture shall be equal to 1 . gci_one_tile_per_pic_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_joint_cbcr_constraint_flag equal to 1 specifies that sps_joint_cbcr_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_joint_cbcr_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_sao_constraint_flag equal to 1 specifies that sps_sao_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_sao_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_alf_constraint_flag equal to 1 specifies that sps_alf_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_alf_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_ccalf_constraint_flag equal to 1 specifies that sps_ccalf_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_ccalf_constraint_flag equal to 0 does not impose such a constraint.
- gci_no_lmcs_constraint_flag equal to 1 specifies that sps_lmcs_enabled_flag for all pictures in OlsInScope shall be equal to 0. gci_no_lmcs_constraint_flag equal to 0 does not impose such a constraint.
Instead of or in addition to parameter sets at different hierarchy levels (e.g. sequence and picture), video coding formats may include header syntax structures, such as a sequence header or a picture header. A sequence header may precede any other data of the coded video sequence in the bitstream order. A picture header may precede any coded video data for the picture in the bitstream order.
In VVC, a picture header (PH) may be defined as a syntax structure containing syntax elements that apply to all slices of a coded picture. In other words, it contains information that is common for all slices of the coded picture associated with the PH. A picture header syntax structure is specified as an RBSP and is contained in a NAL unit.
In VVC, the inclusion of a subpicture identifier (a.k.a. subpicture ID) in slice headers is controlled by an SPS syntax element sps_subpic_info_present_flag. When sps_subpic_info_present_flag is equal to 1 in the referenced SPS, a subpicture identifier (i.e. syntax element sh_subpic_id) is indicated (e.g. by an encoder) in and/or decoded (e.g. by a decoder) from a slice header. sh_subpic_id specifies the subpicture identifier of the subpicture that contains the slice. sh_subpic_id is coded as a fixed-length unsigned integer, u(v), whose length is indicated in the referenced SPS.
When rectangular slices are in use in VVC, slice headers include the sh_slice_address syntax element, which is the slice index of the slice within the subpicture identified by sh_subpic_id.
In VVC, an SPS or PPS contains a list of subpicture identifier values, i.e. sps_subpic_id[ i ] or pps_subpic_id[ i ], respectively, for i in the range of 0 to the number of subpictures in the subpicture layout minus 1, inclusive. It may be indicated in the SPS that the list of subpicture identifier values is unchanged for the entire coded video sequence. If the SPS allows changes of the list of subpicture identifier values within a coded video sequence, pps_subpic_id[ i ], when present, applies to the picture and specifies the subpicture ID of the i-th subpicture. Decoders may use the list of subpicture identifier values to locate decoded subpictures in a correct position according to the subpicture layout.
In VVC, the PPS syntax includes a 1-bit syntax element (i.e. a flag) named pps_mixed_nalu_types_in_pic_flag. When equal to 1, pps_mixed_nalu_types_in_pic_flag specifies that each picture referring to the PPS has more than one VCL NAL unit and that the VCL NAL units do not have the same value of nal_unit_type. pps_mixed_nalu_types_in_pic_flag also imposes the following additional constraints: i) There shall be no VCL NAL unit of the picture that has nal_unit_type equal to GDR_NUT. ii) When a VCL NAL unit of the picture has nal_unit_type equal to nalUnitTypeA that is equal to IDR_W_RADL, IDR_N_LP, or CRA_NUT, other VCL NAL units of the picture shall all have nal_unit_type equal to nalUnitTypeA or TRAIL_NUT. pps_mixed_nalu_types_in_pic_flag equal to 0 specifies that each picture referring to the PPS has one or more VCL NAL units and the VCL NAL units of each picture referring to the PPS have the same value of nal_unit_type. All VCL NAL units of a subpicture have the same nal_unit_type value. Thus, the description and constraints discussed earlier for different picture types (e.g. IRAP pictures, RADL, RASL, trailing) apply on a subpicture basis whenever multiple subpictures are in use. Furthermore, the types and abbreviations may act as qualifiers for subpictures, e.g. IRAP subpicture or trailing subpicture.
The present embodiments enable merging of multiple video streams in a coded domain into a single video bitstream. In the present description the encoded input bitstreams are called “incoming bitstreams” and the merged (output) bitstream that is to be decoded is called “destination bitstream”. The merging, as discussed in the present specification, may be needed in use cases, such as, for example, 1 ) Multipoint Control Unit that mixes/merges media content coming from several endpoints; 2) Multiple cameras that capture visual content, after which individual captures are spatially merged to a single picture stream; 3) Game streaming; 4) Multiple incoming video elementary streams that are merged as subpictures or alike into an output video bitstream.
Case 4 in the previous paragraph may apply for example when an overlay video bitstream or alike is merged into the same output bitstream with a "background" video bitstream. An overlay may for example represent an advertisement, and the background video bitstream may represent the actual video content. In an embodiment, the merging is done by replacing one or more subpictures of the background video bitstream with one or more subpictures of the overlay video bitstream. In another embodiment, the merging is done by including both the background video bitstream and the overlay video bitstream as non-overlapping sets of subpictures in the output video bitstream, and including rendering instructions in or along the output video bitstream to display the overlay on a desired position on top of the background video.
Figure 2 illustrates an example of multipoint video conferencing, comprising the Multipoint Control Unit (MCU) merging the content from different endpoints A, B, C, representing the use case example 1, above. In the example of Figure 2, endpoints A, B, C are user devices, each of which is configured to run an RTP point-to-point session between itself and the MCU. The content that the MCU provides to each participant (i.e. endpoints A, B, C) is either a selection of the content received from the other endpoints, or a mixed aggregate of what the MCU receives from the other point-to-point paths, which are part of the same communication session.
Use case example 2, above, may relate to e.g. a surveillance system having multiple cameras, where the multiple surveillance cameras connect to a transmission hub that spatially merges the feeds provided by the surveillance cameras.
In use case example 3, above, a video stream may be a spatial mix of one or more viewpoints of the game, optionally also supplemented with a camera feed of the player. The merging of multiple encoded video sources may be performed e.g. in a transmitting device or in a server.
The merging as discussed in use case example 4, above, can be done e.g. in a client device, e.g. in a video decoding interface implementation that formats the input bitstreams for a single video decoder instance for decoding. This may be a function of “input formatting” in the architecture for the MPEG Video Decoding Interface specification, ISO/IEC 23090-13, currently having working draft status. Figure 3 illustrates the video decoding engine and interfaces, according to ISO/IEC 23090-13. In the example of Figure 3, multiple elementary streams and metadata streams are received as an input by the video decoding engine. The input streams are decoded and formatted into multiple output streams relating to the metadata and the decoded output sequence.
The above-described use cases may be alternatively or additionally realized using the NBMP architecture. The use cases above include several input video bitstreams or several video sources, each of which may be regarded as a Media Source in the NBMP architecture. Merging of the input bitstreams spatially into an output bitstream may be regarded as one or more Tasks performed by the Media Processing Entity, and the output bitstream may be considered to be the Media flow to the Media Sink in the NBMP architecture.
It is worth noting that even if the same device encodes multiple video sources that are intended to be merged into the same destination bitstream, it may be chosen to encode the video sources into separate bitstreams, e.g. for the following reasons:
- It is envisioned that there can be video encoder implementations that do not provide support for isolated regions. For example, there may be VVC encoder implementations that do not provide support for multiple subpictures at all or do not provide support for multiple subpictures whose boundaries are treated like picture boundaries.
- Parallel encoding, e.g. in separate processing cores, can be more easily achieved when encoding the video sources with separate encoder instances compared to encoding with a single encoder instance with multiple subpictures. When running multiple encoder instances, the encoder implementation does not need to include handling of separate processes or threads per each subpicture.
An alternative to merging in the coded domain is to decode the input bitstreams, arrange the decoded videos spatially into an uncompressed video sequence, and then encode the uncompressed video sequence. However, this kind of approach may increase end-to-end delay, and may cause picture quality degradation, since the content is encoded twice (first at the source, and then when the spatial composition is encoded).
Merging in the compressed domain incurs at least the following challenges:
- Encoding needs to be coordinated. For example, it may be necessary to apply the same inter prediction hierarchies and the same reference picture sets when the same picture rate is used in different encoders. Furthermore, it may be required to have the same assignment of picture types in time-aligned pictures between different encodings. Moreover, encoding may need to be constrained so that merging as isolated regions into the same bitstream is possible, which may require e.g. constraining the motion vectors and other constraints similarly to what was described earlier in relation to HEVC MCTS.
- H.266/VVC relaxes some encoding constraints, which enables merging pictures from different bitstreams as subpictures in a destination bitstream. For example:
o Subpicture boundaries can be treated like picture boundaries, hence no motion constraints are needed in encoding
o Subpictures of different VCL NAL unit types may be mixed (in a constrained manner)
However, H.266/VVC still has constraints that hinder merging pictures from different bitstreams as subpictures in a destination bitstream. For example:
o All slices of a picture must have the same Temporalld (i.e. represent the same temporal sublayer)
o All slices of a picture must indicate the same set of reference pictures
o Coded pictures must be complete, i.e. coded data for all CTUs of a coded picture must be present, or equivalently each subpicture must be present in the coded picture
o A picture has a single picture order count (POC)
- Sources might not be synchronized, e.g. they might not capture pictures exactly at identical times, might not capture at the same picture rate, might have encoding delay variation, and/or may experience different delay or delay variations in their uplink channel to the media mixer.
- Since pictures must be complete according to some video coding standards (e.g. H.264/AVC, H.265/HEVC, and H.266/VVC), synchronizing content for mixing may incur additional delays.
The present embodiments comprise two main aspects: 1) signaling for coordinating encoders; and 2) operation of a media mixer. These are discussed separately in the following:
1. Signaling for coordinating encoders
In this aspect, an entity creates and transmits a control signal to control more than one video encoder. Alternatively, the control signal enables controlling one video encoder providing several separate media streams.
The entity that creates and transmits the control signal may for example be the entity that also merges bitstreams created by video encoder(s) in coded domain to another video bitstream. In this description a term “media mixer” is used for describing such entity.
The control signal is created so as to enable merging of bitstreams in the coded domain. This requires the encoders to use e.g. the same inter prediction hierarchies and reference picture sets at the same picture rate, the same assignment of picture types, and motion constraints (e.g. HEVC MCTS), just to mention a few. Therefore, the control signal may include but is not limited to one or more of the following:
- Reference picture set to be used for all pictures, e.g. specified as the number of reference pictures and their differential picture order count values relative to a current picture;
- Signal to use temporally forward prediction only, i.e. no temporally bidirectional prediction; or signal the picture prediction hierarchy and the sublayer hierarchy to be used;
- Relation of picture order count differences to picture time interval
o E.g. Signal that a POC interval of 1 corresponds to one picture interval at 60Hz picture rate, i.e. to 1/60 seconds
- Picture types or VCL NAL unit types that are allowed
o E.g. Signal that only IDR (Instantaneous Decoding Refresh), STSA (Stepwise Temporal Sublayer Access) and TRAIL VCL NAL unit types are allowed
- Picture width and height (in luma samples);
- The PPS (Picture Parameter Set) ID (Identifier) referenced by coded video data. In VVC, the PPS ID is included in picture headers (as the ph_pic_parameter_set_id syntax element);
- Subpicture ID value to be included in slice headers;
- Constraints on which coding tools are disabled and/or which signaling options are used.
o For example, all coding tools that are controlled in the picture header in a manner that cannot be overridden in the slice header or slice data may be controlled with constraints for encoding.
o In another example, all coding tools that could be adaptively turned on or off and/or adaptively parameterized, e.g. based on the picture content, may be controlled with constraints for encoding.
o For example, constraints for encoding may include, but are not limited to, one or more of the following:
■ Constraint to turn off adaptive loop filtering (ALF).
■ Constraint to use the default (pre-defined) ALF filters only.
■ Constraint to use an indicated set of ALF filters (provided to the encoder), or the pre-defined ALF filters.
■ Constraint to generate and use only ALF APSs with certain ALF APS ID value(s) indicated to the encoder, and to refrain from using other ALF APS ID values.
■ Constraints to turn off luma mapping with chroma scaling (LMCS) or to use an indicated set of LMCS parameters.
■ Constraints to turn off scaling lists, or to use an indicated set of scaling lists.
■ Constraints to use a particular value of joint chroma coding residual sign flag (i.e. ph_joint_cbcr_sign_flag in VVC).
■ Constraints to turn off sample adaptive offset (SAO) filtering, or to use the same indicated SAO parameters.
■ Constraint to use only a single tile in a coded picture, or to use an indicated tile grid and/or indicated tile width and height.
■ Constraint to use a single subpicture in a coded picture and to indicate a subpicture ID value in each slice header, together with the subpicture ID value to be included in the slice headers.
■ Constraint to include signaling a set of reference pictures or reference picture list(s) in VCL NAL units where it would be optional, such as in the VCL NAL unit types of IDR (both with and without RADL pictures) in VVC.
In an embodiment, the control signal is formatted for example as one or more optional media parameters (a.k.a. MIME parameters).
In an embodiment, an offerer includes the control signal within one or more optional media parameters in SDP offer/answer negotiation. For example, the offerer may comprise or be operationally connected with a media mixer and may provide an initial offer to the answerer. The initial offer may comprise, e.g. for the recvonly direction one or more media formats, which may be constrained by the optional media parameters conveying the control signal. The answerer may comprise or be operationally connected with a source video encoder. The answerer may respond to the initial offer by selecting one of the offered media formats in the initial offer.
In an embodiment, an answerer includes the control signal within one or more optional media parameters in SDP offer/answer negotiation. For example, the offerer may comprise or be operationally connected with a source video encoder and may provide an initial offer to the answerer. The initial offer may comprise one or more media formats e.g. for the sendonly direction. The answerer may comprise or be operationally connected with a media mixer. The answerer may respond to the initial offer by including, e.g. for the recvonly direction, one or more media formats of the initial offer constrained by the optional media parameters conveying the control signal.
In an embodiment, media parameters carrying the control signal may comprise one or more of the following (an illustrative sketch of conveying such parameters is given after the list). It needs to be understood that embodiments similarly apply to any variations of the described media parameters, such as using whitespace-separated lists instead of comma-separated lists, or base16 (hexadecimal) representation instead of base10 or base64 representation.
- ref-pics: a comma-separated list of differential picture order count values relative to a current picture, each list entry identifying a reference picture that may be used as reference for predicting the current picture. Pictures that are not identified by the ref-pics list are not used as reference for the current picture. The list entries are signed integers in base10 (decimal) format.
- temporally-forward-prediction: when present, indicates that the decoding order is the same as output order and only temporally forward prediction is applied.
- poc-tick-time: when present, indicates that the picture order count value for each picture in a coded video sequence (CVS) that is not the first picture in the CVS, in decoding order, is proportional to the output time of the picture relative to the output time of the first picture in the CVS. The parameter has one unsigned integer value in base10 (decimal) format, which specifies the number of clock ticks of the 90 kHz clock that each picture order count difference of 1 corresponds to.
- pps-id: a comma-separated list of PPS ID values that may be referenced by coded video data. PPS ID values that are not among the pps-id list are not present in the stream. The list entries are unsigned integers in base10 (decimal) format.
- subpicture-id: when present, indicates that each picture contains a single subpicture only. The parameter has one unsigned integer value in base10 (decimal) format, which specifies the subpicture identifier value in all slices of the stream.
- general-constraints: a base64 representation of the general constraints information syntax structure imposing constraints on the coded video data
- picture-header: a base64 representation of the picture header syntax structure; the indicated picture header is used in all coded pictures with an exception of certain syntax elements, such as syntax element(s) indicating the POC value for a picture (e.g. ph_pic_order_cnt_lsb in H.266/WC).
- vcl-nal-unit-types: a comma-separated list of NAL unit type values as specified in the coding format (e.g. nal_unit_type values of VVC) that may be present in the stream. VCL NAL units with a NAL unit type that is not included among the vcl-nal-unit-types list are not present in the stream. The list entries are unsigned integers in base10 (decimal) format.
- alf: may have one of the following values: "off" indicating that ALF is not in use; "default" indicating that only the default filters pre-defined in a coding standard (e.g. VVC) are in use and the coded video data does not reference any ALF APS; or a comma-separated list of unsigned integer values in base10 (decimal) format, each providing an ALF APS ID value that may be referenced by the coded video data, and no ALF APS with an ID not among the list is referenced by the coded video data.
- joint-cbcr-sign-flag: equal to 0 or 1 , and providing the value for ph_joint_cbcr_sign_flag for the coded video data.
- lmcs-off, scaling-list-off, and sao-off: when present, indicating that LMCS, scaling lists, and SAO, respectively, are disabled in the stream. Other coding tools or features may be disabled similarly with other corresponding parameters.
- single-tile: when present, indicating that each picture contains only one tile.
- single-subpicture: when present, indicating that each picture contains only one subpicture.
- idr-rpl: when present, indicating that all IDR VCL NAL units include signaling of reference picture lists.
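As a non-authoritative illustration of how the above media parameters could be conveyed, the following sketch formats a selection of them as an SDP fmtp attribute line. The payload type number, the helper function, and the exact parameter values are assumptions; the nal_unit_type values in the comment are assumed to correspond to TRAIL, STSA and IDR in VVC.

def control_signal_to_fmtp(payload_type: int, params: dict) -> str:
    """Format control-signal media parameters as an SDP 'a=fmtp' attribute line.

    The parameter names follow the list above; packaging them into an fmtp
    line and the separator convention are assumptions made for illustration."""
    parts = []
    for name, value in params.items():
        if value is None:               # presence-only parameters, e.g. single-tile
            parts.append(name)
        else:
            parts.append(f"{name}={value}")
    return f"a=fmtp:{payload_type} " + "; ".join(parts)

# Example use with the control-signal values exemplified further below.
line = control_signal_to_fmtp(96, {
    "ref-pics": "-1,-2,-3,-4",
    "temporally-forward-prediction": None,
    "poc-tick-time": "1500",            # 90000 / 60 Hz = 1500 clock ticks per POC step
    "vcl-nal-unit-types": "0,1,7,8",    # assumed TRAIL, STSA, IDR_W_RADL, IDR_N_LP values
})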
Examples of the control signal to the encoders can comprise the following (a sketch of a corresponding in-memory structure is given after the list):
- number of reference pictures = 4;
- differential picture order count values = -1 , -2, -3, -4;
- signal to use temporally forward prediction only;
- signal that POC interval of 1 corresponds to picture interval at 60Hz picture rate;
- signal that only IDR, STSA and TRAIL VCL NAL unit types are allowed.
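A minimal sketch of an in-memory representation of the control signal exemplified above is given below; the class and field names are hypothetical and merely mirror the listed example values.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EncoderControlSignal:
    """A hypothetical in-memory form of the control signal exemplified above."""
    num_reference_pictures: int = 4
    delta_poc_of_references: List[int] = field(default_factory=lambda: [-1, -2, -3, -4])
    temporally_forward_prediction_only: bool = True
    poc_interval_seconds: float = 1.0 / 60   # a POC difference of 1 equals one 60 Hz picture interval
    allowed_vcl_nal_unit_types: List[str] = field(default_factory=lambda: ["IDR", "STSA", "TRAIL"])

# The media mixer could send the same signal to every source encoder it controls:
signal = EncoderControlSignal()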
Figure 4 illustrates an example of encoding structures. Reference sign I refers to IDR picture type; T refers to TRAIL picture type; and RPL refers to reference picture list. From the list of encoding structures according to Figure 4, the encoders can choose any structure obeying the control signal. The examples of the encoding structures in Figure 4 comprise 60-Hz hierarchical encoding and 60-Hz “linear” encoding, where only the delta POC -1 is an active reference picture, but delta POCs -2, -3 and -4 are included as inactive entries in RPL(s). The examples of the encoding structures in Figure 4 also comprise 30-Hz hierarchical encoding and 30-Hz “linear” encoding, where only the delta POC -2 is an active reference picture, but delta POCs -1, -3 and -4 are included as inactive entries in RPL(s).

2. Operation of a media mixer
In this description, term “media mixer” is used for any entity that merges bitstreams created by video encoders in coded domain to another video bitstream. It is, however, appreciated that the term “media mixer” is selected for its descriptive nature.
It is remarked that a media mixer may operate in various entities, including but not limited to the following:
- The media mixer can be included in the originating endpoint device that generates and/or captures the bitstreams, but also encodes them. This could be beneficial e.g. in the game streaming use case.
- The media mixer can be included in a network entity separate from any of the originating or destination endpoint devices. For example, the media mixer may run in a multipoint control unit. In another example, the media mixer may be included in a Media Processing Entity (MPE) and/or may comprise one or more Tasks in the NBMP architecture.
- The media mixer can be included in a destination endpoint device that - after having received the bitstream - also decodes the bitstream and may render or display the decoded video resulting from the bitstream.
A media mixer may support any transport protocols, such as RTP and HTTP, and any session description and/or initialization protocols, such as SDP. A media mixer may be an RTP mixer.
The media mixer merges input bitstreams (a.k.a. incoming bitstreams) into one output bitstream (a.k.a. destination bitstream). Each input bitstream corresponds to a certain subpicture in the subpicture layout of the output bitstream, or in other words, to a sequence of collocated subpictures within the coded pictures of the output bitstream.
The operation of a media mixer includes the following embodiments: POC adjustment; and Handling of “missing” pictures in incoming bitstreams.
POC adjustment: POC adjustment may be needed by the media mixer, since POC values of the incoming bitstreams may not be aligned. The reason(s) for not having aligned POC values may include, but is not necessarily limited to, one or more of the following:
- source encoders have not started the encoding exactly at the same time;
- source encoders have included IDR (or other IRAP) pictures at different positions, and IDR (or other IRAP) picture may reset the POC values;
- source encoders might have had different schemes for assigning picture intervals in time to POC value differences;
- different source video clips might have had different picture rates.
Adjustment of POC values becomes possible e.g. when the following controls of the control signal are used:
- reference picture set to be used for all pictures, e.g. specified as the number of reference pictures and their differential picture order count values relative to a current picture
- signal to use temporally forward prediction only, i.e. no temporally bidirectional prediction.
In this embodiment, the media mixer assigns POC values into the destination bitstream. In other words, it overwrites the POC values of the incoming bitstreams.
The above-mentioned control signals enable the POC value overwriting, since the reference picture set and the decoding order of pictures with respect to output order are invariant in all pictures of the incoming bitstreams. Thus, even if the source encoders were not synchronized or used a content-adaptive way to determine between IRAP and trailing pictures, the incoming bitstreams can be merged as subpictures in a destination bitstream.
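The POC adjustment described above can be sketched as follows, assuming the control signal enforces temporally forward prediction and a fixed reference picture set so that a single destination POC counter can be advanced per merged output picture. The class name, the counter handling, and the maximum POC LSB value are assumptions made for illustration.

class PocRewriter:
    """A minimal sketch of POC adjustment in the media mixer."""

    def __init__(self, max_poc_lsb: int = 1 << 8):
        self.destination_poc = -1
        self.max_poc_lsb = max_poc_lsb   # 2^(log2 of max POC LSB), an assumed value

    def next_output_picture(self) -> int:
        """Advance the destination POC once per merged output picture."""
        self.destination_poc += 1
        return self.destination_poc

    def poc_lsb_for_slices(self) -> int:
        """Value with which the mixer overwrites e.g. ph_pic_order_cnt_lsb for
        every subpicture merged into the current output picture."""
        return self.destination_poc % self.max_poc_lsb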
Handling of “missing” pictures in incoming bitstreams
A picture may appear missing from an incoming bitstream from media mixer’s perspective e.g. due to the following reasons:
- different source video clips and/or source encoders may use different picture rates;
- a picture from an incoming bitstream may be delayed e.g. due to transmission delay variation;
- encoders might start encoding at different times.
Consequently, there may be no coded video content available in an input bitstream to be merged as a subpicture.
In this embodiment, a media mixer is configured to include a “not-coded” subpicture into the destination bitstream when there is no corresponding picture available in the corresponding incoming bitstream (see the sketch following the list below).
The “not-coded” subpicture may for example be
- an IDR subpicture (e.g. with a constant color such as mid-gray). This may be used particularly if the destination picture is the first picture of the destination bitstream.
- a trailing subpicture that is encoded so that it is reconstructed by just copying the collocated subpicture from the previous picture (motion vectors equal to 0, no prediction error).
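The handling of missing pictures described above can be sketched with the following decision function; the function name, the string markers, and the representation of the incoming picture are hypothetical, and generation of the actual not-coded coded data is not shown.

def select_subpicture(incoming_picture, is_first_destination_picture: bool):
    """Decide what to place in a subpicture slot of the destination picture.

    'incoming_picture' is the time-aligned coded picture from the corresponding
    incoming bitstream, or None when no picture is available; the returned
    markers stand for coded data generated elsewhere in the mixer."""
    if incoming_picture is not None:
        return ("merge", incoming_picture)
    if is_first_destination_picture:
        # Not-coded IDR subpicture, e.g. a constant mid-gray intra subpicture.
        return ("not-coded-idr", None)
    # Not-coded trailing subpicture: zero motion, no prediction error, so the
    # collocated subpicture of the previous picture is effectively copied.
    return ("not-coded-trail", None)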
Figure 5 illustrates an example where the source encoders are controlled like in the first aspect, but start encoding at different times. Both encoders use only one sublayer, different prediction hierarchies, and indicate reference picture lists containing POCs -1 , -2, -3 and -4, but different pictures being the active entries.
The media mixer may handle this by creating “not-coded” subpictures as described above and as illustrated in Figure 6. In the example of Figure 6, “i” refers to not- coded IDR subpicture, and “t” refers to not-coded trailing subpicture.
In an embodiment, where a coded picture in a destination bitstream has mixed VCL NAL unit types and the previous PPS in the destination bitstream with the reference PPS ID indicates no mixing, or vice versa, the PPS is rewritten (with pps_mixed_nalu_types_in_pic_flag set to an opposite value compared to that in the previous PPS ID) and included in the destination bitstream.
Flattening the sublayer structure
When encoders have been controlled in a manner that allows them to use any number of sub-layers, the respective pictures in incoming bitstreams may be labeled with a different Temporalld (TID). However, all slices of a picture in the destination bitstream must have the same TID. This issue could be partly avoided by signalling the picture prediction hierarchy and the sublayer hierarchy to be used as a part of the control signal. However, even so, since encoders might start encoding at different times or might reset the use of the hierarchies when there is a scene cut (which may appear at different times in different source video clips), this embodiment may be helpful.
However, by allowing only a single sublayer to be used, the above issue could be avoided.
According to an embodiment, the media mixer is configured to assign all VCL NAL units to sublayer 0, and rewrites VPS(s) and SPS(s) accordingly.
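A minimal sketch of the sublayer flattening is given below, assuming the two-byte VVC NAL unit header layout in which the three least significant bits of the second byte carry nuh_temporal_id_plus1; the function name is hypothetical and the VPS/SPS rewriting is not shown.

def flatten_to_sublayer_zero(nal_unit: bytes) -> bytearray:
    """Rewrite the second byte of a VVC NAL unit header so that
    nuh_temporal_id_plus1 equals 1, i.e. Temporalld equals 0."""
    out = bytearray(nal_unit)
    out[1] = (out[1] & 0xF8) | 0x01   # keep the nal_unit_type bits, set tid_plus1 to 1
    return out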
Handling of two pictures in an incoming bitstream in one time interval
When there are encoding or transmission delay variations, two pictures of an incoming encoded bitstream may arrive in the media mixer within an interval where only one subpicture is created to a destination bitstream.
In an embodiment, when the incoming encoded bitstream involves multiple temporal sublayers, the picture in the higher temporal sublayer is omitted. Subsequent pictures of that higher temporal sublayer are not included in the destination bitstream until there is an IRAP picture or an STSA picture in that higher temporal sublayer in the incoming bitstream.
Then, according to an embodiment, the media mixer requests a refresh of the temporal sublayer from which the picture has been omitted. For example, temporal sublayer refresh request can be added among the Codec Control Messages.
Thereafter, in an embodiment, an encoder (or a device running an encoder) receives a temporal sublayer refresh request and, in response to the request, encodes a sublayer access picture (e.g. STSA in VVC).
Additional steps in the media mixer
The media mixer may additionally perform one or more of the following (an illustrative filtering sketch is given after the list):
- discarding all parameter set and picture header NAL units of the input bitstreams except for those of the first subpicture that are used as the basis for the merged stream;
- writing subpicture information to the SPS(s) of the merged stream;
- writing picture partitioning information (to tiles and slices) to the PPS(s) of the merged stream;
- changing picture headers (when used in input bitstreams) to be coded as PH NAL units, if they were originally coded within slice header;
- writing subpicture IDs to the slice headers, if the slice header did not originally contain them;
- discarding all SEI NAL units.;
- taking all other non-VCL NAL units from the bitstream of the first subpicture only.
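A sketch of this NAL unit selection is given below; the (kind, payload) tuples are a hypothetical representation of parsed NAL units, and the parameter set rewriting steps are only indicated in the comment.

```python
def merge_non_vcl(streams):
    """Illustrative selection of NAL units for the merged bitstream.

    'streams' is a list per input bitstream of (kind, payload) tuples, where
    'kind' is e.g. "VPS", "SPS", "PPS", "PH", "SEI" or "VCL". Real merging would
    additionally rewrite the kept SPS/PPS (subpicture layout, tiles and slices)
    and write subpicture IDs into the slice headers."""
    merged = []
    for index, stream in enumerate(streams):
        for kind, payload in stream:
            if kind == "SEI":
                continue                        # SEI NAL units are discarded
            if kind in ("VPS", "SPS", "PPS", "PH") and index != 0:
                continue                        # parameter sets / PH only from the first stream
            if kind == "VCL" or index == 0:
                merged.append((kind, payload))  # all VCL units; other non-VCL only from the first stream
    return merged
```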
The method according to an embodiment is shown in Figure 8. The method generally comprises creating 710 a control signal for controlling encoding of a bitstream; transmitting 720 the control signal to more than one encoder; receiving 730 a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging 740 encoded input bitstreams into one output bitstream; and providing 750 the output bitstream comprising the merged encoded input bitstreams for decoding. Each of the steps can be implemented by a respective module of a computer system.
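For illustration, steps 710 to 750 might be arranged as in the following sketch; the encoder interface, the control signal contents and the merge/forward callables are hypothetical and merely stand in for the functionality described above.

```python
def run_media_mixer(encoders, merge, forward):
    """End-to-end sketch of the method of Figure 8; all interfaces are hypothetical.

    Each encoder exposes send_control() and receive_stream(); 'merge' combines the
    encoded input bitstreams into one output bitstream and 'forward' provides the
    merged bitstream for decoding (e.g. transmits it to the receiving endpoints)."""
    control = {                                  # 710: create the control signal
        "forward_prediction_only": True,         # example control signal items only
        "picture_size": (640, 360),
        "pps_id": 0,
    }
    for enc in encoders:                         # 720: transmit it to more than one encoder
        enc.send_control(control)
    streams = [enc.receive_stream() for enc in encoders]  # 730: receive the encoded media streams
    merged = merge(streams)                      # 740: merge the input bitstreams
    forward(merged)                              # 750: provide the output bitstream for decoding
```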
An apparatus according to an embodiment comprises means for creating a control signal for controlling encoding of a bitstream; means for transmitting the control signal to more than one encoder; means for receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; means for merging encoded input bitstreams into one output bitstream; and means for providing the output bitstream comprising the merged encoded input bitstreams for decoding. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
An example of a data processing system for an apparatus is illustrated in Figure 9. Several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor if desired. The data processing system comprises a main processing unit 100, a memory 102, a storage device 104, an input device 106, an output device 108, and a graphics subsystem 110, which are all connected to each other via a data bus 112.
The main processing unit 100 is a conventional processing unit arranged to process data within the data processing system. The main processing unit 100 may comprise or be implemented as one or more processors or processor circuitry. The memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art. The memory 102 and the storage device 104 store data within the data processing system.
Computer program code resides in the memory 102 for implementing, for example, a method as illustrated in the flowchart of Figure 8 according to various embodiments. The input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display. The data bus 112 is a conventional data bus and, while shown as a single line, it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.
Figure 10 illustrates an example of a video encoder, where In: Image to be encoded; P’n: Predicted representation of an image block; Dn: Prediction error signal; D’n: Reconstructed prediction error signal; I’n: Preliminary reconstructed image; R’n: Final reconstructed image; T, T-1: Transform and inverse transform; Q, Q-1: Quantization and inverse quantization; E: Entropy encoding; RFM: Reference frame memory; Pinter: Inter prediction; Pintra: Intra prediction; MS: Mode selection; F: Filtering. Figure 11 illustrates a block diagram of a video decoder where P’n: Predicted representation of an image block; D’n: Reconstructed prediction error signal; I’n: Preliminary reconstructed image; R’n: Final reconstructed image; T-1: Inverse transform; Q-1: Inverse quantization; E-1: Entropy decoding; RFM: Reference frame memory; P: Prediction (either inter or intra); F: Filtering. An apparatus according to an embodiment may comprise only an encoder or a decoder, or both.
The various embodiments may provide advantages. For example, compared to decoding and re-encoding in a media mixer, the present embodiments provide lower delay, since they avoid decoding and re-encoding delays, and they require less computational and memory resources in the media mixer. Consequently, the media mixer can support a higher number of endpoints. The media mixer only needs to rewrite parameter sets (SPS, PPS); no rewriting of VCL NAL units is needed. In addition, the present embodiments provide better picture quality, since there is no quality degradation caused by re-encoding.
The present embodiments require only one decoder instance as opposed to a decoder instance for each incoming bitstream.
The present embodiments may also enable less total bandwidth to be used in the network, for the following reasons:
- In a multipoint conferencing scenario or the like, the embodiments require less total bandwidth in the network and fewer connections/sessions compared to transmitting each source bitstream to each endpoint (illustrated in Figure 7). In the present embodiments, each source bitstream is sent as only one copy to the MCU, and the MCU forwards only one bitstream to each endpoint. It is to be noted that the present embodiments could also be combined with the architecture of Figure 7, without an MCU, by performing the media mixing locally in each endpoint.
Some embodiments were described with reference to an entity that both creates a control signal for controlling encoding of a bitstream and merges encoded input bitstreams into one output bitstream. It needs to be understood that embodiments similarly apply when a first entity creates a control signal for controlling encoding of a bitstream and a second entity merges encoded input bitstreams into one output bitstream, wherein the first entity and the second entity are coordinated by a third entity. For example, the third entity may be an NBMP Source, which controls a Media Processing Entity (MPE) through an NBMP Workflow Manager, wherein the MPE contains one or more Task(s) that merge encoded input bitstreams into one output bitstream, and the third entity creates a control signal for controlling a Media Source that encodes a bitstream that is provided as input to the MPE.
Some embodiments were described in relation to SDP and/or SDP offer/answer negotiation. It needs to be understood that embodiments apply similarly to other media description formats or protocols and/or other session negotiation models or protocols. Some embodiments were described in relation to H.266/VVC and its coding tools or features. It needs to be understood that embodiments apply similarly to other video codecs and formats, and/or other coding tools and features. For example, some embodiments described with reference to H.266/VVC subpictures apply to image segments constrained similarly to H.266/VVC subpictures.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises creating a control signal for controlling encoding of a bitstream; transmitting the control signal to more than one encoder; receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal; merging encoded input bitstreams into one output bitstream; and providing the output bitstream comprising the merged encoded input bitstreams for decoding.
A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with one another. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims

1. An apparatus comprising:
- means for creating a control signal for controlling encoding of a bitstream;
- means for transmitting the control signal to more than one encoder;
- means for receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal;
- means for merging encoded input bitstreams into one output bitstream; and
- means for providing the output bitstream comprising the merged encoded input bitstreams for decoding.
2. The apparatus according to claim 1, wherein an input bitstream corresponds to a subpicture in the output bitstream.
3. The apparatus according to claim 1 or 2, wherein the control signal comprises one or more of the following:
- reference picture set for all pictures;
- signal to use temporally forward prediction only;
- signal the picture prediction hierarchy;
- relation of picture order differences to picture time interval;
- allowed picture types or Video Coding Layer Network Abstraction Layer unit types;
- picture width and height;
- picture parameter set identifier value referenced by coded video data;
- subpicture identifier value to be included in slice headers; and
- constraints on which coding tools are disabled.
4. The apparatus according to any of the claims 1 to 3, further comprising means for assigning picture order count values into an output bitstream by overwriting the picture order count values of the input bitstream.
5. The apparatus according to any of the claims 1 to 4, further comprising means for including a not-coded subpicture into the output bitstream, where there is no corresponding picture in the corresponding input bitstream.
6. The apparatus according to claim 5, wherein the not-coded subpicture may be one of the following: an IDR subpicture, or a trailing subpicture that is reconstructed by copying the collocated subpicture from the previous picture.
7. The apparatus according to any of the claims 1 to 6, further comprising means for signalling a picture prediction hierarchy and a sublayer hierarchy to be used as a part of the control signal.
8. The apparatus according to any of the claims 1 to 7, further comprising means for determining that an encoded media stream comprises multiple temporal sublayers, whereupon the apparatus is caused to omit a picture in the higher temporal sublayer from the output bitstream and to request a refresh of the temporal sublayer from which the picture was omitted.
9. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- create a control signal for controlling encoding of a bitstream;
- transmit the control signal to more than one encoder;
- receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal;
- merge encoded input bitstreams into one output bitstream; and
- provide the output bitstream comprising the merged encoded input bitstreams for decoding.
10. A method, comprising:
- creating a control signal for controlling encoding of a bitstream;
- transmitting the control signal to more than one encoder;
- receiving a plurality of encoded media streams, which have been encoded according to the transmitted control signal;
- merging encoded input bitstreams into one output bitstream; and
- providing the output bitstream comprising the merged encoded input bitstreams for decoding.
11. The method according to claim 10, wherein an input bitstream corresponds to a subpicture in the output bitstream.
12. The method according to claim 10 or 11, wherein the control signal comprises one or more of the following:
- reference picture set for all pictures;
- signal to use temporally forward prediction only;
- signal the picture prediction hierarchy;
- relation of picture order differences to picture time interval;
- allowed picture types or Video Coding Layer Network Abstraction Layer unit types;
- picture width and height;
- picture parameter set identifier value referenced by coded video data;
- subpicture identifier value to be included in slice headers; and
- constraints on which coding tools are disabled.
13. The method according to any of the claims 10 to 12, further comprising assigning picture order count values into an output bitstream by overwriting the picture order count values of the input bitstream.
14. The method according to any of the claims 10 to 13, further comprising including a not-coded subpicture into the output bitstream, where there is no corresponding picture in the corresponding input bitstream.
15. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system:
- to create a control signal for controlling encoding of a bitstream;
- to transmit the control signal to more than one encoder;
- to receive a plurality of encoded media streams, which have been encoded according to the transmitted control signal;
- to merge encoded input bitstreams into one output bitstream; and
- to provide the output bitstream comprising the merged encoded input bitstreams for decoding.