WO2010069427A1

WO2010069427A1 - Method and encoder for providing a tune-in stream for an encoded video stream and method and decoder for tuning into an encoded video stream

Info

Publication number: WO2010069427A1
Application number: PCT/EP2009/007649
Authority: WO
Inventors: Harald Fuchs; Stefan DÖHLA; Ulf Jennehag
Original assignee: Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date: 2008-12-19
Filing date: 2009-10-26
Publication date: 2010-06-24

Abstract

For an encoded video stream (200) a tune-in stream (202) is provided. The encoded video stream comprises a plurality of intra-coded pictures and a plurality of inter-coded pictures, wherein each picture comprises a plurality of macroblocks. The encoded video stream (200) comprises a plurality of frames, and the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames. The tune-in stream is provided and comprises a plurality of tune-in pictures (202₁-202₄), wherein a tune-in picture (202₁-202₄) is provided for a frame of the encoded video stream (200) that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture (202₁-202₄) comprises the remaining intra-coded macroblocks of the intra-coded picture. Also, an encoder for providing such a tune-in stream as well as a method and a decoder for tuning into an encoded video stream are described.

Description

Method and Encoder for Providing a Tune-In Stream for an Encoded Video Stream and Method and Decoder for Tuning into an Encoded Video Stream

Embodiments of the invention relate to the field of providing tune-in streams for allowing a fast tune-in into an encoded video stream, for example for allowing a fast channel change based on a tune-in stream. More specifically, embodiments of the invention relate to a method for providing a tune-in stream for an encoded video stream and an associated encoder as well as to a method for tuning into an encoded video stream and an associated decoder. In particular, embodiments of the invention provide gradual tune-in pictures for a fast tune-in or channel change.

In the old days of analogue TV, channel change was instantaneous and not an issue. With the advent of the current digital TV broadcast systems, and especially newer IPTV multicast systems, much higher tune-in times in the range of seconds were observed, which led to a degraded user experience (see e.g. H. Fuchs and N. Färber, "Optimizing channel change time in IPTV applications", 2008 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, April 2008). Several solutions were proposed, mainly in industry-driven standardization bodies and to a lesser extent in the science community. The solutions range from video coding improvements to feedback-based and server-based solutions. A good overview is provided in the above-cited paper by H. Fuchs and N. Färber, which also briefly describes the delay contributors that dictate the tune-in time into a new channel. The most significant and ubiquitous source of delay is inherent in today's digital video coding schemes, which do not allow decoding to start at an arbitrary point in time.

Current digital video coding schemes like MPEG-2 video and H.264 are based on differential video coding. In these schemes coding efficiency is substantially increased through predictive coding (in so-called P-frames and B-frames) exploiting the temporal correlation between consecutive frames in a video sequence. P-frames use only backward-prediction from previous frames whereas B-frames may additionally use forward-prediction from subsequent frames. I-frames only depend on data inside the frame and enable decoding at the start of a sequence. Each frame consists of macroblocks, and an I-frame contains intra-coded macroblocks only. P- and B-frames consist mostly of inter-coded macroblocks, i.e. macroblocks that depend on data of other frames, and may also contain intra-coded macroblocks. The frame types of differential video coding typically follow a fixed sequence, namely an I-frame followed by several P- and B-frames until the next I-frame. The range from an I-frame up to the next I-frame is called a Group of Pictures (GOP), as illustrated in Fig. 1.

Fig. 1 illustrates a conventional GOP structure of an encoded video stream. As can be seen from Fig. 1, the encoded video stream comprises a plurality of I-pictures (intra-coded pictures) and a plurality of P-pictures (inter-coded pictures). The encoded video stream comprises a plurality of consecutive frames which are labeled in Fig. 1 as frames 1 to 7 and the pictures are included within the frames. The encoded video stream comprises a plurality of groups of pictures (GOPs) each of which includes at least one I-picture and one or more P- or B-pictures. In the example shown in Fig. 1, the GOP of the encoded stream comprises as a first frame in the GOP the I-frame, i.e. the frame including the intra-coded picture, whereas the remaining frames 2 to 6 comprise the inter-coded pictures. At frame 7 a new GOP starts.

P- and B-frames also allow intra-coded macroblocks, which may be used to improve the coding gain, since not all macroblocks benefit from inter-frame prediction. Another benefit of intra-coded macroblocks is their positive effect on error resilience, since errors due to missing or corrupt frames propagate only over inter-coded macroblocks. During tune-in into a stream that is transmitted over a bitrate-constrained channel, delay is caused by the random-access point acquisition time and the time that is required to fill the decoder input buffer for continuous decoding without interrupts.

Random access into a differentially coded video sequence may only be done at an I-frame since this frame is guaranteed to contain intra-coded macroblocks only. All following frames potentially have dependencies on the I-frame or other previous frames, and a best-effort decoding of these frames would result in visible image distortion.

Thus, to reduce the delay caused by waiting for the next I-frame to arrive, the distance between random-access points for tune-in could be reduced, resulting in smaller GOPs. Hence more I-frames are beneficial but on the other hand reduce the coding efficiency, which is mainly accomplished by inter-frame prediction for a majority of the available video data.
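The trade-off between GOP length and waiting time can be made concrete with a small calculation. The following is a minimal sketch, assuming that tune-in requests arrive at a uniformly random point within a GOP and that only the wait for the start of the next I-frame is counted (buffering and transmission time are ignored). The function names and the example numbers (50 frames, 25 frames per second) are illustrative assumptions and are not taken from the cited literature.

```python
def worst_case_rap_wait_seconds(gop_length_frames: int, frame_rate_hz: float) -> float:
    """Longest wait for the next I-frame: one full GOP duration."""
    return gop_length_frames / frame_rate_hz


def mean_rap_wait_seconds(gop_length_frames: int, frame_rate_hz: float) -> float:
    """Average wait for the next I-frame when the tune-in time is
    uniformly distributed over the GOP (assumption): half a GOP duration."""
    return worst_case_rap_wait_seconds(gop_length_frames, frame_rate_hz) / 2.0


if __name__ == "__main__":
    # Hypothetical example: a 50-frame GOP at 25 frames per second.
    print(mean_rap_wait_seconds(50, 25.0))        # half of the GOP duration
    print(worst_case_rap_wait_seconds(50, 25.0))  # the full GOP duration
```

Halving the GOP length halves both figures, but it also doubles the number of costly I-frames per unit of time, which is the loss of coding efficiency described above.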

Besides the random-access point, a certain amount of data is also required in the decoder input buffer. This results from the fact that I-frames may easily require up to ten times more coded bits than P- and B-frames due to the potential coding gain of prediction. This behavior results in a highly variable encoded bitrate that may exceed the available bitrate on the channel during I-frame transmission. Therefore all modern video codecs define a strict buffer model that enables transmission of variable-bitrate encoded streams on bandwidth-constrained channels, allowing both encoder and decoder to predict the required amount of data before decoding may be started. However, in order to reduce the buffering delay, a more Constant Bitrate (CBR) behavior is desired.

IPTV suffers from slow channel change behavior in a similar way to digital TV broadcast systems. A disadvantage of IPTV is the high number of additional delay factors, which are however of lesser importance than the influence of the video coding. In contrast, the advantage of IPTV is its flexibility through the use of multicast instead of broadcast, higher available bitrates and the possibility of easily reconfigurable systems in the distribution chain. Hence, fast channel change in IPTV may be accomplished not only by reducing the size of a GOP but also by other methods that provide random-access points and filled decoder buffers in a short time.

Providing RAPs (Random Access Points) inside an IPTV video stream at a higher frequency is an easy and intuitive way to partly solve the problem of slow channel changes. However, this approach is a waste of network resources, especially when clients are in a steady state. One solution to this is to provide a side stream that is a different video encoding of the channel containing more frequent random access points than the normal encoding of the video, the main stream. The side stream is only forwarded to the client upon request, i.e. when the client tunes into a new channel. It is used until an RAP in the main stream is received. The client then switches to the main stream and drops the side stream. To minimize the additional bitrate necessary for the side stream, its quality is reduced, e.g. by reducing the resolution or frame rate (see e.g. J. M. Boyce and A. M. Tourapis, "Fast efficient channel change [set-top box applications]" in Consumer Electronics, 2005, ICCE 2005 Digest of Technical Papers, International Conference on, pp. 1-2, Jan. 2005). In addition, the main stream may utilize longer I-frame distances and thus have a higher compression gain.

Another variation of side streams contains I-frames only as RAPs to the main stream and splices one of these with the main stream. This is described in more detail below (see also U. Jennehag, T. Zhang, and S. Pettersson, "Improving Transmission Efficiency in H.264 based IPTV Systems", IEEE Transactions on Broadcasting, vol. 53, no. 1, pp. 69-78, March 2007, and U. Jennehag and S. Pettersson, "On Synchronization Frames for Channel Switching in a GOP-Based IPTV Environment", 5th IEEE Consumer Communications and Networking Conference 2008, pp. 638-642, Jan 2008). Tune-In Pictures (TIP) are an IPTV tune-in stream technology based on the idea that it is inefficient to send intra-coded frames at a fixed frequency to provide stream resynchronization points. With TIP, a stream consisting of stream RAPs, e.g. I-frames, is separately generated in addition to the main video stream. The main stream consists of a normal GOP structure with a large I-frame distance. Clients who want to decode a video stream, e.g. a TV-channel, must receive both the main video stream and one frame from the TIP stream for an instant RAP. The tune-in picture and the main stream are then spliced, i.e. a P-frame from the main stream is replaced by the corresponding TIP, and decoded. The following example, illustrated in Fig. 2, shows a typical TIP channel switch situation.

Fig. 2 illustrates an example for a channel change using tune-in pictures. Fig. 2 illustrates a first channel A comprising a main stream 100 comprising a plurality of frames including respective inter-coded pictures P_A and intra-coded pictures I_A. In Fig. 2 a time axis indicating times t1 to t13 is shown, and during the time period t1 to t4 the main stream 100 of channel A provides the frames for a first group of pictures GOP_A1. A second group of pictures GOP_A2 is provided starting from time instance t5. Further, channel A comprises a tune-in stream 102 which comprises a plurality of intra-coded pictures for the main stream 100 which are spaced evenly with respect to each other, i.e. every three time instances an intra-coded picture is sent by the tune-in stream 102 of channel A.

Further, Fig. 2 shows the main stream 104 for channel B and the channel B tune-in stream 106. The main stream 104 of channel B comprises a plurality of intra-coded pictures I_B and a plurality of inter-coded pictures P_B. The main stream 104 of channel B, in the situation shown in Fig. 2, comprises a first group of pictures GOP_B1 that starts with an I-frame at time instance t1 and that ends with an inter-coded frame at time instance t12. At time instance t13 a new group of pictures GOP_B2 starts with a new I-frame I_B. The tune-in stream 106 comprises a plurality of intra-coded frames which are provided at time instances t4, t7, t10 and t13. Fig. 2 also shows a spliced stream 108 which is the stream that is presented to a decoder for decoding the encoded main stream. During time instances t1 to t4, frames from the main stream 100 of channel A are within the spliced stream and supplied to the decoder for decoding.

The client requests the channel switch at time instance t5 and the main multicast stream of channel A is immediately left at time instance t5. Then, first the TIP stream for channel B is requested at time instance t5. The client waits until traffic arrives and receives the RAP at time instance t7. The TIP stream is left when the complete RAP is received and the main stream of channel B is now requested. The RAP of the TIP stream and the main stream are spliced and decoded at time instance t7 to generate a valid coded video sequence. Residual artifacts from decoder mismatch, i.e. the mismatch of reference pictures and the resulting drift, are removed when the next I-frame is encountered in the main stream.
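The channel switch of Fig. 2 can be traced with a short simulation. The following is a minimal sketch under simplifying assumptions: time is counted in whole time instances, a tune-in picture that is sent at or after the request is received completely within its own time instance, and the labels in the output are purely illustrative. The function names are hypothetical and do not correspond to any existing library.

```python
from bisect import bisect_left
from typing import List, Optional


def next_tune_in_time(request_t: int, tip_times: List[int]) -> Optional[int]:
    """First time instance at or after the request that carries a tune-in picture."""
    i = bisect_left(tip_times, request_t)
    return tip_times[i] if i < len(tip_times) else None


def spliced_stream(request_t: int, last_t: int, tip_times: List[int]) -> List[Optional[str]]:
    """Label what reaches the decoder at each time instance t1..last_t.

    Before the request: channel A. Between request and RAP: nothing decodable.
    At the RAP: the tune-in picture replaces the P-frame of channel B.
    Afterwards: the main stream of channel B.
    """
    rap_t = next_tune_in_time(request_t, tip_times)
    out: List[Optional[str]] = []
    for t in range(1, last_t + 1):
        if t < request_t:
            out.append(f"A@t{t}")
        elif rap_t is None or t < rap_t:
            out.append(None)
        elif t == rap_t:
            out.append(f"TIP_B@t{t}")
        else:
            out.append(f"B@t{t}")
    return out


if __name__ == "__main__":
    # Values from the Fig. 2 description: request at t5, tune-in pictures at t4, t7, t10, t13.
    print(spliced_stream(5, 9, [4, 7, 10, 13]))
```

With the values of Fig. 2 the sketch yields frames of channel A for t1 to t4, a gap at t5 and t6, the tune-in picture at t7 and main stream frames of channel B from t8 onward.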

One inherent problem with the TIP FCC (FCC = Fast Channel Change) approach is the mismatch between the tune-in picture and the corresponding picture in the main stream, which produces a prediction error in the decoded stream. In addition, an easily visible jump in quality may also be observed when the quality stabilizes (see U. Jennehag and S. Pettersson, "On Synchronization Frames for Channel Switching in a GOP-Based IPTV Environment", 5th IEEE Consumer Communications and Networking Conference 2008, pp. 638-642, Jan 2008). This jump in quality may be reduced by using a tune-in picture with high quality, which generates less prediction error. However, such an approach requires a higher bitrate and reduces the overall performance of the system. The tune-in pictures do not impose any changes on the main stream, which will typically still be coded with unwanted peaks in the bitrate.

Another tune-in approach is based on a gradual decoder refresh. The common principles of differential video coding are well known and reference is now made to a rarely used method, where intra-coded macroblocks are spread over differentially coded frames. Typical applications for this technique are intra-coded macroblocks for improved error resilience, i.e. errors in motion prediction do not propagate (see E. Steinbach, N. Färber, and B. Girod, "Standard Compatible Extension of H.263 for Robust Video Transmission", IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 6, pp. 872-881, Dec. 1997), and the smoothing of I-frame bitrate peaks, which means that a frame is divided into several pieces and these pieces are updated in an interleaved manner as illustrated in Fig. 3. These pieces represent a partition of the frame and are called slices in H.264.

Fig. 3 illustrates, on the basis of the conventional GOP structure shown in Fig. 1, a gradually refreshing structure. In a similar manner as in Fig. 1, for the encoded video stream the frames 1 to 7 are shown. In accordance with the gradually refreshing structure each frame 1 to 7 comprises a plurality of partitions or slices a, b and c. In Fig. 3, frames 1 to 6 form one group of pictures and frame 7 indicates the start of a subsequent group of pictures. The encoded video stream comprises for each I-picture a plurality of intra-coded macroblocks and for each P-frame or B-frame a plurality of inter-coded macroblocks. In the example shown in Fig. 3, it is assumed that each frame comprises three macroblocks. The three macroblocks of the I-picture, other than in Fig. 1, are not provided in a single frame, like frame 1 of Fig. 1; rather the intra-coded or I-macroblocks of the I-picture are distributed or spread among a plurality of frames. In the example of Fig. 3, a first I-macroblock or I-slice of the I-picture for the group of pictures comprising frames 1 to 6 is provided in slice a of frame 1. The second I-macroblock is provided in slice b of frame 2, and the third I-macroblock is provided in slice c of frame 3.

The bitrate smoothing is accomplished by using a method which is known as Independent Segment Decoding (ISD) with shifted intra-coded slices in H.263 (see ITU-T Rec. H.263, "Infrastructure of audiovisual services - Coding of moving video: Video coding for low bit rate communication", International Telecommunication Union, Jan 2005) or gradual decoder refresh (GDR) in H.264.

In the conventional GDR sequence of Fig. 3 the frames are divided into three slices. A GDR sequence starts typically with one I-slice (intra-coded slice) and inter-coded slices which are only using forward prediction up to the next I-slice for the same image partition in the following two frames. In this example, the bits which would be necessary for the first frame if coded as an I-frame are distributed over frames one to three. However, GDR reduces coding efficiency for smaller frame sizes resulting in small slices, because the prediction choices are reduced significantly at slice boundaries. GDR is especially advantageous for low-delay video coding, where it is important that even over a bitrate limited channel the arrival time is predictable and also the required buffer sizes are to be small.

A first method for tune-in into a GDR stream requires starting decoding at the 1st or 7th frame in the example of Fig. 3 for an initially correct picture. An additional parameter has to be signaled, the pre-roll period, that describes how many frames after a random-access point have to be decoded for assured decoding, i.e. until perfect reconstruction is achieved (see G. Sullivan, "On Random Access and Bitstream Format for JVT Video", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, JVT-B063, Jun 2002). In the case of the example of Fig. 3 a pre-roll period of three frames has to be used.
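The pre-roll period for the Fig. 3 example can be derived by counting frames until every slice position has been refreshed by an I-slice. The following is a minimal sketch, assuming that each frame is written as a string with one character per slice ("I" for an intra-coded slice, "P" for an inter-coded slice) and that decoding starts at the first frame of a GOP. This notation and the function name are illustrative assumptions, not part of any standard.

```python
from typing import List, Optional

# Fig. 3 layout for frames 1 to 6, slices a, b, c (illustrative notation).
FIG3_GOP: List[str] = ["IPP", "PIP", "PPI", "PPP", "PPP", "PPP"]


def pre_roll_frames(layout: List[str], start: int = 0) -> Optional[int]:
    """Number of frames, counted from layout[start], that must be decoded
    until every slice position has received an I-slice.

    Returns None if the refresh does not complete within the given frames.
    """
    num_slices = len(layout[0])
    refreshed = set()
    for count, frame in enumerate(layout[start:], start=1):
        refreshed.update(i for i, kind in enumerate(frame) if kind == "I")
        if len(refreshed) == num_slices:
            return count
    return None


if __name__ == "__main__":
    print(pre_roll_frames(FIG3_GOP))  # 3, matching the pre-roll period stated for Fig. 3
```

The sketch only tracks which slice positions have been intra-refreshed; it does not model the prediction restrictions that a real GDR encoder must also observe.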

The second method for tune-in into a GDR stream is known as best-effort decoding (see G. Sullivan, "On Random Access and Bitstream Format for JVT Video", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, JVT-B063, Jun 2002) and is illustrated in Fig. 4.

Fig. 4 describes a GDR structure for a best-effort decoding. In a similar manner as Fig. 3, each frame comprises a plurality of slices a, b, and c. Other than in Fig. 3, in the structure for the best-effort decoding the I-macroblocks for an I-picture of a GOP are distributed in frames which are separated from each other by at least one frame. More specifically, the first I-macroblock is provided in slice a of frame 1, the second I-macroblock is provided in slice b of frame 3 and the third I-macroblock is provided in slice c of frame 5. Preferably, as is shown in Fig. 4, the I-slices or I-macroblocks are evenly distributed among the frames of a GOP.
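The slice layouts of Figs. 3 and 4 differ only in the distance between the frames that carry an I-slice. The following is a minimal sketch that generates both layouts, assuming one character per slice ("I" or "P") and a hypothetical parameter `stride` for the frame distance between successive I-slices. Neither the notation nor the parameter name comes from H.264.

```python
from typing import List


def gdr_layout(num_frames: int, num_slices: int, stride: int = 1) -> List[str]:
    """Slice types per frame for one GOP.

    The k-th I-slice (k = 0, 1, ...) is placed in slice position k of
    frame k * stride; all other slices are inter-coded.
    """
    layout = []
    for f in range(num_frames):
        row = ["P"] * num_slices
        if f % stride == 0 and f // stride < num_slices:
            row[f // stride] = "I"
        layout.append("".join(row))
    return layout


if __name__ == "__main__":
    print(gdr_layout(6, 3, stride=1))  # Fig. 3: I-slices in frames 1, 2, 3
    print(gdr_layout(6, 3, stride=2))  # Fig. 4: I-slices in frames 1, 3, 5
```

With six frames and three slices, a stride of two spreads the I-slices evenly over the GOP, which is the preferred distribution mentioned above.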

Decoding may start at any frame containing an I-slice, and missing referenced macroblocks are initialized with either a certain color (e.g. mid-level gray, Y = Cb = Cr = 128) or any other defined value. It is assumed that both intra-coded macroblocks follow soon and enough redundancy is still available in following inter-coded macroblocks so that the image quality converges after some frames. However, the fact that all but one slice are initialized with mid-level gray leads to "dirty random access", which is an essential shortcoming prohibiting the introduction of GDR in high-quality IPTV.
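The effect of "dirty random access" can be illustrated by tracking which slice positions are backed by real intra-coded data after tune-in. The following is a minimal sketch under a deliberately pessimistic assumption: a slice position counts as clean only once an I-slice for that position has been received, and the gradual convergence through inter-coded macroblocks is not modeled. The layout notation ("I"/"P" per slice) and the function names are illustrative.

```python
from typing import List, Tuple

MID_GRAY: Tuple[int, int, int] = (128, 128, 128)  # Y, Cb, Cr

# Fig. 4 layout for frames 1 to 6, slices a, b, c (illustrative notation).
FIG4_GOP: List[str] = ["IPP", "PPP", "PIP", "PPP", "PPI", "PPP"]


def initial_reference(num_slices: int) -> List[Tuple[int, int, int]]:
    """Reference picture before tune-in: every slice is mid-level gray."""
    return [MID_GRAY] * num_slices


def clean_slices_after_tune_in(layout: List[str], start: int) -> List[List[int]]:
    """For each frame from layout[start] on, list the slice positions
    that have already been refreshed by an I-slice."""
    clean = set()
    history = []
    for frame in layout[start:]:
        clean |= {i for i, kind in enumerate(frame) if kind == "I"}
        history.append(sorted(clean))
    return history


if __name__ == "__main__":
    # Tune-in at frame 3 (index 2): slice a stays gray-based for the rest of this GOP.
    print(clean_slices_after_tune_in(FIG4_GOP, start=2))
```

Tuning in at frame 3 of the Fig. 4 layout leaves slice a without intra-coded data until the next GOP, which is the shortcoming described above.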

It is an object of the invention to provide an improved approach allowing fast tuning into a differentially encoded video stream.

This object is solved by a method of claim 1, by an encoder of claim 6, by a method of claim 7, and by a decoder of claim 11.

Embodiments of the invention provide a method for providing a tune-in stream for an encoded video stream of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames, the method comprising:

providing a tune-in stream comprising a plurality of tune-in pictures, wherein a tune-in picture is provided for a frame of the encoded video stream that comprises an intra-coded macroblock of an intra-coded picture,

wherein the tune-in picture comprises the remaining intra-coded macroblocks of the intra-coded picture.

Further, embodiments of the invention provide an encoder for providing a tune-in stream for an encoded video stream of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames, wherein the encoder is configured to provide a tune-in stream comprising a plurality of tune-in pictures, wherein a tune-in picture is provided for a frame of the encoded video stream that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture comprises the remaining intra-coded macroblocks of the intra-coded picture.

Further, embodiments of the invention provide a method for tuning into an encoded video stream, the method comprising:

providing an encoded video stream of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames;

providing a tune-in stream comprising a plurality of tune-in pictures, wherein a tune-in picture is provided for a frame of the encoded video stream that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture comprises the remaining intra-coded macroblocks of the intra-coded picture;

upon receiving a tune-in request, tuning into the encoded video stream and the tune-in stream;

upon receiving a tune-in picture in the tune-in stream, splicing the tune-in picture and the encoded video stream.

Further, embodiments of the invention provide a decoder for receiving encoded data and providing decoded output data, the decoder comprising:

an input for receiving an encoded video stream of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames, and a tune-in stream comprising a plurality of tune-in pictures, wherein a tune-in picture is provided for a frame of the encoded video stream that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture comprises the remaining intra-coded macroblocks of the intra-coded picture; and

a decoding portion coupled to the input and configured to tune into the encoded video stream and the tune-in stream upon receiving a tune-in request, upon receiving a tune-in picture in the tune-in stream, to splice the tune-in picture and the encoded video stream, and to decode the spliced stream.

Embodiments of the invention provide a computer readable medium for storing instructions for executing the methods in accordance with embodiments of the invention.

The invention provides a novel approach for tuning into a main stream, e.g. for a fast channel change, based on tune-in streams. Embodiments of the invention are based on a gradual decoder refresh (GDR) in accordance with H.264 for the main stream (the differentially encoded video stream) and for the tune-in stream. In normal GDR mode a receiver would have to wait for several received NAL units of multiple frames before it may render a complete frame without visible GDR artifacts. An additional tune-in stream is provided that fills the missing regions of a partially complete frame. The simulation results discussed later show that the solution is at least as bitrate-efficient as previous tune-in solutions but offers the advantages of a less variable encoded bitrate and a consistent gradual improvement of picture quality.

Embodiments of the invention are advantageous as the random-access point (RAP) acquisition time is greatly improved using a side-stream. In addition a reduction in decoder pre-buffer size requirements is obtained.

Embodiments of the invention concern the combination of the tune-in picture technology and best-effort gradual decoder refresh. A GDR stream as described above with regard to Fig. 4 is used for the main stream, and in addition a tune-in stream is provided that comprises the missing slices needed for tune-in with a complete I-picture. This overcomes the "dirty random access" drawback of best-effort GDR streams and conserves the advantageous properties of GDR streams. This approach is hereafter also referred to as Gradual Tune-In Pictures (G-TIP). For this embodiment, the best-effort GDR is considered a viable alternative encoding variant for fast tune-in due to its property of equally distributed RAPs, though the decoder output is initially a partially incorrect image. In addition this structure brings the advantage of a relatively equally distributed bitrate, which facilitates the signaling of a level with a smaller buffer requirement and hence a lower initial pre-buffering delay. This aspect is also important as the buffer requirements are also a key factor of tune-in delay.

In accordance with embodiments of the invention the encoded video stream and the tune-in stream are provided to a receiver directly, for example from a service provider, or via a server from which the respective streams can be obtained upon user request. Embodiments of the invention teach a method in accordance with which each frame of the encoded video stream comprises a plurality of partitions, wherein a frame comprising an intra-coded macroblock comprises only a single intra-coded macroblock in one of its partitions, and the remaining intra-coded macroblocks of the intra-coded picture are comprised within respective following frames. In accordance with embodiments, the intra-coded macroblocks are provided in consecutive frames or in frames having therebetween one or more frames without intra-coded macroblocks.

In accordance with embodiments of the invention, the encoded video stream comprises a plurality of groups of pictures (GOPs), each of which comprises at least one intra-coded picture and one or more inter-coded pictures, wherein each GOP comprises a plurality of frames, and wherein the intra-coded macroblocks are spread among the plurality of frames of the GOP. In accordance with an embodiment, the intra-coded macroblocks are spread evenly among the plurality of frames of the GOP.

In the following, embodiments of the invention will be described in further detail on the basis of the accompanying drawings, in which:

Fig. 1 shows a conventional GOP structure of a differentially encoded video stream;

Fig. 2 shows an example of a channel change on the basis of tune-in pictures;

Fig. 3 shows a GDR-structure of a differentially encoded video stream;

Fig. 4 shows a GDR-structure for best effort decoding;

Fig. 5 illustrates an approach for tuning into a main stream using gradual tune-in pictures in accordance with an embodiment of the invention;

Fig. 6 shows an encoder/decoder set-up for illustrating a system using the inventive approach for fast tune-in; and

Figs. 7-10 show the tune-in PSNR traces for different video sequences.

In accordance with embodiments of the invention, a novel approach for fast tuning into a main stream, for example, for a fast channel change based on a tune-in stream, is provided. The novel approach is based on the combination of the above described tune-in picture technology and the gradual decoder refresh. A GDR stream is used as the main stream and in addition a tune-in stream is provided comprising a plurality of tune-in pictures, wherein the tune-in pictures comprise the missing slices needed for tune-in with a complete picture.

Fig. 5 shows an embodiment of the invention using a combination of a tune-in picture technology and the best-effort decoding GDR stream. Fig. 5 shows at 200 the main stream or video encoded stream. This video encoded stream comprises a plurality of frames 1 to 7 of which frames 1 to 6 form a first group of pictures GOP₁, and with frame 7 a new group of pictures GOP₂ starts. Each of the frames comprises three slices a, b and c, and the intra-coded macroblocks for an I-picture of the group of pictures GOP₁ are evenly distributed across the frames of group GOP₁, i.e. are evenly distributed among the plurality of frames 1 to 6. In the embodiment shown in Fig. 5 the intra-coded macroblocks for the I-picture of the group GOP₁ are provided in slice a of frame 1, in slice b of frame 3 and in slice c of frame 5.

Fig. 5 also shows the tune-in stream 202 in accordance with the invention. The tune-in stream comprises a plurality of tune-in pictures 202₁ to 202₄ which are provided at a predefined interval. In the example shown in Fig. 5 the tune-in pictures 202₁ to 202₄ are provided for every second frame of the main stream 200, i.e. at frame positions 1, 3, 5, 7. The tune-in pictures 202₁ to 202₄ have a similar structure as the frames 1 to 7 of the main stream 200, i.e. each tune-in picture comprises three slices a, b, and c. In accordance with embodiments of the invention, the tune-in pictures 202₁ to 202₄ are provided for those frames in the main stream which include an I-macroblock, i.e. in the embodiment of Fig. 5 for frames 1, 3, 5 and 7. The tune-in pictures further comprise those I-macroblocks which are missing from the associated frame in the main stream, i.e. these tune-in pictures comprise the "remaining" I-macroblocks. To be more specific, the tune-in picture 202₁ associated with frame 1 of the main stream 200 comprises the I-macroblocks that are present in the main stream in frame 3 at slice b and in frame 5 at slice c. In a similar manner, tune-in picture 202₂ comprises those I-macroblocks missing from main stream frame 3, namely the I-macroblock from slice a of frame 1 of the main stream and the I-macroblock of slice c of frame 5 of the main stream 200. The same is true for tune-in pictures 202₃ and 202₄.
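The rule for deriving the tune-in pictures from the main stream layout can be sketched in a few lines. The following is a minimal illustration, assuming that each frame is written as a string with one character per slice ("I" or "P") and that a slice position already covered by the main stream frame is marked "-" in the tune-in picture. The notation and the function name are assumptions made for this sketch and say nothing about how an actual encoder codes the slices.

```python
from typing import Dict, List

# Fig. 5 main stream 200: frames 1 to 6 (first GOP) and frame 7 (start of the next GOP).
FIG5_MAIN: List[str] = ["IPP", "PPP", "PIP", "PPP", "PPI", "PPP", "IPP"]


def tune_in_pictures(main_layout: List[str]) -> Dict[int, str]:
    """Map frame number (1-based) to the slice layout of its tune-in picture.

    A tune-in picture exists only for frames that carry an I-slice and
    holds I-slices for all remaining slice positions.
    """
    pictures: Dict[int, str] = {}
    for frame_no, frame in enumerate(main_layout, start=1):
        if "I" in frame:
            pictures[frame_no] = "".join("-" if kind == "I" else "I" for kind in frame)
    return pictures


if __name__ == "__main__":
    for frame_no, picture in tune_in_pictures(FIG5_MAIN).items():
        print(frame_no, picture)
```

For the Fig. 5 layout this yields tune-in pictures at frame positions 1, 3, 5 and 7, each carrying I-slices for the two positions that the associated main stream frame codes as inter-coded slices.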

Further, Fig. 5 shows the spliced stream 204. In a similar manner as described above with regard to Fig. 2, upon requesting the tune-in into the main stream, for example at the position where frame 2 is presented, the tune-in stream 202 is obtained and the main stream 200 is obtained. However, at frame position 2 no decoding or splicing can occur as no intra-coded information is available here. At frame position 3, by splicing the tune-in picture 202₂ and frame 3 from the main stream 200, the spliced stream comprises a complete I-frame that allows starting decoding the main stream.
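The splicing step for the tune-in request at frame position 2 can be traced as follows. This is a minimal sketch, assuming the same illustrative notation as before: one character per slice ("I" or "P") for main stream frames, and "-" in a tune-in picture for the slice position that the main stream frame already provides as an I-slice. All names are hypothetical.

```python
from typing import Dict, List, Optional

FIG5_MAIN: List[str] = ["IPP", "PPP", "PIP", "PPP", "PPI", "PPP", "IPP"]
FIG5_TUNE_IN: Dict[int, str] = {1: "-II", 3: "I-I", 5: "II-", 7: "-II"}


def first_splice_frame(request_frame: int, tune_in: Dict[int, str], last_frame: int) -> Optional[int]:
    """First frame position at or after the request for which a tune-in picture exists."""
    for frame_no in range(request_frame, last_frame + 1):
        if frame_no in tune_in:
            return frame_no
    return None


def splice(main_frame: str, tune_in_picture: str) -> str:
    """Take each slice from the tune-in picture, except where it is marked '-',
    in which case the slice of the main stream frame is kept."""
    return "".join(m if t == "-" else t for m, t in zip(main_frame, tune_in_picture))


if __name__ == "__main__":
    frame_no = first_splice_frame(2, FIG5_TUNE_IN, len(FIG5_MAIN))
    print(frame_no, splice(FIG5_MAIN[frame_no - 1], FIG5_TUNE_IN[frame_no]))
```

For a request at frame position 2 the first usable position is frame 3, and the spliced frame consists of I-slices only, so decoding can start there.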

While Fig. 5 shows an embodiment in accordance with which the I-macroblocks are evenly distributed among the plurality of frames of the GOP₁, it is noted that the inventive approach is also applicable to main streams having a different distribution of the I-macroblocks among the frames within a GOP₁; for example, the inventive approach may also be applied to a GDR structure as shown in Fig. 3. In such a situation, the tune-in stream would comprise three consecutive tune-in pictures having a structure as pictures 202₁ to 202₃. Also, it is possible to distribute the I-macroblocks among the plurality of frames with greater distances therebetween or with different distances therebetween, i.e. the number of frames between two successive I-macroblocks may vary.

Further, Fig. 5 shows an embodiment in accordance with which the I-macroblocks are provided such that, starting with a first frame of a GOP, first an I-macroblock for a first slice, then an I-macroblock for a second slice and then an I-macroblock for a third slice is provided. The invention is not limited to such an approach; rather, the order in which the I-macroblocks are provided with regard to the slice position is arbitrary as long as the associated tune-in pictures provide for the remaining (missing) I-macroblocks of the associated main stream frame. For example, in Fig. 5 frame 1 may comprise the I-macroblock also in slice b or in slice c, and in the remaining frames 3 and 5 the macroblocks of the other slices would be provided.

Further, while Fig. 5 shows an embodiment of frames having three slices it is noted that the invention is not limited to such an approach, rather a plurality of slices should be provided, i.e. two or more slices, for example five slices as we will discuss below with regard to experimental results obtained by applying the inventive tune-in approach to test scenes.

Fig. 6 is a schematical representation of a system comprising an encoder 300 and a decoder 400. The encoder 300 comprises an input 302 receiving video information to be encoded, for example video information in the form of YUV-data. This data is provided to a main encoder 304 and, in parallel, to a tune-in encoder 306. The main encoder 304 provides the encoded video stream or main stream in a manner as described above with regard to Figs. 3, 4 and 5, and the tune-in encoder provides the tune-in stream in accordance with embodiments of the invention, for example in a manner as described above with regard to Fig. 5 by providing tune-in pictures having those I-macroblocks missing from an associated main frame block. The encoder 300 comprises an output for providing both main stream and the tune-in stream together, for example to a communication network or the like, as it is schematically illustrated by the arrows leaving the blocks labeled main encoder and tune-in encoder.
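The relation between a main-stream frame and its tune-in picture can be illustrated with a minimal sketch. It assumes the three-slice layout described for Fig. 5 (slices a, b, c, with frames 1, 3 and 5 each carrying one intra-coded slice); the dictionary representation and the names `MAIN_STREAM` and `build_tune_in_stream` are hypothetical and are used for illustration only, they do not describe the actual tune-in encoder 306.

```python
from typing import Dict, List

# Slice types of main-stream frames 1..6 of one GOP, following the layout
# described for Fig. 5 (assumed here): frames 1, 3 and 5 carry one I-slice.
MAIN_STREAM: Dict[int, Dict[str, str]] = {
    1: {"a": "I", "b": "P", "c": "P"},
    2: {"a": "P", "b": "P", "c": "P"},
    3: {"a": "P", "b": "I", "c": "P"},
    4: {"a": "P", "b": "P", "c": "P"},
    5: {"a": "P", "b": "P", "c": "I"},
    6: {"a": "P", "b": "P", "c": "P"},
}


def build_tune_in_stream(main_stream: Dict[int, Dict[str, str]]) -> Dict[int, List[str]]:
    """Return, per frame number, the slices a tune-in picture has to supply."""
    tune_in = {}
    for frame_no, slices in sorted(main_stream.items()):
        if "I" not in slices.values():
            continue  # no intra-coded slice in this frame, so no tune-in picture
        # The tune-in picture carries intra-coded versions of exactly those
        # slices that the main-stream frame codes predictively.
        tune_in[frame_no] = sorted(name for name, kind in slices.items() if kind != "I")
    return tune_in


TUNE_IN_STREAM = build_tune_in_stream(MAIN_STREAM)
print(TUNE_IN_STREAM)
```

In this sketch a tune-in picture exists only for frames 1, 3 and 5, and each one lists the two slices that are missing from the associated main-stream frame.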

The decoder 400 comprises a splice portion 402 and a decoder portion 404. An input of the decoder 400 receives the combined main stream and tune-in stream and inputs same to the splice portion 402 which operates upon receiving a tune-in request in a manner as described above with regard to Fig. 5. The spliced stream 204 shown in Fig. 5 is applied from the splice portion 402 to the decoder portion 404 so that the decoder 400 provides at its output 406 the decoded video stream.
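The behavior of the splice portion upon a tune-in request can be sketched as follows. This is an illustrative model only, assuming the three-slice layout of Fig. 5; the data structures and the function name `splice` are hypothetical and do not reflect an actual bitstream-level implementation.

```python
# Slice types of the main stream (frames 1..6) and the slices supplied by the
# tune-in pictures, assuming the three-slice layout described for Fig. 5.
MAIN_STREAM = {
    1: {"a": "I", "b": "P", "c": "P"},
    2: {"a": "P", "b": "P", "c": "P"},
    3: {"a": "P", "b": "I", "c": "P"},
    4: {"a": "P", "b": "P", "c": "P"},
    5: {"a": "P", "b": "P", "c": "I"},
    6: {"a": "P", "b": "P", "c": "P"},
}
TUNE_IN_STREAM = {1: ["b", "c"], 3: ["a", "c"], 5: ["a", "b"]}


def splice(main_stream, tune_in_stream, request_frame):
    """Return (start_frame, slice_sources) for the first decodable frame.

    Frames without a tune-in picture are skipped, because no complete
    intra-coded picture can be assembled at those positions.
    """
    for frame_no in sorted(main_stream):
        if frame_no < request_frame or frame_no not in tune_in_stream:
            continue
        sources = {}
        for name, kind in main_stream[frame_no].items():
            if name in tune_in_stream[frame_no]:
                sources[name] = "tune-in I"      # intra slice taken from the tune-in picture
            else:
                sources[name] = "main " + kind   # slice kept from the main stream
        return frame_no, sources
    return None  # no tune-in picture at or after the requested position


START = splice(MAIN_STREAM, TUNE_IN_STREAM, request_frame=2)
print(START)
```

For a request at frame 2 the sketch skips frame 2 and starts at frame 3, where slice b comes from the main stream and slices a and c come from the tune-in picture, so that every slice of the first decoded frame is intra-coded.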

To demonstrate the advantages of the invention over conventional approaches, an experimental setup similar to the one in Fig. 6 was made. In this experimental setup the encoder 300 and the decoder 400 are directly connected with each other and the output of the decoder portion 404 is fed to a PSNR measurement device 500 (PSNR = peak signal-to-noise ratio). Further, the original input signal 302 is also applied to block 500 via line 502 to evaluate the coding/decoding efficiency. In accordance with the experimental setup, a main stream and a tune-in stream in accordance with embodiments of the invention were generated and compared to a main stream and a tune-in stream provided in accordance with conventional approaches (see Fig. 2).

Using the experimental setup described in Fig. 6, a set of sequences are encoded into a main stream and tune-in stream. The resulting frames are spliced and decoded and the objective quality for the tune-in period is measured with luminance PSNR. The tune-in period of the spliced stream is defined as the period starting at the first decoded frame until the frame where the quality has stabilized, i.e. a complete refresh from the main stream has occurred for the G-TIP and the next complete I-frame for the TIP. The above described approach is used for both the TIP and G-TIP scenario.
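For reference, the luminance PSNR of a decoded frame against the original frame may be computed as in the following minimal sketch. It assumes 8-bit luma samples (peak value 255) and represents a luma plane as a list of rows; the function name `luma_psnr` is hypothetical and the sketch is not the measurement device 500 itself.

```python
import math


def luma_psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized luma planes (lists of rows)."""
    squared_error = 0
    samples = 0
    for ref_row, dec_row in zip(reference, decoded):
        for ref, dec in zip(ref_row, dec_row):
            squared_error += (ref - dec) ** 2
            samples += 1
    mse = squared_error / samples
    if mse == 0:
        return math.inf  # identical planes
    return 10 * math.log10(peak ** 2 / mse)


# Tiny 2x2 example: a uniform error of 1 gives MSE = 1.
original = [[100, 100], [100, 100]]
degraded = [[101, 99], [101, 99]]
print(round(luma_psnr(original, degraded), 2))
```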

Four sequences from the NTIA/ITS selection were retrieved from the Video Quality Experts Group FTP server to be used. The first sequence is a slow moving scene of leaves on a tree blowing in the wind (Aspen). Sequence two (Red Kayak) is a part of a kayak in whitewater. The third sequence is a static clip of a snow covered mountain side surrounded by slow moving clouds (SnowMnt). The final clip is an American football kickoff (TouchdownPass) which includes moving players and some accelerating panning. All sequences are 100 frames long and have a resolution of 1280 × 720 pixels at 25 frames per second.

The main streams for TIP and G-TIP both use a fixed quantizer parameter of 24 and the number of slices per frame is fixed to five. In addition, the I-frame distance in the TIP encodings is set to 25 frames, equaling one complete IDR frame per second. The encoding of the G-TIP streams is chosen such that the number of I-slices per second is identical to the TIP streams. Five intra-coded slices are distributed in a way similar to Fig. 4, i.e. every 5th frame contains one I-slice and four P-slices.
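The equality of the number of I-slices per second in both configurations can be checked with a short sketch. It assumes that frames are numbered from 0, that the I-slices fall on multiples of five, and that the slices are refreshed in ascending order; these details, as well as the function names, are assumptions made for illustration and are not stated in the experiment description.

```python
FRAMES_PER_SECOND = 25
SLICES_PER_FRAME = 5
I_SLICE_INTERVAL = 5    # G-TIP: every 5th frame carries one I-slice
I_FRAME_DISTANCE = 25   # TIP: one complete IDR frame per second


def gtip_slice_types(frame_no):
    """Slice types of a G-TIP main-stream frame (assumed refresh order)."""
    types = ["P"] * SLICES_PER_FRAME
    if frame_no % I_SLICE_INTERVAL == 0:
        refreshed = (frame_no // I_SLICE_INTERVAL) % SLICES_PER_FRAME
        types[refreshed] = "I"
    return types


def tip_slice_types(frame_no):
    """Slice types of a TIP main-stream frame."""
    kind = "I" if frame_no % I_FRAME_DISTANCE == 0 else "P"
    return [kind] * SLICES_PER_FRAME


def i_slices_per_second(slice_types):
    """Count the I-slices within the first second of a stream."""
    return sum(slice_types(n).count("I") for n in range(FRAMES_PER_SECOND))


print(i_slices_per_second(gtip_slice_types), i_slices_per_second(tip_slice_types))
```

Both counts are five: the TIP stream spends them in a single IDR frame, whereas the G-TIP stream spreads them over five frames.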

The tune-in pictures used for G-TIP and TIP are encoded with a low quality, using the fixed quantizer parameter 45. The encodings of all G-TIP and TIP sequences are done with the JM 14.2 encoder, which was slightly modified for G-TIP encoding where applicable. The JM H.264 reference software (see Karsten Sühring, "IP Homepage - H.264/AVC JM Reference Software", Dec 2008, http://iphone.hhi.de/suehring.tml/) was modified for GDR and several encodings of publicly available test sequences in HD resolution were made as discussed below. As will be seen in the following discussion, the bitrate overhead for similar quality was below 0.5% in the investigated sequences.

Eight different scenarios are investigated for which the tune-in quality is studied. The scenarios consist of the four sequences with both G-TIP and TIP tune-in. The tune-in position is set to frame 30 with respect to the main stream. This tune-in frame number is used for all the investigated scenarios. Hence, frames 30 to 49 represent the transition period. From frame 50 onwards only the main stream is decoded.

For all encodings the tune-in picture and the main stream were spliced offline and fed to several decoders, all of which were able to decode the spliced stream.

For better comparison of the TIP and G-TIP tune-in quality, the longest possible transition period was chosen, as this is the worst-case scenario for TIP. For the investigated scenario this is a tune-in 20 frames before the next I-frame of the main stream.
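The choice of frame 30 as the worst case can be reproduced with a small calculation. The sketch assumes that complete I-frames of the TIP main stream lie at multiples of 25 (consistent with frame 50 being the next I-frame) and that the candidate tune-in positions are the frames at multiples of five, i.e. the frames for which a gradual tune-in picture exists; both points, and the function name, are assumptions made for illustration.

```python
I_FRAME_DISTANCE = 25   # TIP main stream: complete I-frame every 25 frames
I_SLICE_INTERVAL = 5    # assumed spacing of the candidate tune-in positions


def tip_transition_frames(tune_in_frame, i_frame_distance=I_FRAME_DISTANCE):
    """Number of frames decoded before the next complete I-frame arrives."""
    # Ceiling division gives the index of the next I-frame at or after tune-in.
    next_i_frame = -(-tune_in_frame // i_frame_distance) * i_frame_distance
    return next_i_frame - tune_in_frame


# Candidate tune-in positions within the I-frame period starting at frame 25.
CANDIDATES = list(range(25, 50, I_SLICE_INTERVAL))
WORST_CASE = max(CANDIDATES, key=tip_transition_frames)
print(WORST_CASE, tip_transition_frames(WORST_CASE))
```

Under these assumptions the longest transition period is obtained for a tune-in at frame 30 and spans frames 30 to 49, i.e. 20 frames.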

The results from the experiment described above are now discussed. Figs. 7-10 show the luminance PSNR plots for the G-TIP and TIP approaches for the four investigated sequences. The main difference between the results for the different sequences is caused by the selection of the actual scenes. This is most noticeable when comparing the results of the SnowMnt (Fig. 9) and Red Kayak (Fig. 8) sequences. SnowMnt is a very static scene with almost no movement and with a high percentage of predicted macroblocks. The Red Kayak sequence, on the other hand, includes a scene with a lot of movement and fast moving details, which requires a high percentage of the macroblocks to be intra-coded, which in turn means that the initial prediction error caused by the tune-in stream is quickly corrected. Note the "staircase" effect in the SnowMnt and TouchdownPass sequences. This is the result of the gradual introduction of the intra-coded slices in the main stream and clearly shows that the next level in quality is reached with the next GDR I-slice.

G-TIP provides a gain compared to TIP ranging from 0.47 to 4.1 dB average frame PSNR difference for the investigated scenarios. Table 1 shows the mean frame PSNR difference values for the tune-in period.

Table 1: Mean frame PSNR differences

TIP performance is coupled to the transition period length, i.e. the number of frames to the next I-frame in the main stream. Further, an informal subjective test also indicated that G-TIP is superior to TIP.
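The mean frame PSNR difference over the tune-in period may be computed as in the following minimal sketch. The per-frame PSNR values used in the example are made-up placeholders that only demonstrate the calculation; they are not measurement results, and the function name is hypothetical.

```python
def mean_psnr_difference(gtip_psnr, tip_psnr):
    """Mean of the per-frame differences (G-TIP minus TIP) in dB."""
    if not gtip_psnr or len(gtip_psnr) != len(tip_psnr):
        raise ValueError("both lists must be non-empty and of equal length")
    differences = [g - t for g, t in zip(gtip_psnr, tip_psnr)]
    return sum(differences) / len(differences)


# Placeholder values for a four-frame tune-in period (not measured data).
example_gtip = [30.0, 32.0, 34.0, 36.0]
example_tip = [29.0, 30.0, 33.0, 34.0]
print(mean_psnr_difference(example_gtip, example_tip))
```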

Embodiments of the invention provide tune-in streams and gradual decoder refresh to enable fast tune-in for IPTV. Tune-in pictures are an easily applicable technique for a fast tune-in solution. However, there is a steep quality jump from the low-quality tune-in stream to the high-quality main stream. Even for TIPs received close to an I-frame, there will be an easily visible steep quality jump, with the single exception of an I-frame of the main stream that is received right from the beginning. Gradual decoder refresh with best-effort decoding, on the other hand, is advantageous due to its less variably encoded bitrate property in combination with a high number of possible random-access points. However, it suffers from visible artifacts due to missing reference pictures. To overcome this disadvantage, embodiments of the invention provide the combination of tune-in streams and gradual decoder refresh. The quality jump is reduced by a gradually refreshing main stream and a corresponding tune-in stream that provides intra-coded slices where normally mid-level gray would be assumed. The above discussed results show that the bitrate overhead of gradual decoder refresh in general is negligible for high resolutions. The additional bitrate needed for the tune-in stream is comparable to the bitrate of a tune-in picture based stream and offers higher quality during the tune-in period. It also has the advantage that the transition from the tune-in stream to the main stream is predictable in terms of quality improvement for any channel change event.

Embodiments of the invention concern approaches for tuning into a stream which comprises as a main stream an encoded video stream and a tune-in stream, wherein the stream may be a single stream which is provided to a user, e.g. over a network, like the Internet. The stream containing e.g. a video content may be provided by a service provider such that a user may tune into the stream at any time. In such a situation, after receiving the tune-in request the stream including both the main stream and the tune-in stream is received by the user. In another embodiment of the invention, the stream is obtained by a user on the user's demand, e.g. from a service provider. The stream (e.g. video on demand) is received by the user and when tuning into the stream decoding of the stream starts after obtaining the tune-in picture from the tune-in stream and splicing the main stream and the tune-in picture.

In accordance with other embodiments of the invention the encoded video stream and the tune-in stream are associated with a channel of a multi-channel transmission system, and a tune-in request indicates a change from a current channel of the multi-channel transmission system to a new channel of the multi-channel transmission system.

In the description of the embodiments of the invention, the self-contained blocks and the non-self-contained blocks of the streams were named as I-pictures and P- or B-pictures, respectively. It is noted, that the term "picture", in general, determines an encoded content that includes data or information that is necessary to decode the content of the block. In case of I-pictures all data or information is included that is necessary to decode the complete content of the block, whereas in case of P- or B-pictures not all information is included that is necessary to decode a complete picture, rather additional information from preceding or following pictures is required.

Although some aspects of the invention were described in the context of an apparatus, it is noted that these aspects also represent a description of the corresponding method, i.e., a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.

The combined signal comprising the encoded video stream and the tune-in stream may be stored on a digital storage medium or may be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.

Depending on certain implementation requirements, embodiments of the invention may be implemented in hardware or in software. The implementation may be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Other embodiments of the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

Further, embodiments of the invention may be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier. Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.

A further embodiment of the invention is a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.

Yet a further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.

The above described embodiments are merely illustrative for the principles of the invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

Claims

1. A method for providing a tune-in stream (202) for an encoded video stream (200) of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream (200) comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames, the method comprising:

providing a tune-in stream (202) comprising a plurality of tune-in pictures (202₁-202₄), wherein a tune-in picture (202₁-202₄) is provided for a frame of the encoded video stream (200) that comprises an intra-coded macroblock of an intra-coded picture,

wherein the tune-in picture (202₁-202₄) comprises the remaining intra-coded macroblocks of the intra-coded picture.

2. The method of claim 1, wherein each frame of the encoded video stream (200) comprises a plurality of partitions (a, b, c), wherein frames (1, 3, 5, 7) comprising an intra-coded macroblock of an intra-coded picture comprise a single intra-coded macroblock in one of its partitions (a, b, c), and wherein the remaining intra-coded macroblocks for the intra-coded picture are comprised within respective following frames of the encoded video stream (200).

3. The method of claim 2, wherein the intra-coded macroblocks of the intra-coded picture are provided in consecutive frames of the encoded video stream, or are provided in frames (1, 3, 5, 7) of the encoded video stream (200), having therebetween one or more frames (2, 4, 6) without intra-coded macroblocks.

4. The method of one of claims 1 to 3, wherein the encoded video stream (200) comprises a plurality of groups of pictures (GOPs), each GOP comprising at least one intra-coded picture and one or more inter-coded pictures, wherein each GOP comprises a plurality of frames (1, 2, 3, 4, 5, 6), and wherein the intra-coded macroblocks of the at least one intra-coded picture are spread among the plurality of frames (1, 2, 3, 4, 5, 6) of the GOP.

5. The method of claim 4, wherein the intra-coded macroblocks are spread evenly among the plurality of frames (1, 2, 3, 4, 5, 6) of the GOP.

6. An encoder for providing a tune-in stream (202) for an encoded video stream (200) of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream (200) comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames,

wherein the encoder (300) is configured to provide a tune-in stream (202) comprising a plurality of tune-in pictures (202₁-202₄), wherein a tune-in picture (202₁-202₄) is provided for a frame of the encoded video stream (200) that comprises an intra-coded macroblock of an intra-coded picture,

wherein the tune-in picture (202₁-202₄) comprises the remaining intra-coded macroblocks of the intra-coded picture.

7. A method for tuning into an encoded video stream (200), the method comprising:

providing an encoded video stream (200) of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream (200) comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames;

providing a tune-in stream (202) comprising a plurality of tune-in pictures (202₁-202₄), wherein a tune-in picture (202₁-202₄) is provided for a frame of the encoded video stream (200) that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture (202₁-202₄) comprises the remaining intra-coded macroblocks of the intra-coded picture;

upon receiving a tune-in request, tuning into the encoded video stream (200) and the tune-in stream (202);

upon receiving a tune-in picture (202₂) in the tune-in stream (202), splicing the tune-in picture (202₂) and the encoded video stream (200).

8. The method of claim 7, further comprising decoding the spliced stream.

9. The method of claim 7 or 8, wherein the encoded video stream (200) and the tune-in stream (202) are associated with one of a channel of a multi-channel transmission system, wherein the tune-in request indicates a change from a current channel of the multi-channel transmission system to a new channel of the multi-channel transmission system, or

with a stream, wherein the tune-in request initiates an initial tuning into the stream, or

a stream, which is obtained on demand of a user, wherein the tune-in request initiates an initial tuning-in to the stream.

10. The method of one of claims 7 to 9, wherein the encoded video stream (200) and the tune-in stream (202) are provided to a receiver directly or via a server.

11. A decoder for receiving encoded data and providing decoded output data, the decoder comprising:

an input for receiving an encoded video stream (200) of a plurality of intra-coded pictures and a plurality of inter-coded pictures, each picture comprising a plurality of macroblocks, the encoded video stream (200) comprising a plurality of frames, wherein the plurality of macroblocks of an intra-coded picture are spread among a plurality of the frames, and a tune-in stream (202) comprising a plurality of tune-in pictures (202₁-202₄), wherein a tune-in picture (202₁-202₄) is provided for a frame of the encoded video stream (200) that comprises an intra-coded macroblock of an intra-coded picture, wherein the tune-in picture (202₁-202₄) comprises the remaining intra-coded macroblocks of the intra-coded picture; and

a decoding portion (404) coupled to the input and configured to tune into the encoded video stream (200) and the tune-in stream (202) upon receiving a tune-in request, upon receiving a tune-in picture (202₂) in the tune-in stream (202), to splice the tune-in picture (202₂) and the encoded video stream (200), and to decode the spliced stream.

12. A computer readable medium for storing instructions which, when being executed by a computer, carry out a method of one of the claims 1 to 5 and/or a method of one of claims 7 to 10.