WO2016093752A1 - Stream access for adaptive streaming of video

Info

Publication number: WO2016093752A1
Application number: PCT/SE2014/051479
Authority: WO
Inventors: Jonatan Samuelsson; Martin Pettersson; Usman HAKEEM
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2014-12-10
Filing date: 2014-12-10
Publication date: 2016-06-16

Abstract

There are provided mechanisms for creating multiple representations of a video sequence. The method comprises creating (S1) a first representation of the video sequence with a periodic access point structure, wherein the access point serves as a random access point or facilitates switching between different representations of the video sequence. The method further comprises creating (S2) a second representation of the video sequence with a sparser access point occurrence compared to the access point occurrence in the first representation and with an aperiodic or periodic access point structure, wherein the second representation has a higher quality and/or higher bitrate than the first representation. The method also comprises (S3) providing information on positions and/or time instances of switching points for the created representations, wherein the switching points facilitate switching from the first representation to the second representation.

Description

STREAM ACCESS FOR ADAPTIVE STREAMING OF VIDEO

TECHNICAL FIELD

The embodiments generally relate to adaptive streaming of video, and in particular to enabling switching to a video sequence by creating multiple representations of a video sequence and to accessing a video sequence at a position that is not a start of the video sequence.

BACKGROUND

Video streaming applications and services of today, e.g. Netflix and Hulu, are predominantly based on adaptive streaming technology in order to handle a diversity of devices and to be able to adapt to varying network conditions. Adaptive streaming provides the best possible quality of experience to the consumers as the media server provides options for adapting to any changes in each user's network and playback conditions. An example of adaptive streaming schemes is Dynamic Adaptive Streaming over HTTP (DASH, or MPEG-DASH), which is the first adaptive bitrate HTTP-based streaming solution that is an international standard. Other examples of adaptive streaming schemes include HTTP Live Streaming (HLS), Smooth Streaming and HTTP Dynamic Streaming (HDS).

With adaptive streaming, content (e.g. a video sequence, audio etc.) is encoded in several different versions (representations), with different bitrate and quality levels, and each of these representations is split up into segments of for example 10 seconds. A client that receives the encoded video automatically selects from the available representations the next segment to download and play back based on current network conditions. Typically, the client selects the segment with the highest bitrate possible that can be downloaded in time for playback without causing stalls or rebuffering events in the playback.

In order to provide the functionality of switching between different representations and to be able to navigate (jump, random access) in the video, each encoded representation in the state of the art contains Stream Access Points. These usually consist of intra coded pictures that facilitate switching, with the property that no picture that follows the intra picture in both decoding order and output order depends on anything that precedes the intra picture in output order. Intra pictures are included periodically, at least every 10th second, to provide a reasonable interval for representation switching and random access. For fixed fiber-to-the-home internet connections and for Long Term Evolution (LTE) connections with good coverage it is fair to assume that the bandwidth available for the video streaming will most of the time be sufficient to consume the highest quality level suitable for the device on which the video is played. Only during shorter periods of time will the client switch down to lower resolutions in order to cope with temporarily reduced bandwidth, after which receiving the highest quality segments will be resumed.

SUMMARY

One major problem with periodically inserted intra pictures is a bitrate increase. Namely, intra pictures do not use any temporal prediction (contrary to inter pictures, such as predictive (P) and bi-predictive (B) pictures) and do therefore require substantially higher bitrate than inter pictures to achieve the same quality. It is not uncommon that an intra picture requires 10 times the number of bits of an inter picture. This is not a desirable feature, especially for the highest quality representation that is used most of the time.

Thus, in order to reduce the average bitrate consumption and the capacity demands on the backbone network, it is of utmost importance that the highest quality representation is compressed with high compression efficiency. It is a general objective to provide high compression efficiency for the highest quality representation. This and other objectives are met by embodiments as disclosed herein.

A first aspect of the embodiments defines a method of creating multiple representations of a video sequence. The method comprises creating a first representation of the video sequence with a periodic access point structure, wherein the access point serves as a random access point or facilitates switching between different representations of the video sequence. The method comprises creating a second representation of the video sequence with a sparser access point occurrence compared to the access point occurrence in the first representation. The second representation has an aperiodic or periodic access point structure. The second representation has a higher quality and/or higher bitrate than the first representation. The method comprises providing information on positions and/or time instances of switching points for the created representations, wherein the switching points facilitate switching from the first representation to the second representation.

A second aspect of the embodiments defines an encoder for creating multiple representations of a video sequence. The encoder comprises processing means operative to create a first representation of the video sequence with a periodic access point structure. The access point serves as a random access point or facilitates switching between different representations of the video sequence. The encoder comprises processing means operative to create a second representation of the video sequence with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation. The second representation has a higher quality and/or higher bitrate than the first representation. The encoder comprises processing means operative to provide information on the positions and/or time instances of switching points for the created representations, wherein the switching points facilitate switching from the first representation to the second representation.

A third aspect of the embodiments defines a computer program for creating multiple representations of a video sequence, the computer program comprising code means which, when run on a computer causes the computer to create a first representation of the video sequence with a periodic access point structure, wherein the access point serves as a random access point or facilitates switching between different representations of the video sequence. The computer program comprises code means which, when run on a computer causes the computer to create a second representation of the video sequence with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation, wherein the second representation has a higher quality and/or higher bitrate than the first representation. The computer program comprises code means which, when run on a computer causes the computer to provide information on the positions and/or time instances of switching points for the created representations, wherein the switching points facilitate switching from the first representation to the second representation.

A fourth aspect of the embodiments defines a computer program product comprising computer readable means and a computer program, according to the third aspect, stored on the computer readable means.

A fifth aspect of the embodiments defines a method for accessing a video sequence at a position n that is not a start of the video sequence. The video sequence comprises at least a first representation and a second representation. The first representation has a periodic access point structure and the second representation has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation. The access point serves as a random access point or facilitates switching between different representations of the video sequence. The second representation has a higher quality and/or higher bitrate than the first representation. The method comprises determining if the second representation has an access point at position n. The method comprises accessing the second representation at position n when the second representation has an access point at position n and accessing the first representation at position n and switching from the first representation to the second representation at a position following position n when the second representation does not have an access point at position n.

A sixth aspect of the embodiments defines an arrangement for accessing a video sequence in adaptive streaming at a position n that is not the start of the video sequence. The video sequence comprises at least a first representation and a second representation, wherein the first representation has a periodic access point structure and the second representation has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation. The access point serves as a random access point or facilitates switching between different representations of the video sequence. The second representation has a higher quality and/or higher bitrate than the first representation. The arrangement comprises processing means operative to determine if the second representation has an access point at position n. The arrangement comprises processing means operative to access the second representation at position n when the second representation has an access point at position n and to access the first representation at position n and to switch from the first representation to the second representation at a position following position n when the second representation does not have an access point at position n.
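As a non-limiting illustration, the access decision of the fifth and sixth aspects can be sketched as follows. The function name `choose_access`, the use of picture numbers as positions, and the representation of access points and switching points as plain Python collections are assumptions made for this sketch only; they are not part of the embodiments. The sketch also assumes that the first representation has an access point at position n.

```python
def choose_access(n, second_access_points, switching_points):
    """Return the (representation, position) steps for accessing position n.

    second_access_points: set of positions where the second representation
        has an access point (hypothetical input format).
    switching_points: sorted positions at which switching from the first to
        the second representation is possible (hypothetical input format).
    """
    # Case 1: the second representation can be accessed directly at n.
    if n in second_access_points:
        return [("second", n)]

    # Case 2: access the first representation at n, then switch to the
    # second representation at the earliest switching point after n.
    later = [p for p in switching_points if p > n]
    if not later:
        # No later switching point is signalled: stay on the first representation.
        return [("first", n)]
    return [("first", n), ("second", later[0])]
```

With access points of the second representation at pictures 1, 7 and 15, a request for position 5 would under these assumptions start in the first representation and switch at picture 7, whereas a request for position 7 would start in the second representation directly.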

A seventh aspect of the embodiments defines a computer program for accessing a video sequence at a position n that is not the start of the video sequence. The video sequence comprises at least a first representation and a second representation, wherein the first representation has a periodic access point structure and the second representation has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation. The access point serves as a random access point or facilitates switching between different representations of the video sequence. The second representation has a higher quality and/or higher bitrate than the first representation. The computer program comprises code means which, when run on a computer causes the computer to determine if the second representation has an access point at position n. The computer program comprises code means which, when run on a computer causes the computer to access the second representation at position n when the second representation has an access point at position n and to access the first representation at position n and switch from the first representation to the second representation at a position following position n when the second representation does not have an access point at position n.

An eighth aspect of the embodiments defines a computer program product comprising computer readable means and a computer program, according to the seventh aspect, stored on the computer readable means.

Advantageously, this provides high compression efficiency for the highest quality representation.

Further advantageously, it is possible to encode the highest quality representation without periodic intra pictures.

Further advantageously, decreased bandwidth for the highest quality representation results in decreased average bitrate consumption.

It is to be noted that any feature of the first, second, third and fourth aspects may be applied to any other aspect, whenever appropriate. Likewise, any advantage of the first aspect may equally apply to the second, third and fourth aspect respectively, and vice versa. Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims and from the drawings.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the element, apparatus, component, means, step, etc." are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:

Fig. 1 illustrates a streaming scenario between an adaptive streaming server and an adaptive streaming client.

Figs. 2 and 4 illustrate a flowchart of a method of creating multiple representations of a video sequence according to embodiments of the present invention.

Figs. 3 and 10 illustrate an example of the first and the second representation according to embodiments of the present invention.

Figs. 5, 6, 11 and 12 illustrate examples of the first, second and third representation according to embodiments of the present invention.

Fig. 7 illustrates how different parts of a picture are encoded as intra or inter blocks depending on whether these parts use reference from a previous picture or not.

Fig. 8 illustrates a possible coding dependency between representations used for switching.

Fig. 9 illustrates a flowchart of a method of accessing a video sequence at a position n that is not a start of the video sequence according to embodiments of the present invention.

Fig. 13 illustrates an example of the second and the third representation according to embodiments of the present invention.

Figs. 14 and 16 are schematic block diagrams illustrating functional units of an encoder for creating multiple representations of a video sequence according to an embodiment.

Fig. 15 is a schematic block diagram illustrating a computer comprising a computer program product with a computer program for creating multiple representations of a video according to an embodiment.

Figs. 17 and 19 are schematic block diagrams illustrating functional units of an arrangement for accessing a video sequence at a position X that is not a start of the video sequence according to an embodiment.

Fig. 18 is a schematic block diagram illustrating a computer comprising a computer program product with a computer program for accessing a video sequence at a position X that is not a start of the video sequence according to an embodiment.

DETAILED DESCRIPTION

The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments of the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the art to make and use the invention. Throughout the drawings, the same reference numbers are used for similar or corresponding elements.

Throughout the description, the terms "video" and "video sequence" are used interchangeably, as are the terms "intra picture" and "I-picture". The present embodiments generally relate to adaptive streaming of video, and in particular to creating multiple representations of a video sequence and to accessing a video sequence at a position that is not a start of the video sequence.

Fig. 1 illustrates a streaming scenario between an adaptive streaming server and an adaptive streaming client. The adaptive streaming server stores N encoded representations of a multimedia content, where N ≥ 2. These representations are encoded with different bitrates and correspond to different quality levels. For example, the first representation is encoded at a bitrate that yields the lowest acceptable quality whereas representation N requires the highest bitrate and provides the best quality. Each of the N representations is split up into M ≥ 1 segments. A segment typically has a length of 10 seconds.

Each segment starts with an access point (in the case of DASH it is called a Stream Access Point (SAP)), which is a point from where it is possible to start decoding a segment without having access to earlier data in the same representation or to any other data (from other representations) and from where it is possible to get correctly decoded pixel values after having decoded a limited amount of data. An access point facilitates switching from one representation to another or simply acts as a random access point. A typical example of an access point is an intra picture. The adaptive streaming server also stores a manifest about the available representations and their segments. In MPEG-DASH this manifest is called a Media Presentation Description (MPD). The MPD is an XML file that describes information such as timing of the content, location of each representation in the network, media characteristics such as media type, existence of various encoded alternatives, resolution, minimum and maximum bit rates (bandwidth) etc. The MPD can be delivered using HTTP, email, thumb drive, broadcast or other transports.
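For illustration only, a client might read the available representations from an MPD along the following lines. The sample manifest is a heavily reduced, hypothetical MPD written for this sketch (real manifests carry considerably more information, such as segment addressing), and the helper name `list_representations` is likewise an assumption. The element names, the `bandwidth` attribute and the namespace follow the MPEG-DASH MPD schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the MPEG-DASH MPD schema.
MPD_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical, minimal manifest used only for this example.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="low" bandwidth="500000" width="640" height="360"/>
      <Representation id="high" bandwidth="4000000" width="1920" height="1080"/>
    </AdaptationSet>
  </Period>
</MPD>"""


def list_representations(mpd_text):
    """Return (id, bandwidth in bits per second) pairs, lowest bandwidth first."""
    root = ET.fromstring(mpd_text)
    reps = []
    for rep in root.iterfind(".//mpd:Representation", MPD_NS):
        reps.append((rep.get("id"), int(rep.get("bandwidth"))))
    return sorted(reps, key=lambda r: r[1])


print(list_representations(SAMPLE_MPD))
```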

In order to play the content, the adaptive streaming client first obtains the manifest file. After parsing it in the manifest parser depicted in Fig. 1, the client chooses an appropriate representation based on its capabilities and user choices and starts streaming of the content. In case of DASH, the segments are fetched using HTTP requests and are parsed in the segment parser prior to being sent to a media player. Based on the current network conditions the adaptive streaming client automatically selects from the available representations the next segment to download and play back. Typically, it selects the segment with the highest possible bitrate that can be downloaded in time for playback without causing stalls or rebuffering events in the playback.
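One simple way to express such a selection rule is sketched below. The throughput estimate, the safety margin of 0.8 and the function name `select_representation` are assumptions introduced for this example; actual clients typically use more elaborate rate adaptation logic, which is outside the scope of this description.

```python
def select_representation(bitrates_bps, throughput_bps, safety=0.8):
    """Pick the highest bitrate that fits within a fraction of the
    estimated throughput; fall back to the lowest bitrate otherwise.

    bitrates_bps: bitrates of the available representations (bits/s).
    throughput_bps: the client's current throughput estimate (bits/s).
    safety: hypothetical margin that leaves headroom against stalls.
    """
    usable = throughput_bps * safety
    candidates = [b for b in bitrates_bps if b <= usable]
    return max(candidates) if candidates else min(bitrates_bps)
```

For example, with representations at 0.5, 2 and 4 Mbit/s and an estimated throughput of 3 Mbit/s, this rule would select the 2 Mbit/s representation for the next segment.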

Fig. 2 is a flow diagram illustrating a method of creating multiple representations of a video sequence 10 according to an embodiment. Step S1 creates a first representation 11 of the video sequence 10 with a periodic access point structure. Thus the first representation 11 is created in the same way as in other adaptive streaming solutions. Access points are usually intra pictures. Another example is an instantaneous decoder refresh (IDR) picture, which is a special type of an intra picture. An IDR picture specifies that no picture after the IDR picture can use any picture before it as a reference.

Step S2 creates a second representation 12 of the video sequence 10 with a sparser access point occurrence compared to the access point occurrence in the first representation 11. The second representation 12 has no constraints on the structure of access points as in case of other adaptive streaming solutions: it can be both aperiodic and periodic. A periodic structure means that a representation has access points placed at regular intervals. For example, every 8th or every 32nd picture may be an access point 15. Any structure that is not periodic is called aperiodic. Access points provide a tradeoff between the bitrate and frequency of random access or switching and their structure may be determined based on the video sequence content.
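The distinction between a periodic and an aperiodic structure can be made concrete with a small check on the access point positions. The function name `is_periodic` and the representation of a structure as a sorted list of picture indices are assumptions of this sketch.

```python
def is_periodic(access_points):
    """Return True if consecutive access points are equally spaced.

    access_points: sorted picture indices of the access points.
    """
    if len(access_points) < 3:
        # With fewer than three access points there is at most one gap,
        # so nothing contradicts a periodic structure.
        return True
    gaps = {b - a for a, b in zip(access_points, access_points[1:])}
    return len(gaps) == 1


print(is_periodic([0, 8, 16, 24]))  # every 8th picture
print(is_periodic([0, 6, 14]))      # unequal gaps
```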

The second representation 12 has a higher quality and/or higher bitrate than the first representation 11. Equivalently, it can be said that the second representation 12 is more bitrate efficient than the first one 11. One way of measuring the quality is with objective quality metrics such as Peak Signal to Noise Ratio (PSNR), Signal to Noise Ratio (SNR), Mean Squared Error (MSE) or Structural Similarity (SSIM). There exist also subjective quality metrics, which deal with how video is perceived by a viewer. The output of subjective evaluation is usually a Mean Opinion Score (MOS) averaged over a number of viewers.
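As an example of an objective metric, MSE and PSNR can be computed from the original and the decoded sample values as sketched below. The sketch assumes 8-bit samples (peak value 255) given as flat sequences of equal length; the function names are chosen for this example.

```python
import math


def mse(ref, test):
    """Mean squared error between two equally long sample sequences."""
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)


def psnr(ref, test, max_value=255):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(ref, test)
    if err == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / err)
```

A higher PSNR indicates that the decoded picture is closer to the original, so a representation with higher quality in this sense yields a higher PSNR for the same content.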

Fig. 3 illustrates a simplified example of the first 11 and second representations 12 according to the embodiment. The first representation 11 has a periodic access point structure: in this case every fourth picture is an access point. The second representation 12 has three access points: the first picture (every representation starts with an access point 15), the seventh picture and the fifteenth picture. In this example the second representation 12 has an aperiodic access point structure, but it may in general be both periodic and aperiodic. It should be noted that in this figure only very few pictures are shown for the sake of readability. In practice, the distance between the random access points would be much larger.

The first 11 and the second 12 representations may be created by encoding a raw video sequence with two different bitrates, quality and/or other constraints such as a number or a frequency of access points. Alternatively, the representations can be created from an existing pre-encoded content by decoding it and re-encoding it to meet some of the above mentioned constraints. The representations may be stored in the adaptive streaming server.

The number and/or positions of access points in the second representation 12 may be determined based on the video sequence 10 content. For example, an access point 15 may be placed at a position in which the compression efficiency loss compared to having an inter picture is small or even negative. A typical example is a position where there is a scene change in the video sequence.
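A content-dependent placement of this kind could, for instance, compare per-picture coding costs. The sketch below assumes that an estimate of the number of bits needed to code each picture as intra and as inter is available (for example from a pre-analysis pass); these inputs, the threshold `max_loss` and the function name are hypothetical and serve only to illustrate the principle.

```python
def place_access_points(intra_cost, inter_cost, max_loss=0.0):
    """Return picture indices selected as access points.

    intra_cost[i], inter_cost[i]: estimated bits for coding picture i as an
        intra picture and as an inter picture (hypothetical inputs).
    max_loss: tolerated relative efficiency loss; 0.0 accepts an access
        point only where intra coding costs no more than inter coding.
    """
    points = [0]  # every representation starts with an access point
    for i in range(1, len(intra_cost)):
        if intra_cost[i] <= inter_cost[i] * (1 + max_loss):
            points.append(i)
    return points


# Picture 2 models a scene change: inter prediction is expensive there.
print(place_access_points([100, 100, 100, 100], [100, 10, 120, 12]))
```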

Access points require substantially more bits than the other pictures. Hence, for the same bitrate, a representation having access points placed sparsely has a better quality than a representation that has more access points since the remaining pictures can be allocated more bits. Therefore, having inter pictures (P or B pictures) instead of access points in the second representation can easily be justified in cases when the clients receive the highest quality representation most of the time, which is the case for fixed fiber-to-the-home internet connections and for LTE connections. On the other hand, having fewer access points may imply having fewer points in a representation where switching or random access is possible. This tradeoff applies to the case when there are two representations; in case there are three representations there are additional possibilities for switching, as we will see below.
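The effect on the remaining pictures can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every access point costs ten times as many bits as an inter picture (the ratio mentioned in the summary as not uncommon) and that the bit budget is otherwise spread evenly; the function and parameter names are chosen for this example.

```python
def inter_bits_per_picture(total_bits, num_pictures, ap_interval, ap_ratio=10.0):
    """Bits available per inter picture under a fixed total budget.

    ap_interval: one access point every ap_interval pictures.
    ap_ratio: assumed cost of an access point relative to an inter picture.
    """
    num_ap = -(-num_pictures // ap_interval)  # ceiling division
    num_inter = num_pictures - num_ap
    # The budget is shared by num_inter inter pictures plus num_ap access
    # points, each access point counting as ap_ratio inter pictures.
    return total_bits / (num_inter + ap_ratio * num_ap)


dense = inter_bits_per_picture(1_000_000, 320, ap_interval=32)
sparse = inter_bits_per_picture(1_000_000, 320, ap_interval=320)
print(round(dense), round(sparse))
```

Under these assumptions, a representation with a single access point in 320 pictures leaves roughly a quarter more bits for each inter picture than one with an access point every 32nd picture.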

To get an estimate of how much bitrate can be saved for the second representation by removing its access points (in this case I-pictures), two different simulations have been conducted, referred to as the anchor simulation and the test simulation. In the anchor simulation, I-pictures are inserted at regular intervals (approximately every 32nd picture). In the test simulation, all I-pictures but the first one are converted to P-pictures for each tested video sequence. The bitrate savings for the test simulation compared to the anchor simulation are reported for two different test sets in Tables 1 and 2. From the tables we can see that the bitrate gain generally increases with an increase of the quantization parameter (QP). We can also see that the average bitrate gain ranges approximately from 10% to 20%.

Table 1 HEVC HM11.0 8b YUV 4:2:0

Table 2 SCC HM14.0 8b YUV 4:4:4

Sequence             Format     Fps  QP22     QP27     QP32     QP37
Basketball_Screen    2560x1440  60   -26.30%  -34.00%  -40.10%  -44.80%
EBURainFruits        1920x1080  50    -8.90%  -12.30%  -14.90%  -17.10%
Kimono               1920x1080  24    -3.80%   -4.20%   -4.40%   -5.90%
MissionControlClip2  2560x1440  60    -5.70%   -7.10%   -8.70%   -9.30%
MissionControlClip3  1920x1080  60    -7.20%   -8.70%  -11.50%  -17.10%
sc_console           1920x1080  60    -4.10%   -4.40%   -5.10%   -5.50%
sc_desktop           1920x1080  60   -32.70%  -31.40%  -29.80%  -28.10%
sc_flyingGraphics    1920x1080  60    -0.60%   -0.80%   -1.40%   -2.10%
sc_map               1280x720   60   -10.10%  -10.70%  -10.30%  -13.00%
sc_programming       1280x720   60    -3.60%   -5.20%   -8.40%  -13.00%
sc_robot             1280x720   30   -13.40%  -21.20%  -27.20%  -31.30%
sc_slideshow         1280x720   20   -16.10%  -18.10%  -20.10%  -19.10%
sc_web_browsing      1280x720   30   -14.20%  -17.00%  -20.40%  -19.70%
Average                              -11.28%  -13.47%  -15.56%  -17.38%
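The Average row of Table 2 is the arithmetic mean of the per-sequence values in each QP column, which can be checked as sketched below. The dictionary simply transcribes the values of Table 2; the variable and function names are chosen for this example.

```python
# Bitrate change in percent per sequence for QP22, QP27, QP32 and QP37,
# transcribed from Table 2.
TABLE_2 = {
    "Basketball_Screen": [-26.30, -34.00, -40.10, -44.80],
    "EBURainFruits": [-8.90, -12.30, -14.90, -17.10],
    "Kimono": [-3.80, -4.20, -4.40, -5.90],
    "MissionControlClip2": [-5.70, -7.10, -8.70, -9.30],
    "MissionControlClip3": [-7.20, -8.70, -11.50, -17.10],
    "sc_console": [-4.10, -4.40, -5.10, -5.50],
    "sc_desktop": [-32.70, -31.40, -29.80, -28.10],
    "sc_flyingGraphics": [-0.60, -0.80, -1.40, -2.10],
    "sc_map": [-10.10, -10.70, -10.30, -13.00],
    "sc_programming": [-3.60, -5.20, -8.40, -13.00],
    "sc_robot": [-13.40, -21.20, -27.20, -31.30],
    "sc_slideshow": [-16.10, -18.10, -20.10, -19.10],
    "sc_web_browsing": [-14.20, -17.00, -20.40, -19.70],
}


def column_averages(rows):
    """Mean of each QP column over all sequences, rounded to two decimals."""
    columns = zip(*rows.values())
    return [round(sum(c) / len(c), 2) for c in columns]


print(column_averages(TABLE_2))
```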

Step S4 provides information on positions and/or time instances of switching points 16 for the created representations, wherein the switching points 16 facilitate switching from the first representation 11 to the second representation 12. A switching point can be an access point. However, a switching point can be any other picture given some constraints, as we will see below. This information will be used by a client when the network conditions allow switching from the first representation 11 to the second representation 12. The switching point can, for example, be an access point in the second representation 12, in which case it is possible to switch and start decoding pictures in the second representation 12 without having access to earlier pictures in the same representation. Step S4 may further provide information about the bitrate and the quality of the representations. This provided information is stored at the adaptive streaming server and the adaptive streaming client. By reading it, the client knows what streams are available and at which quality, what switching points exist and when it can request switching.
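On the client side, the provided switching point information could be consulted as sketched below once the network conditions allow an upward switch. The format in which the positions are signalled is not prescribed here; the sorted list of picture positions and the function name `next_switching_point` are assumptions of this example.

```python
import bisect


def next_switching_point(switching_points, current):
    """Return the first switching point at or after `current`, or None.

    switching_points: sorted positions at which switching from the first
        to the second representation is possible (hypothetical format).
    current: the position the client is about to request.
    """
    i = bisect.bisect_left(switching_points, current)
    return switching_points[i] if i < len(switching_points) else None
```

With switching points signalled at pictures 1, 7 and 15, a client currently at picture 8 of the first representation would under these assumptions continue there and request the second representation from picture 15 onwards.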

Fig. 4 is a flow diagram illustrating a method of creating multiple representations of a video sequence 10 according to another embodiment. Steps S1 and S2 of creating the first 11 and the second 12 representation respectively are the same as in the first embodiment. This embodiment further comprises step S3 that creates a third representation 13 of the video sequence 10. The third representation 13 has a periodic access point structure, where the number of access points is larger than the number of access points in the second representation 12. The third representation 13 facilitates switching from the first representation 11 to the second representation 12 by using at least one decoded picture from the third representation 13 as a reference when decoding pictures from the second representation 12.

Fig. 5 illustrates only a limited number of pictures from the first 11, second 12 and third 13 representations. The first 11 and the third 13 representations have a periodic access point structure: in the first representation 11 every fourth picture is an access point whereas in the third one 13 every eighth picture is an access point (in practice, the distance between the access points would be much larger). On the other hand, only the first and the fifteenth pictures in the second representation 12 are access points.

Similar to the first embodiment, step S4 provides information on positions and/or time instances of switching points 16 for the created representations 11-13. This information will be used by a client when the network conditions allow switching from the first representation 11 to the second representation 12. Same as in the first embodiment, the switching point can be an access point in the second representation 12, allowing it to switch and start decoding pictures in the second representation 12 without having access to earlier pictures in the same representation. However, in this embodiment, thanks to having the third representation 13, a switching point may also be an inter picture in the second representation 12, as will be explained below. In the following examples intra pictures are used as access points.

Fig. 6 illustrates an example where the second representation 12 consists only of P-pictures, except for the first one that is an intra picture. The third representation 13 contains intra pictures for each picture from the second representation 12. In this example, the decoded sample values of the intra pictures in the third representation 13 are exactly the same as in the corresponding (collocated) pictures in the second representation 12, in order to allow random access or switching at any picture in the second representation 12. This can be realized by encoding the pictures in the third representation 13 in the lossless mode by using the corresponding (reconstructed) pictures from the second representation 12 as an input.

In order to reduce the bitrate of the third representation 13 due to lossless encoding of its pictures described above, it is possible to encode in the lossless mode only the portions of the pictures in the third representation 13 that will be used as reference by future pictures. This way it will be guaranteed that the decoding result will be the same for all pictures following the picture at which the random access is performed. If the picture from the third representation 13 is not output (e.g. realized through setting its PicOutputFlag equal to 0) then the encoder does not need to spend bits on trying to make the samples not used for prediction look good. Instead those sample values could be ignored or set to a fixed value (such as 127) in the encoding process.

The number of blocks that are encoded using prediction from earlier pictures (inter prediction) highly depends on the content of the video. At a scene cut, very few blocks will use prediction from a previous picture since that picture contains different content and finding blocks that provide a good reference will typically be difficult. Fig. 7 illustrates an example of a picture that represents a scene cut, showing which portions of the picture are encoded as intra (these are the ones that do not use a reference to any previous picture) and which are encoded in inter mode (these are the ones that do use a reference to a previous picture). The pictures directly preceding the scene cut would be very cheap since only very few sample values will be used by future pictures. A client that wants to minimize the overhead of using the third representation 13 picture can perform the random access operation at a point where the third representation 13 picture does not consume many bits, such as right before the scene cut.

The third representation 13 may not have the same frame rate as the first 11 and second 12 representations. For example, the third representation 13 might have a frame rate of just one picture per second in order to reduce the storage size required for that representation. Or, in case the second representation 12 consists of pictures that are not used as reference (for example B-pictures), it is not useful to have the pictures corresponding to these non-reference pictures in the third representation 13 as they will never be used.

The third representation 13 may consist of intra pictures that are placed at positions of every k-th reference picture in the second representation 12, where k > 1.
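The selection of positions for the third representation 13 can be sketched as follows. The sketch assumes, as in the example above, that B-pictures are not used as reference, and that counting starts at the first reference picture; the picture-type string and the function name are introduced for this illustration only.

```python
def third_representation_positions(picture_types, k):
    """Return positions of every k-th reference picture.

    picture_types: string of picture types of the second representation,
        e.g. "IBPBP"; B-pictures are assumed here to be non-reference.
    k: spacing in reference pictures, k > 1.
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    # Non-reference pictures need no counterpart in the third representation.
    reference_positions = [i for i, t in enumerate(picture_types) if t != "B"]
    return reference_positions[::k]


print(third_representation_positions("IBPBPBPBP", 2))
```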

In some scenarios it might be desirable to use an already existing, pre-encoded representation as one of the representations. For example, one can use one such pre-encoded representation, with a periodic intra picture structure, as the third representation 13. The second representation 12 can then be created from the third representation 13 by encoding the pictures at the positions of intra pictures as inter pictures (by using lossless coding or by trying to minimize the visual artefacts) and copying the rest of the encoded inter pictures as they are.

In adaptive streaming scenarios there can clearly be more than two quality levels that a client may request, and thus the number of representations can also be higher than two. The solutions presented so far could of course be applied to any representation, not only to the highest quality one. In such cases one could have representations facilitating switching between two consecutive quality level representations, one representation for each pair of consecutive quality levels. These representations may be independently created. However, to save storage space, the pictures in a representation used for switching between representations Y-1 and Y may reference pictures in a representation used for switching between representations Y-2 and Y-1, where 3 < Y < L, where L is the number of quality levels. Alternatively, one may create representations that are used for switching between two non-consecutive quality level representations, for example representations Y-2 and Y.

Fig. 8 depicts one such scenario of dependencies between representations used for switching. This may for instance be achieved utilizing a scalable coding scheme such as SHVC (scalable extension of HEVC) or SVC (scalable extension of H.264/AVC). When switching to a higher quality representation Y, before decoding each required picture in the representation used for switching between representations Y-1 and Y, the referenced pictures in the representation used for switching between representations Y-2 and Y-1 must first be decoded.
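The decoding order implied by such dependencies can be sketched as follows. The sketch assumes, beyond what is stated above, that the dependency chain continues down to the lowest quality level, and it identifies each switching representation by the pair of quality levels it connects. The function name and the numbering of levels from 1 are hypothetical.

```python
from typing import List, Tuple

def switching_decode_order(y: int, lowest: int = 1) -> List[Tuple[int, int]]:
    """Switching representations to decode, base first, when switching up to level y.

    Each entry (a, b) denotes the representation used for switching between
    quality levels a and b. Assumes each one references the one below it.
    """
    if y <= lowest:
        raise ValueError("y must be above the lowest quality level")
    return [(level - 1, level) for level in range(lowest + 1, y + 1)]

# Switching up to hypothetical quality level 4 in a chain starting at level 1.
order = switching_decode_order(4)
```

The referenced picture in each entry has to be decoded before the collocated picture in the next entry, which matches the requirement stated above for levels Y-2, Y-1 and Y.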

So far it has been explained how multiple representations of a video sequence may be created so that switching between the representations or random access within the same representation is facilitated. In what follows it will be described how the switching or random access may be performed.

Fig. 9 is a flow diagram illustrating a method for a client for accessing a video sequence 10 at a position n 20 that is not a start of the video sequence 10. An example of a client is an MPEG-DASH client described above (or any adaptive streaming client). Here, accessing a video sequence means either moving to the position n within the same representation (e.g. random access, fast forward) or switching between two different representations. The video sequence 10 comprises at least a first representation 11 and a second representation 12. The first representation 11 has a periodic access point structure and the second representation 12 has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation 11. The second representation 12 has a higher quality and/or higher bitrate than the first representation 11. Here the access point serves as a random access point or facilitates switching between different representations of the video sequence 10, similarly to the method of creating multiple representations described above. The method comprises a step S5 of determining if the second representation 12 has an access point 15 at position n 20. This can be done by checking the information (manifest file) 14 that is provided when the representations are created and that is stored both at the adaptive streaming server and the adaptive streaming client. Alternatively, a decoder may parse the second representation's bitstream in order to determine if there is an access point 15 at position n. If the second representation 12 has an access point 15 at position n 20, a step S6 of accessing the second representation 12 is performed. If the second representation 12 does not have an access point 15 at position n 20, the method accesses the first representation at position n 20 (step S7). In step S8, switching from the first representation 11 to the second representation 12 is performed at a position following position n 20.
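The branching of steps S5 to S8 can be sketched as below, assuming the access point positions of the second representation 12 have already been read from the information 14 into a set. The function name and the string labels are hypothetical and serve only to illustrate the control flow.

```python
from typing import List, Set

def access_steps(n: int, second_rep_access_points: Set[int]) -> List[str]:
    """Return the steps of the method of Fig. 9 that are taken for position n."""
    steps = ["S5"]  # determine whether the second representation has an access point at n
    if n in second_rep_access_points:
        steps.append("S6")  # access the second representation directly at n
    else:
        steps.append("S7")  # access the first representation at n
        steps.append("S8")  # switch to the second representation at a later position
    return steps

# Hypothetical access point positions of the second representation.
access_points = {0, 8, 40}
direct = access_steps(8, access_points)
via_first = access_steps(9, access_points)
```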

Switching from the first 11 to the second 12 representation performed in step S8 may be performed at a position of an access point 15 in the second representation 12. This is depicted in Fig. 10 where access points are intra pictures. When the client would like to access picture n in the second representation 12, it will have to find the closest intra picture in the second representation 12 following picture n. In this case, this is picture n+4. Thus pictures n, n+1, n+2 and n+3 are fetched from the first representation 11, and n+4 onwards are fetched from the second representation 12.
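A minimal sketch of the resulting fetch plan is given below, assuming the intra picture positions of the second representation 12 are known and that pictures are identified by integer positions. The function and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple

def fetch_plan(n: int, last: int, second_rep_intra: List[int]) -> List[Tuple[int, str]]:
    """For each picture from n to last, name the representation it is fetched from."""
    # Closest intra picture in the second representation at or after n.
    # A hit at n itself corresponds to accessing the second representation directly.
    candidates = [p for p in second_rep_intra if p >= n]
    switch_at: Optional[int] = min(candidates) if candidates else None
    plan = []
    for pos in range(n, last + 1):
        use_second = switch_at is not None and pos >= switch_at
        plan.append((pos, "second" if use_second else "first"))
    return plan

# Mirrors the situation of Fig. 10 with n = 10 and the next intra picture at n+4 = 14.
plan = fetch_plan(10, 16, [0, 14, 30])
```

If no later intra picture exists in the range considered, the sketch keeps fetching from the first representation 11.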

Switching from the first 11 to the second 12 representation performed in step S8 may be performed at a position that does not correspond to an access point 15. However, this requires having the third representation 13 of the video sequence 10. One such example is depicted in Fig. 6. The second representation 12 does not have any intra pictures in the part of the representation shown. Instead there is the third representation 13 with intra pictures for which the decoded sample values are exactly the same as those of the corresponding pictures in the second representation 12. When performing random access or when switching up to the second representation 12 from the first representation 11 at e.g. picture n+3, the picture n+3 from the third representation 13 is decoded and then pictures from the second representation 12 are decoded using the decoded picture n+3 from the third representation 13 for reference.

The solution can also be realized in cases where more than one picture from the third representation 13 has to be decoded. Modern video codecs (such as HEVC and H.264/AVC) allow the use of multiple reference pictures. Consider the case shown in Fig. 11. In order to switch to the second representation 12 at picture n+5, both n+3 and n+4 from the third representation 13 must be decoded. Modern codecs may also have pictures that reference pictures that come later in presentation time (but precede them in decoding order). In this case, in order to switch to the representation, the referenced picture of the third representation 13 that is subsequent in presentation order must be decoded. Consider the example in Fig. 12, where the second representation 12 is coded using an IBBP structure where only the first picture is coded as an I-picture. The decoding order for the second representation 12 pictures in this example is: P_n+1, B_n, P_n+4, B_n+2, B_n+3, P_n+7, B_n+5, P_n+6. In order to switch to the second representation at picture n+3, both pictures n+1 and n+4 from the third representation 13 need to be decoded.

Another example of switching at an arbitrary position is depicted in Fig. 13. It corresponds to the case explained above where the already existing pre-encoded representation (with a periodic intra picture) is assigned as the third representation 13 and where the second representation 12 is encoded from the third representation 13 by encoding the pictures at the position of the intra pictures as inter pictures and copying the rest of the encoded inter pictures as they are. Since the inter pictures corresponding to the positions of intra pictures in the third representation 13 are obtained either by using a lossless mode or by trying to minimize the visual artefacts, and since the remaining inter pictures are simply copied, it is possible to make a switch at for example picture n+3 by decoding pictures n, n+1 and n+2 from the third representation 13, after which the pictures are decoded from the second representation 12.

Fig. 14 is a schematic block diagram of an encoder 100 for creating multiple representations of a video sequence 10 according to an embodiment. The encoder 100 may be a part of the adaptive streaming server (for example, a MPEG-DASH server described above). The encoder 100 comprises a first representation creator 110, configured to create a first representation 11 of the video sequence 10. The first representation 11 is created with a periodic access point structure, wherein the access point serves as a random access point or facilitates switching between different representations of the video sequence. The encoder 100 further comprises a second representation creator 120, configured to create a second representation 12 of the video sequence 10. The second representation 12 is created with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation 11. The second representation 12 is created with a higher quality and/or higher bitrate than the first representation 11. The encoder 100 further optionally comprises a third representation creator 130, configured to create a third representation 13 of the video sequence 10. The third representation 13 is created with a periodic access point structure, wherein the number of access points in the third representation 13 is larger than the number of access points in the second representation 12. The encoder 100 further comprises an information provider 140 configured to provide information 14 on the positions and/or time instances of switching points 16 for the created representations, wherein the switching points 16 facilitate switching from the first representation 11 to the second representation 12.

The first representation creator 110 may utilize one of the existing video coding algorithms such as H.264/AVC or HEVC to create the first representation 11 of the video sequence where it is possible to specify the periodicity (frequency) of access points. For example, one may specify the Group of Pictures (GOP) size prior to encoding. It is also possible to specify that a GOP starts with an intra picture in case an access point 15 is an intra picture.

Similarly, the second representation creator 120 and the third representation creator 130 may utilize one of the existing video coding algorithms such as H.264/AVC or HEVC to create the second representation 12 of the video sequence. Both aperiodic and periodic access point structures are allowed with these algorithms. The quality of a representation may be controlled with a quantization parameter (QP): the lower the QP the higher the quality of a representation. Thus one can choose a QP for the second representation 12 that is lower than a QP for the first representation 11. Specifying a bitrate directly is also possible. Specifying a larger number of access points 15 in the third representation 13 than in the second representation 12 is possible by for example choosing a smaller GOP size.
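The relations between the encoder settings described above can be summarized as a small consistency check. This is a sketch only: it assumes that quality is controlled through the QP and that the GOP size determines the access point period, and the class, field and function names are hypothetical rather than parameters of any particular encoder.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RepresentationConfig:
    name: str
    qp: int        # quantization parameter: lower QP means higher quality
    gop_size: int  # pictures per GOP: smaller GOP means more access points

def check_configs(first: RepresentationConfig,
                  second: RepresentationConfig,
                  third: Optional[RepresentationConfig] = None) -> List[str]:
    """Return a list of violated relations; an empty list means the settings are consistent."""
    problems = []
    if not second.qp < first.qp:
        problems.append("second representation should use a lower QP than the first")
    if not second.gop_size > first.gop_size:
        problems.append("second representation should have sparser access points than the first")
    if third is not None and not third.gop_size < second.gop_size:
        problems.append("third representation should have more access points than the second")
    return problems

# Hypothetical settings for the three representations.
first = RepresentationConfig("first", qp=34, gop_size=8)
second = RepresentationConfig("second", qp=24, gop_size=64)
third = RepresentationConfig("third", qp=24, gop_size=8)
issues = check_configs(first, second, third)
```

When a bitrate is specified directly instead of a QP, the first check would compare bitrates instead.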

The information provider 140 may utilize an output from the first 110, second 120 and third 130 representation creators respectively to get information on the positions and/or time instances of switching points 16 for the created representations. For example, H.264/AVC and HEVC usually provide log files after encoding where the information about the type of pictures, frequency of intra pictures or similar can be found. The information provider can then create a separate file, for example an XML file where all the important information is stored.
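As an illustration of such a separate file, the sketch below writes switching point positions into a small XML document using Python's standard library. The element and attribute names are invented for this example and do not follow the MPEG-DASH manifest schema or any other standardized format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def build_switching_info(points: Dict[str, List[int]]) -> str:
    """Serialize switching point positions per representation as an XML string."""
    root = ET.Element("SwitchingInfo")  # hypothetical root element
    for rep_id, positions in points.items():
        rep = ET.SubElement(root, "Representation", id=rep_id)
        for pos in positions:
            ET.SubElement(rep, "SwitchingPoint", position=str(pos))
    return ET.tostring(root, encoding="unicode")

# Hypothetical positions, e.g. as extracted from encoder log files.
xml_text = build_switching_info({"first": [0, 8, 16], "second": [0, 32]})
```

A client could parse the same file to carry out the check of whether the second representation 12 has an access point at a given position.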

The encoder 100 can be implemented in hardware, in software or a combination of hardware and software. The encoder 100 can be implemented in user equipment, such as a mobile telephone, tablet, desktop, netbook, multimedia player, video streaming server, set-top box or computer. The encoder 100 may also be implemented in a network device in the form of or connected to a network node, such as a radio base station, in a communication network or system. Although the respective units disclosed in conjunction with Fig. 14 have been disclosed as physically separate units in the device, where all may be special purpose circuits, such as ASICs (Application Specific Integrated Circuits), alternative embodiments of the device are possible where some or all of the units are implemented as computer program modules running on a general purpose processor. Such an embodiment is disclosed in Fig. 15.

Fig. 15 schematically illustrates an embodiment of a computer 200 having a processing unit 210 such as a DSP (Digital Signal Processor) or CPU (Central Processing Unit). The processing unit 210 can be a single unit or a plurality of units for performing different steps of the method described herein. The computer also comprises an input/output (I/O) unit 220 for receiving a video sequence. The I/O unit 220 has been illustrated as a single unit in Fig. 15 but can likewise be in the form of a separate input unit and a separate output unit.

Furthermore, the computer comprises at least one computer program product 230 in the form of a non-volatile memory, for instance an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory or a disk drive. The computer program product 230 comprises a computer program 240, which comprises code means which when run on the computer 200, such as by the processing unit 210, causes the computer 200 to perform the steps of the method described in the foregoing in connection with Figs. 2 and 4.

According to a further aspect an encoder 100 for creating multiple representations of a video sequence 10 is provided as illustrated in Fig. 16. The processing means is exemplified by a CPU (Central Processing Unit) 210. The processing means is operative to perform the steps of the method described in the foregoing in connection with Figs. 2 and 4. That implies that the processing means 210 is operative to create (S1) a first representation 11 of the video sequence with a periodic access point structure, wherein the access point serves as a random access point or facilitates switching between different representations of the video sequence 10. The processing means 210 is further operative to create (S2) a second representation 12 of the video sequence 10 with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation, wherein the second representation has a higher quality and/or higher bitrate than the first representation. The processing means 210 may optionally be operative to create (S3) a third representation 13 of the video sequence with a periodic access point structure, wherein the number of access points in the third representation 13 is larger than the number of access points in the second representation 12. The processing means 210 is further operative to provide (S4) information on the positions and/or time instances of switching points 16 for the created representations, wherein the switching points 16 facilitate switching from the first representation 11 to the second representation 12.

Fig. 17 is a schematic block diagram of an arrangement 300 for accessing a video sequence 10 at a position n 20 that is not the start of the video sequence according to an embodiment, wherein the video sequence 10 comprises at least a first representation 11 and a second representation 12. The first representation 11 has a periodic access point structure and the second representation 12 has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation 11, wherein the access point 15 serves as a random access point or facilitates switching between different representations of the video sequence 10 and wherein the second representation 12 has a higher quality and/or higher bitrate than the first representation 11. The arrangement 300 comprises a determiner 310, configured to determine if the second representation 12 has an access point 15 at position n 20. The arrangement 300 further comprises an accessing module 320, configured to access the second representation 12 at position n 20. The arrangement 300 further comprises an accessing module 330, configured to access the first representation 11 at position n 20. The arrangement 300 further comprises a switcher 340 configured to switch from the first representation 11 to the second representation 12 at a position following position n 20.

The determiner 310 may utilize the information 14 provided by the information provider 140 to determine whether the second representation 12 has an access point at position n 20.

The accessing modules 320 and 330 respectively may utilize the information 14 on the positions and/or time instances of the access points in the second 12 and the first 11 representation, provided by the information provider 140, to access the second 12 and the first 11 representations respectively at position n 20.

The switcher 340 may utilize the information 14 on the positions and/or time instances of the access points in the second 12 and the first 11 representation, provided by the information provider 140, to access the first representation 11 from the position n 20 to a position following position n 20 and to access the second representation 12 at the position following position n 20.

The arrangement 300 can be implemented in hardware, in software or a combination of hardware and software. The arrangement 300 can be implemented in user equipment, such as a mobile telephone, tablet, desktop, netbook, multimedia player, video streaming server, set-top box or computer. The arrangement 300 may also be implemented in a network device in the form of or connected to a network node, such as a radio base station, in a communication network or system. Although the respective units disclosed in conjunction with Fig. 17 have been disclosed as physically separate units in the device, where all may be special purpose circuits, such as ASICs (Application Specific Integrated Circuits), alternative embodiments of the device are possible where some or all of the units are implemented as computer program modules running on a general purpose processor. Such an embodiment is disclosed in Fig. 18.

Fig. 18 schematically illustrates an embodiment of a computer 400 having a processing unit 410 such as a DSP (Digital Signal Processor) or CPU (Central Processing Unit). The processing unit 410 can be a single unit or a plurality of units for performing different steps of the method described herein. The computer also comprises an input/output (I/O) unit 420 for receiving at least the first 11 and the second representation 12 of the video sequence and the information on the positions and/or time instances of the switching points 16. The I/O unit 420 has been illustrated as a single unit in Fig. 18 but can likewise be in the form of a separate input unit and a separate output unit.

Furthermore, the computer comprises at least one computer program product 430 in the form of a non-volatile memory, for instance an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory or a disk drive. The computer program product 430 comprises a computer program 440, which comprises code means which when run on the computer 400, such as by the processing unit 410, causes the computer 400 to perform the steps of the method described in the foregoing in connection with Fig. 9.

According to a further aspect an arrangement 300 for accessing a video sequence 10 at a position n 20 that is not the start of the video sequence 10 is provided as illustrated in Fig. 19. The processing means is exemplified by a CPU (Central Processing Unit) 410. The processing means is operative to perform the steps of the method described in the foregoing in connection with Fig. 9. That implies that the processing means 410 is operative to determine (S5) if the second representation 12 has an access point 15 at position n 20. The processing means 410 is further operative to access (S6) the second representation 12 at position n 20. The processing means 410 is further operative to access (S7) the first representation 11 at position n 20. The processing means 410 is further operative to switch (S8) from the first representation 11 to the second representation 12 at a position following position n 20.

The processing means 410 may be further operative to switch from the first representation 11 to the second representation 12 at a position of an access point 15 in the second representation 12.

The embodiments described above are to be understood as a few illustrative examples of the present invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible.

Claims

1. A method of enabling switching to a video sequence (10) by creating multiple representations of the video sequence (10), the method comprising:

creating (S1) a first representation (11) of the video sequence (10) with a periodic access point structure, wherein an access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10);

creating (S2) a second representation (12) of the video sequence (10) with a sparser access point occurrence compared to the access point occurrence in the first representation (11) and with an aperiodic or periodic access point structure, wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11);

providing (S4) information (14) on positions and/or time instances of switching points (16) for the created representations (11-12), wherein the switching points (16) facilitate switching from the first representation (11) to the second representation (12).

2. The method according to claim 1, further comprising:

creating (S3) a third representation (13) of the video sequence (10) with a periodic access point structure, wherein the number of access points in the third representation (13) is larger than the number of access points in the second representation (12).

3. The method according to claim 2 wherein switching from the first representation (11) to the second representation (12) is facilitated through using at least one decoded picture from the third representation (13) as a reference when decoding pictures from the second representation (12).

4. The method according to claims 2-3, wherein the third representation (13) comprises access points at positions of every k-th reference picture in the second representation (12), where k > 1.

5. The method according to claims 2-4, wherein the decoded samples of each access point (15) in the third representation (13) are exactly the same as the decoded samples of the collocated picture in the second representation (12).

6. The method according to claims 2-4, wherein the decoded samples of each access point (15) in the third representation (13) are exactly the same as the decoded samples of the collocated picture in the second representation (12) that are used for reference by future pictures.

7. The method according to claim 5 or 6 wherein the pictures in the third representation (13) are obtained by encoding the reconstructed pictures from the second representation (12).

8. The method of claim 1, wherein the number and/or positions of access points in the second representation (12) are determined based on the video sequence (10) content.

9. The method according to any of the preceding claims wherein the access point (15) is an intra picture.

10. A method for accessing a video sequence (10) at a position n (20) that is not a start of the video sequence (10), wherein the video sequence (10) is comprising at least a first representation (11) and a second representation (12), wherein the first representation (11) has a periodic access point structure and the second representation (12) has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation (11), wherein the access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10), wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11), the method comprising:

determining (S5) if the second representation (12) has one access point (15) at position n (20);

if the second representation (12) has one access point (15) at position n (20):

accessing (S6) the second representation (12) at position n (20);

if the second representation (12) does not have one access point (15) at position n (20):

accessing (S7) the first representation (11) at position n (20);

switching (S8) from the first representation (11) to the second representation (12) at a position following position n (20).

11. The method according to claim 10, wherein the switching from the first representation (11) to the second representation (12) is performed at a position of the access point (15) in the second representation (12).

12. The method according to any of claims 10-11, wherein switching from the first representation (11) to the second representation (12) is performed such that one or more decoded pictures from the first representation (11) are used as reference in the second representation (12).

13. The method according to claim 12, wherein decoding of the pictures from the second representation (12) after switching from the first representation (11) to the second representation (12) results in identical sample values as if the decoding of the second representation (12) was performed from the start of the video sequence (10).

14. The method according to any of claims 10-13, wherein the sample values of the decoded pictures in the second representation (12) are identical to the sample values of the collocated decoded pictures in the first representation (11).

15. The method according to claims 10-12, wherein the decoded pictures from the first representation (11) are not output after being decoded.

16. An encoder (100) for creating multiple representations of a video sequence (10), the encoder (100) comprising processing means (210) operative to:

create a first representation (11) of the video sequence (10) with a periodic access point structure, wherein the access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10);

create a second representation (12) of the video sequence (10) with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation (11), wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11);

provide information (14) on the positions and/or time instances of switching points (16) for the created representations, wherein the switching points (16) facilitate switching from the first representation (11) to the second representation (12).

17. The encoder (100) according to claim 16, wherein the processing means (210) comprise a processor (250) and a memory (230) wherein said memory (230) is containing instructions executable by said processor (250).

18. The encoder (100) according to any of claims 16-17, wherein the processing means (210) is further operative to:

create a third representation (13) of the video sequence (10) with a periodic access point structure, wherein the number of access points in the third representation (13) is larger than the number of access points in the second representation (12).

19. An arrangement (300) for accessing a video sequence (10) in adaptive streaming at a position n (20) that is not the start of the video sequence (10), wherein the video sequence (10) is comprising at least a first representation (11) and a second representation (12), wherein the first representation (11) has a periodic access point structure and the second representation (12) has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation (11), wherein the access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10), wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11), the arrangement comprising processing means (410) operative to:

determine if the second representation (12) has one access point (15) at position n (20);

if the second representation (12) has one access point (15) at position n (20):

access the second representation (12) at position n (20);

access the first representation (11) at position n (20);

switch from the first representation (11) to the second representation (12) at a position following position n (20).

20. The arrangement (300) according to claim 19, wherein the processing means comprise a processor (450) and a memory (430) wherein said memory (430) is containing instructions executable by said processor (450).

21. The arrangement (300) according to any of claims 19-20, wherein the processing means (410) is further operative to switch from the first representation (11) to the second representation (12) at a position of an access point (15) in the second representation (12).

22. A computer program (240) for creating multiple representations of a video sequence (10), the computer program (240) comprising code means which, when run on a computer (200), causes the computer (200) to:

create a first representation (11) of the video sequence (10) with a periodic access point structure, wherein the access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10);

create a second representation (12) of the video sequence (10) with an aperiodic or periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation (11), wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11);

provide information (14) on the positions and/or time instances of switching points (16) for the created representations, wherein the switching points (16) facilitate switching from the first representation (11) to the second representation (12).

23. The computer program (240) according to claim 22, further causing the computer (200) to create a third representation (13) of the video sequence (10) with a periodic access point structure, wherein the number of access points in the third representation (13) is larger than the number of access points in the second representation (12).

24. A computer program (440) for accessing a video sequence at a position n (20) that is not the start of the video sequence (10), wherein the video sequence (10) is comprising at least a first representation (11) and a second representation (12), wherein the first representation (11) has a periodic access point structure and the second representation (12) has an aperiodic or a periodic access point structure and a sparser access point occurrence compared to the access point occurrence in the first representation (11), wherein the access point (15) serves as a random access point or facilitates switching between different representations of the video sequence (10), wherein the second representation (12) has a higher quality and/or higher bitrate than the first representation (11), the computer program (440) comprising code means which, when run on a computer (400) causes the computer (400) to:

if the second representation (12) has one access point (15) at position n (20):

access the second representation (12) at position n (20);

if the second representation (12) does not have one access point (15) at position n (20):

access the first representation (11) at position n (20);

25. The computer program (440) according to claim 24, further causing the computer (400) to switch from the first representation (11) to the second representation (12) at a position of an access point (15) in the second representation (12).

26. A computer program product (500) comprising computer readable means (510) and a computer program (240) according to claims 22-23 stored on the computer readable means (510).

27. A computer program product (600) comprising computer readable means (610) and a computer program (440) according to claims 24-25 stored on the computer readable means (610).