GB2560923A - Video streaming

Video streaming

Info

Publication number
GB2560923A
GB2560923A (application GB1704887.7A)
Authority
GB
United Kingdom
Prior art keywords
view
video segments
movement
user device
future
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1704887.7A
Other versions
GB201704887D0 (en)
Inventor
Hourunranta Ari
Guldogan Esin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Priority to GB1704887.7A
Publication of GB201704887D0
Publication of GB2560923A
Legal status: Withdrawn

Classifications

    • H04N 21/21805: Source of audio or video content enabling multiple viewpoints, e.g. using a plurality of cameras
    • G06T 3/4038: Image mosaicing, e.g. composing plane images from plane sub-images
    • G02B 27/01: Head-up displays
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/012: Head tracking input arrangements
    • G06F 3/013: Eye tracking input arrangements
    • G06F 3/0346: Pointing devices with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • G06T 15/20: 3D [Three Dimensional] image rendering; geometric effects; perspective computation
    • H04N 21/23439: Processing of video elementary streams involving reformatting operations for distribution or compliance with end-user requests or end-user device requirements, for generating different versions
    • H04N 21/44218: Monitoring of end-user related data; detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
    • H04N 21/6587: Control parameters, e.g. trick play commands, viewpoint selection
    • H04N 23/681: Control of cameras or camera modules for stable pick-up of the scene; motion detection
    • H04N 23/698: Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • G06T 2200/32: Indexing scheme for image data processing or generation involving image mosaicing


Abstract

Methods and systems for video streaming, particularly view-dependent video streaming, are disclosed. A current field-of-view 60 (FOV), within a 360-degree view field 50, of a user 55 is determined, with the background 70 not being visible to the user. Next, one or more first video segments 80a-80h representing at least part of the current field-of-view are displayed to a user device. Future movement of the device is estimated or predicted to determine a future field-of-view (Fig. 9, 61). One or more second video segments are requested by the apparatus, representing at least part of the future field-of-view. The second video segments are received and stored for subsequent display to the user device. The video segments may be related to different sub-areas of the FOV. The upcoming course may be estimated from the previous trajectory of the user and/or from its velocity. The data rates for receiving the first and second image portions may be adjusted according to said motion, also to preserve the overall bandwidth. Said bit rates may also depend on a threshold related to the user's movement.

Description

(12) UK Patent Application (19) GB (11) 2560923 (13) A (43) Date of A Publication: 03.10.2018
(21) Application No: 1704887.7
(22) Date of Filing: 28.03.2017
(51) INT CL: G06T 3/40 (2006.01); G06F 3/01 (2006.01); H04N 5/232 (2006.01); G02B 27/01 (2006.01); G06T 15/20 (2011.01)
(71) Applicant(s): Nokia Technologies Oy, Karaportti 3, 02610 Espoo, Finland
(56) Documents Cited: EP 3112985 A1; EP 3065406 A1; US 7411594 A1; US 20100250120 A1; US 20160086306 A1; US 20100232770 A1
(72) Inventor(s): Ari Hourunranta, Esin Guldogan
(58) Field of Search: INT CL G02B, G03F, G06T, H04N; Other: EPODOC, WPI
(74) Agent and/or Address for Service: Nokia Technologies Oy, IPR Department, Karakaari 7, 02610 Espoo, Finland
(54) Title of the Invention: Video streaming
(57) Abstract Title: View-dependent video streaming (abstract as reproduced above)
[Drawing sheets 1/9 to 9/9: Figures 1 to 10, as listed in the Brief Description of the Drawings; no meaningful text is recoverable from the extracted drawing pages.]
Intellectual Property Office
Application No. GB1704887.7
Date: 31 August 2017
The following terms are registered trade marks and should be read as such wherever they occur in this document: Bluetooth (page 8).
Intellectual Property Office is an operating name of the Patent Office. www.gov.uk/ipo
Video Streaming
Field of the Invention
This invention relates to video streaming, particularly view-dependent video streaming.
Background
It is known to provide data representing large scenes to users, only part of which can be seen at a given time. For example, in the field of virtual reality (VR), it is known to transmit video data representing a 360 degree video over a network to a user device, such as a VR headset.
The user will only see a portion of the 360 degree video, typically a 180 degree field-of-view, whilst the remainder of the video is in the background.
One of the main challenges for streaming such video is the high bandwidth requirement for acceptable subjective video quality. With limited bandwidth, a naive approach gives both seen and unseen portions of the video the same resolution. In view-dependent streaming, only the portion that can be seen is streamed at a high quality, whilst the remaining, unseen portion is streamed at a lower quality.
One method for view-dependent delivery is to split the video images into tiles and download and decode in high quality only the tiles that are within the user’s current field-of-view. The remaining part(s) of the video image are downloaded in very low quality, as a short-term fall back. Both the tiles and the background are streamed as segments that typically cover a duration of a few seconds. Segments typically start at a random access point, but because of the duration, there will be considerable latency if the playing device, e.g. a VR headset, waits for the next boundary before switching tiles to show the new viewpoint. Hence, the playing device needs to download segment(s) for the new tile(s) at the same time that it is displaying the video, and try to catch up with decoding the new tile(s).
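To make the tiling idea concrete, here is a minimal Python sketch (not from the application itself) that selects which tile columns of an equirectangular 360-degree panorama overlap the current field-of-view; the 12-column grid, the 120-degree FOV and all names are illustrative assumptions.

    def tiles_in_fov(yaw_deg, fov_deg=120.0, n_tiles=12):
        """Return indices of the vertical tile columns overlapping the FOV.

        The 360-degree panorama is split into n_tiles equal columns;
        yaw_deg is the centre of the user's view in [0, 360).
        """
        tile_width = 360.0 / n_tiles
        half = fov_deg / 2.0
        visible = []
        for i in range(n_tiles):
            centre = (i + 0.5) * tile_width
            # Wrap-safe angular distance between tile centre and view centre.
            diff = abs((centre - yaw_deg + 180.0) % 360.0 - 180.0)
            if diff <= half + tile_width / 2.0:
                visible.append(i)
        return visible

    # Tiles in view would be fetched in high quality, the rest as fallback.
    high_q = set(tiles_in_fov(yaw_deg=90.0))
    low_q = set(range(12)) - high_q
    print(sorted(high_q), sorted(low_q))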
Typically, the playing device will attempt to use all available bandwidth for downloading segments. There are rate adaptation mechanisms available where the playing device can select different bit-rate representations of the stream that best matches the available network throughput. In the case of quick or continuous head movement, for example, these systems need to download additional switching segments for the new tiles. Since the available bandwidth is usually already utilised for downloading the current segments, downloading the additional tiles may take longer, or downloading subsequent segments may be delayed. The new field-of-view therefore will not have a high subjective quality.
Summary of the Invention
A first aspect of the invention provides a method comprising: determining a current field-of-view of a user device; displaying to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view; estimating future movement of the user device to determine a future field-of-view; requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and receiving and storing the further video segments for display to the user device.
The current field-of-view may be represented by one or more first video segments, and the future field-of-view is represented by one or more second video segments, one or more of which corresponds to a different sub-area of the larger video image.
The method may further comprise displaying at least some of the stored second video segments to the user device if subsequent movement corresponds to the estimated future movement.
The estimated future movement may be estimated based on prior movement of the user device measured over a finite time period.
The estimated future movement may be estimated based on one or both of direction of movement and speed of movement.
The respective data-rate(s) used for receiving the first and/or second video segments may be adjusted based on movement.
The respective data-rate(s) may be adjusted to maintain the overall bandwidth within a predetermined maximum level.
A first data-rate may be used for receiving the first video segments if movement is below a first threshold, and a second, lower data-rate may be used for said first video segments if movement is above said first threshold.
The respective data-rates for receiving the first and second video segments may be substantially the same if movement is above said first threshold.
The method may comprise the user device selecting one of a plurality of bit-streams, each for providing the first and second video segments at different data rates, for causing the first and second video segments to be received at said selected data rate.
In the event that the one or more further video segments represent the same respective sub-areas as those of the current field-of-view, the requesting of the one or more further video segments may be prioritised based on the estimated movement.
The requesting of the one or more further video segments may be prioritised such that the segment(s) in the direction of estimated movement is or are requested first.
The user device may be a wearable user device. The user device may be a headset, e.g. a VR headset.
The video segments may be encoded using mono, stereo or multi-view coding.
A second aspect of the invention provides a computer program comprising instructions that, when executed by a computer, control it to perform the method of: determining a current field-of-view of a user device; displaying to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view; estimating future movement of the user device to determine a future field-of-view; requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and receiving and storing the further video segments for display to the user device.
A third aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: determining a current field-of-view of a user device; displaying to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view; estimating future movement of the user device to determine a future field-of-view; requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and receiving and storing the further video segments for display to the user device.
A fourth aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to determine a current field-of-view of a user device; to display to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view; to estimate future movement of the user device to determine a future field-of-view; to request from a remote device one or more further video segments representing at least part of the future field-of-view; and to receive and store the further video segments for display to the user device.
A fifth aspect of the invention provides an apparatus configured to perform the steps of determining a current field-of-view of a user device; displaying to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view; estimating future movement of the user device to determine a future field-of-view; requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and receiving and storing the further video segments for display to the user device.
Brief Description of the Drawings
The invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 is a perspective view of a VR display system, useful for understanding the invention; Figure 2a is a block diagram of a computer network including the Figure 1 VR display system, according to embodiments of the invention;
Figure 2b is a schematic diagram of an example VR capture scenario, which may be associated with a content provider shown in Figure 2a;
Figure 3a is a schematic top-plan view of a virtual space;
Figure 3b is a schematic internal view of part of the Figure 3a virtual space;
Figure 4 is a timeline for indicating downloading and processing stages of a known VR system, useful for understanding the invention;
Figure 5 is a block diagram of components of a VR media player forming part of the Figure 2 VR display system;
Figure 6 is a flow diagram showing processing stages of the VR media player, according to embodiments of the invention;
Figure 7 is a flow diagram showing further processing stages of the VR media player, according to embodiments of the invention;
Figure 8 is a graph indicative of measured and predicted movement with respect to time; Figure 9 is a schematic top-plan view of a virtual space in which the Figure 8 movement and predicted movement is indicated;
Figure 10 is a schematic internal view of part of the Figure 3a virtual space, relating to a further embodiment.
Detailed Description of Preferred Embodiments
Embodiments herein relate to video streaming, for example video streaming between a content provider and one or more user devices over a network, for example over an IP network such as the Internet.
More particularly, embodiments relate to view-dependent video streaming, where the data that is streamed from the source of the video content to a user end system is dependent on the position or orientation of the user.
In some embodiments, the video stream may represent a part of an overall, wide-angle video scene. For example, the video scene may cover a field which is greater than a viewer's typical field-of-view, e.g. greater than 180°. Therefore, embodiments are particularly suited to applications where a user may consume and/or interact with an overall video scene greater than 180° and possibly up to 360°.
One use case is virtual reality (VR) content whereby video content is streamed to a VR display system. As is known, the VR display system may be provided with a live or stored feed from a video content source, the feed representing a virtual reality space for immersive output through the display system. In some embodiments, audio is provided, which may be spatial audio.
Nokia’s OZO (RTM) VR camera is an example of a VR capture device which comprises a microphone array to provide a spatial audio signal, but it will be appreciated that the embodiments are not limited to VR applications nor the use of microphone arrays at the video capture point.
Figure 1 is a schematic illustration of a VR display system 1. The VR system 1 includes a VR headset 20, for displaying visual data in a virtual reality space, and a VR media player 10 for rendering visual data on the VR headset 20.
In the context of this specification, a virtual space is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed. The VR headset 20 may be of any suitable type. The VR headset 20 may be configured to provide VR video and audio content to a user. As such, the user may be immersed in virtual space.
The VR headset 20 receives visual content from a VR media player 10. The VR media player 10 may be part of a separate device which is connected to the VR headset 20 by a wired or wireless connection. For example, the VR media player 10 may include a games console, or a PC configured to communicate visual data to the VR headset 20.
Alternatively, the VR media player 10 may form part of the display for the VR headset 20.
Here, the media player 10 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display. For example, the device may be a touchscreen device having a large display over a major surface of the device, through which video content can be displayed. The device may be inserted into a holder of a VR headset 20. With these headsets, a smart phone or tablet computer may display visual data which is provided to a user’s eyes via respective lenses in the VR headset 20. The VR display system 1 may also include hardware configured to convert the device to operate as part of VR display system 1.
Alternatively, VR media player 10 may be integrated into the VR display device 20. VR media player 10 may be implemented in software. In some embodiments, a device comprising VR media player software is referred to as the VR media player 10.
The VR display system 1 may include means for determining the spatial position and/or orientation of the user’s head. Over successive time frames, a measure of movement may therefore be calculated and stored. Such means may comprise part of the VR media player 10. Alternatively, the means may comprise part of the VR display device 20. For example, the VR display device 20 may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These sensors generate position data from which a current visual field-of-view (FOV) is determined and updated as the user changes position and/or orientation. The VR display device 20 will typically comprise two digital screens for displaying stereoscopic video images of the virtual world in front of respective eyes of the user, and also two speakers for delivering audio, if provided from the VR system. The embodiments herein, which primarily relate to the delivery of VR content, are not limited to a particular type of VR display device 20.
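As an illustration of that movement measure, the short Python sketch below derives a mean angular speed from successive yaw/pitch samples; the 30 Hz rate, the window length and the simple finite-difference estimate are assumptions for the example, not something the application prescribes.

    from collections import deque

    class HeadTracker:
        """Keeps recent (yaw, pitch) samples and estimates angular speed."""

        def __init__(self, history=30):
            self.samples = deque(maxlen=history)

        def push(self, yaw, pitch):
            self.samples.append((yaw, pitch))

        def angular_speed(self, rate_hz=30.0):
            """Mean angular speed in degrees/second over the stored window."""
            s = list(self.samples)
            if len(s) < 2:
                return 0.0
            total = 0.0
            for (y0, p0), (y1, p1) in zip(s, s[1:]):
                dyaw = (y1 - y0 + 180.0) % 360.0 - 180.0  # wrap-safe yaw delta
                total += (dyaw ** 2 + (p1 - p0) ** 2) ** 0.5
            return total * rate_hz / (len(s) - 1)

    tracker = HeadTracker()
    for i in range(10):
        tracker.push(yaw=3.0 * i, pitch=0.0)  # steady drift: 90 deg/s at 30 Hz
    print(tracker.angular_speed())            # -> 90.0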
The VR display system 1 may be configured to display visual data to the user based on the spatial position of the display device 20 and/or the orientation of the user’s head. A detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual data to reflect a position or orientation transformation of the user with reference to the virtual space into which the visual data is projected. This allows VR content to be consumed with the user experiencing a 3D VR environment.
The VR display device 20 may display non-VR video content captured with two-dimensional video or image devices, such as a smartphone or a camcorder, for example. Such non-VR content may include a framed video or a still image. The non-VR source content may be 2D, stereoscopic or 3D. The non-VR source content includes visual source content, and may optionally include audio source content. Such audio source content may be spatial audio source content. Spatial audio may refer to directional rendering of audio in the virtual space such that a detected change in the orientation of the user's head may result in a corresponding change in the spatial audio rendering to reflect an orientation transformation of the user with reference to the virtual space in which the spatial audio data is rendered. The display of the VR display device 20 is described in more detail below.
The angular extent of the virtual environment observable through the VR display device 20 is called the visual field of view (FOV) of the display device 20. The actual FOV observed by a user depends on the inter-pupillary distance and on the distance between the lenses of the headset and the user’s eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the display device is being worn by the user.
Figure 2a shows a typical VR system 1, comprising the above-described media player 10 and VR display device 20. A remote content provider 30 may store and transmit streaming video data which, in the context of embodiments, is VR video for display to the VR display device 20. Responsive to receive or download requests sent by the media player 10, the content provider 30 streams the VR data over a data network 40, which may be any network, for example an IP network such as the Internet. Streaming may be by means of the MPEG-DASH standard, but is not limited to such.
The remote content provider 30 may or may not be the location or system where the VR video is captured and processed.
For illustration purposes, we may assume that the content provider 30 also captures, encodes and stores the VR content, as well as streaming it responsive to signals from the VR display system 1.
Referring to Figure 2b, an overview of a VR capture scenario 31 is shown together with a capturing, encoding and storing module 32 and an associated user interface 39. The Figure shows in plan-view a real world space 33 which may be for example a concert hall or other music venue. The capturing, encoding and storing module 32 is applicable to any real world space, however. A VR capture device 35 for video and spatial audio capture may be supported on a floor 34 of the space 33 in front of multiple audio sources 36, 37, in this case
two musicians and associated instruments; the position of the VR capture device 35 is known, e.g. through predetermined positional data or signals derived from a positioning tag on the VR capture device. The VR capture device 35 in this example may comprise a microphone array configured to provide spatial audio capture.
As well as having an associated microphone or audio feed, the audio sources 36,37 may carry a positioning tag. A positioning tag may be any module capable of indicating through data its respective spatial position to the capturing, encoding and storing module 32. For example the positioning tag may be a high accuracy indoor positioning (HAIP) tag which works in association with one or more HAIP locators 38 within the space 33. HAIP systems use
Bluetooth Low Energy (BLE) communication between the tags and the one or more locators 38. For example, there may be four HAIP locators mounted on, or placed relative to, the VR capture device 35. A respective HAIP locator may be to the front, left, back and right of the VR capture device 35. Each tag sends BLE signals from which the HAIP locators derive the tag's, and therefore the audio source's, location.
In general, such direction of arrival (DoA) positioning systems are based on (i) a known location and orientation of the or each locator, and (ii) measurement of the DoA angle of the signal from the respective tag towards the locators in the locators’ local co-ordinate system.
Based on the location and angle information from one or more locators, the position of the tag may be calculated using geometry.
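As a worked example of that geometry (and not the HAIP implementation itself), the Python sketch below intersects the bearing rays from two locators at known positions; the coordinate conventions and names are assumed for illustration.

    import math

    def locate_tag(p1, a1_deg, p2, a2_deg):
        """Intersect two bearing rays; angles are anticlockwise from the x-axis."""
        d1 = (math.cos(math.radians(a1_deg)), math.sin(math.radians(a1_deg)))
        d2 = (math.cos(math.radians(a2_deg)), math.sin(math.radians(a2_deg)))
        dx, dy = p2[0] - p1[0], p2[1] - p1[1]
        det = d2[0] * d1[1] - d1[0] * d2[1]
        if abs(det) < 1e-9:
            raise ValueError("bearings are parallel; position is ambiguous")
        t = (d2[0] * dy - d2[1] * dx) / det  # distance along the first ray
        return (p1[0] + t * d1[0], p1[1] + t * d1[1])

    # Locators at (0, 0) and (4, 0) both sight the tag: expected position (2, 2).
    print(locate_tag((0, 0), 45.0, (4, 0), 135.0))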
The capturing, encoding and storing module 32 is a processing system having an associated user interface (UI) 39 which may be used by an engineer or mixer to monitor and/or modify any aspect of the captured video and/or audio. As shown in Figure 2b, the capturing, encoding and storing module 32 receives as input from the VR capture device 35 spatial audio and video data, and positioning data, through a signal line 41. Alternatively, the positioning data may be received from the HAIP locator 38. The capturing, encoding and storing module 32 may also receive as input from each of the audio sources 36, 37 audio data and positioning data from the respective positioning tags, or the HAIP locator 38, through separate signal lines 42. The capturing, encoding and storing module 32 generates and stores the VR video and audio data for output to a user device 19, such as the VR system 1.
The input audio data may be multichannel audio in loudspeaker format, e.g. stereo signals,
4.0 signals, 5.1 signals, Dolby Atmos (RTM) signals or the like. Instead of loudspeaker format audio, the input may be in the multi microphone signal format, such as the raw eight signal input from the Nokia OZO (RTM) VR camera, if used for the VR capture device 35.
The microphone signals can then be rendered to loudspeaker or binaural format for playback.
Associated with the capturing, encoding and storing module 32 is a streaming system 43, for example a streaming server. The streaming system 43 may be an entirely separate system from the capturing, encoding and storing module 32. Signal line 42 indicates an input received over the network 40 from the VR system 1. As will be explained, the VR system 1 indicates through such signalling (a) one or more video segments to be streamed dependent on position and/or orientation of the VR display device 20 within a corresponding VR space, and (b) one of a plurality of bit-rate streams for said segments, based on an estimated future position or movement. Reference numeral 45 indicates schematically the various selectable streams of different bit-rates.
In some embodiments, the video data streamed by the content provider 30 may represent portions or segments of an overall VR video scene, having for example a 360° view field. Figure 3a is a schematic top plan view representing a 360° view field 50 in relation to a user 55 wearing the VR display device 20. Based on the user's position and the orientation of the VR display device 20, only a current FOV 60 may be streamed to the media player 10, i.e. the portions or segments of the video scene between the bounding lines 57. Position signals from the VR display device 20 are transmitted to the media player 10, which determines the FOV 60 and requests downloading of the segments corresponding to the FOV 60. When the segments corresponding to the FOV 60 are received, the media player 10 decodes and renders said segments to the VR display device 20.
The video data streamed by the content provider 30 may represent mono, stereo or other multi-view coding schemes.
Figure 3b shows, for example, a plurality of segments 80a-80h when rendered to the VR display device 20 from the user's perspective. These may be termed "first segments" in that they represent the current FOV 60. Each segment 80a-80h is effectively a tile representing a respective two-dimensional region of the FOV 60 for a given time period or interval. The different segments 80a-80h may be aligned in time, so that switching between them is possible, although it is irrelevant what the switching time is. Each segment 80a-80h may represent video data lasting several seconds in length. Embodiments herein are not limited to any particular duration.
For the avoidance of doubt, the term segment used herein refers to video data representing a sub-portion of an overall image for a time interval.
The segments 80a-80h may be streamed from the content provider 30 at a selected data rate, i.e. a first bit-rate (or digital bandwidth). The content provider 30 may provide a plurality of streams for the segments 80a-80h having respective bit-rates, in order that the media player 10 may select which stream to download. Usually, the media player 10 is configured to utilize substantially all bandwidth on the channel and hence most of the bandwidth may, by default, be assigned to the first segments 80a-80h in order that they be decoded and rendered in high resolution.
Data representing the remainder of the video scene, i.e. the background 70, may also be downloaded using a lower, second bit-rate to provide a back-up in case the user changes position at a later time. The background 70 may be similarly represented by tiles or as a single lower-quality panorama. In the latter case, the temporal segments representing said panorama do not need to be aligned. Background segments also cover a duration of a few seconds and typically start with a random-access point. There will be considerable latency if the media player 10 waits until the next segment boundary before switching to background segments to show the new viewpoint. Hence, conventionally, the media player 10 will try to download new segments (responsive to a change in position) at the same time as it renders current segments, and tries to catch up with the decoding and rendering of the new segments. However, because there is limited bandwidth, and most of that is being used for the first segments, downloading the new segments may be delayed and/or is likely to be of low visual quality, i.e. low resolution.
By way of illustration, this challenging situation is shown in Figure 4, which represents four sequential time frames T1-T4. In a first time frame T1, three segments 100 corresponding to several seconds of a current FOV are downloaded in sequence to the media player 10 at a first bit-rate. In a second time frame T2, another three segments 110 are downloaded in sequence, whilst the media player 10 concurrently decodes and renders the previous segments 100. In a third time frame T3, another three segments 120 are downloaded in sequence, whilst the media player 10 concurrently decodes and renders the previous segments 110. However, a sudden change in user position within time frame T3 will require one or more new segments 125 to be downloaded. Given the limited bandwidth, one of the current segments 120 may be delayed, effectively pausing playback until the necessary buffering is complete.
Embodiments herein provide methods and systems for mitigating this form of playback inefficiency in order to maintain or improve user experience.
In overview, embodiments involve estimating or predicting future movement (or a future FOV) and proactively downloading second video segments corresponding to at least some of the estimated future FOV in substantially the same time frame. In some embodiments, this may also involve selecting appropriate download speeds for the first and second video streams. For example, the first video segments (current FOV) may be downloaded at a lower bandwidth to enable the second video segments to be downloaded at a higher bandwidth than was previously the case (when said segments were considered background and hence downloaded at very low resolution). The respective bandwidths may depend on the type, amount and/or nature of the movement.
Figure 5 is a schematic diagram of components of the media player 10. The media player 10 may have a controller 130, RAM 132, a memory 134, and, optionally, hardware keys 136 and a display 138. The media player 10 may comprise a network interface 140 for connection to the network 40, e.g. a modem which may be wired or wireless. The media player 10 also comprises a wired or wireless port for transmitting and receiving signals with the VR display device 20. The input signals from the display device 20 will be position signals, indicative of user position/orientation and from which can be computed instantaneous and/or averaged movement over time. The output signals to the display device 20 will be the decoded and rendered video segment data. The controller 130 is connected to each of the other components in order to control operation thereof.
The memory 134 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 134 stores, amongst other things, an operating system 142 and may store software applications 144. The RAM 132 is used by the controller 130 for the temporary storage of data. The operating system 142 may contain code which, when executed by the controller 130 in conjunction with the RAM 132, controls operation of each of the hardware components of the media player 10.
The controller 130 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors.
The media player 10 may be a standalone computer, a server, a console, or a network thereof. In some embodiments, the media player 10 may be provided as part of the VR display device 20. In such cases, both the media player 10 and VR display device 20 may be collectively referred to as a user device. The media player 10 may communicate with the content provider 30 and the VR display device 20 in accordance with one or more software applications 144, performing the following steps, which include estimation or prediction of future movement.
In some embodiments, the media player 10 may also be associated with external software applications not stored on the media player. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The media player 10 may be in communication with the remote server device in order to utilize the software application stored there.
One software application 144 provided on the memory 134 is for estimating or predicting future movement of the VR display device 20, and for controlling which video segments to download at a given time frame. This may include initiating downloading of one or more second segments of video content representing at least a portion of the predicted future FOV.
Further, the software application 144 may use the estimated future movement to adapt the bit-rate(s) used for one or both of the first segments and the second segments, for example to maintain bandwidth within predetermined limits.
Figure 6 is a flow diagram indicating processing steps performed by the media player 10 under control of the software application 144.
The process starts in step 6.1 which indicates a first time frame, which can be any period of finite length.
In a next step 6.2, a current FOV is determined based on the position of the VR display device 20, i.e. using the position signals or data. In a next step 6.3, the media player 10 requests from the content provider 30 one or more first video segments (corresponding to the current FOV) and starts downloading at a first data rate. The first data rate may correspond to a high or medium bandwidth to produce a good subjective image when rendered.
In a next step 6.4, the application 144 estimates future movement (and therefore, potentially, a future FOV) which typically takes into account prior movement measured over a finite time period. This measurement may be performed relatively frequently, for example thirty times per second. In a next step 6.5, if a significant amount of movement is detected, for example above a predetermined threshold, then in a next step 6.6 the first data rate may be adjusted,
e.g. by selecting a lower bandwidth stream for the first video segments so that downloading may continue at the adjusted rate in step 6.3 for the next first video segments for the current FOV. If below the threshold, then the process returns to step 6.3 without changing the first data rate. In a next step 6.7, which is also responsive to a positive outcome from step 6.5, and
which may be performed in parallel with step 6.6, the next (second) video segments to download are determined based on the estimate in step 6.4. In a subsequent step 6.8 the media player 10 requests and downloads from the content provider 30 one or more second video segments at a second data rate. In a step 6.9, the received second video segments are stored or buffered locally at the media player 10, for example in cache memory. Effectively, in the event of predicted future motion, we adjust the first data rate to reserve some bandwidth for quickly downloading second video segments.
The Figure 6 process repeats for subsequent time frames.
In some embodiments, the decision on which data rates to use may be performed relatively infrequently, for example only when the segment cache or buffer is approaching an empty state so that a new download needs to be initiated.
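The download cycle of Figure 6 can be summarised in the following Python sketch; the threshold value, the rate labels and the data shapes are illustrative assumptions, since the application specifies the steps rather than any particular interface.

    MOVEMENT_THRESHOLD = 20.0  # degrees/second, an assumed value

    def plan_downloads(current_tiles, predicted_tiles, speed):
        """Return (first_request, second_request) for one download cycle."""
        if speed > MOVEMENT_THRESHOLD:                        # step 6.5
            # Step 6.6: lower the current-FOV rate to reserve bandwidth;
            # steps 6.7-6.9: request future-FOV segments for local buffering.
            first = {"tiles": current_tiles, "rate": "medium"}
            second = {"tiles": predicted_tiles, "rate": "medium"}
        else:
            first = {"tiles": current_tiles, "rate": "high"}  # step 6.3
            second = None
        return first, second

    print(plan_downloads([2, 3, 4], [1, 2, 3], speed=35.0))  # moving: prefetch
    print(plan_downloads([2, 3, 4], [1, 2, 3], speed=5.0))   # steady: high rate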
Figure 7 is a flow diagram indicating further processing steps performed by the media player 10 under control of the software application 144. The Figure 7 process may be performed in parallel with the Figure 6 process.
In a first step 7.1, movement of the VR display device 20 is detected. In a next step 7.2, the corresponding FOV may be determined, i.e. the new FOV. In a next step 7.3, if the new FOV covers any of the previously buffered second video segments from step 6.8, then said segments are retrieved, decoded and rendered locally in steps 7.5 and 7.6. In other words, if the estimated movement from step 6.4 corresponds to some degree with the actual later movement, then at least some (and possibly all) of the second video segments are available locally and can be displayed at good resolution with little or no delay. If the estimated movement from step 6.4 was incorrect, then the required second video segments need to be downloaded as before from the content provider 30 at low quality (step 7.4).
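Under the same assumed data shapes as the previous sketch, the playback path of Figure 7 reduces to a membership test: tiles buffered for a correctly predicted FOV are served locally, and anything missing falls back to a fresh low-quality download.

    def resolve_new_fov(new_fov_tiles, buffered_tiles):
        """Split the new FOV into locally buffered and still-needed tiles."""
        local = [t for t in new_fov_tiles if t in buffered_tiles]       # step 7.3
        missing = [t for t in new_fov_tiles if t not in buffered_tiles]
        return local, missing

    # Prediction was partly right: tiles 1 and 2 were prefetched, tile 0 was not.
    local, fallback = resolve_new_fov([0, 1, 2], buffered_tiles={1, 2, 3})
    print("render from buffer:", local)          # steps 7.5-7.6
    print("download at low quality:", fallback)  # step 7.4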
Referring back to steps 6.3 and 6.6, the bit-rate variants are selected in order to be able to download and decode both current and the future segments simultaneously. The bit-rate and quality of both sets are however not dropped down to background quality, as is conventionally the case. The selected bit-rate variants may depend on the type of movement.
For example, if there is continuous movement in general, as indicated in Figure 8, medium bit-rate streams may be selected for both the first video segments and the second video segments. In this case, there is a good to high expectation that a change of FOV is approaching, and what it will be, and hence we need to preserve bandwidth for both sets of segments.
For example, if there is little or no head motion, a higher bit-rate variant may be selected for the first video segments for rendering in high quality, since there is a smaller likelihood of second video segments being required imminently.
Figure 9 is a top plan view of the Figure 3a space, showing how the FOV 60’ is changed based on the Figure 8 movement, and how a future FOV 61 is estimated.
In some embodiments, the media player 10 may begin downloading segments for a future
FOV based on predicted motion. This may not be tied to the normal download cycle described with reference to Figure 6. For example, the sub-process of step 6.4 may initiate downloading new segments within a typical download initiation time frame.
Referring to Figure 10, a further embodiment will now be described. As will be appreciated, in order to ensure smooth playback, the media player 10 needs to cache or buffer the tile segments. However, in order to reduce latency when switching tiles after head motion, and reduce bandwidth usage, the caching should be minimised. If many segments are downloaded at the same time in a bandwidth-limited network, parallel downloads may affect each other and get completed in a random order.
Therefore, embodiments herein further propose a prioritised download order based on the predicted motion.
A plurality of tile segments 150a-150h are shown in Figure 10. The box 160 indicates a current FOV and the arrow indicates the predicted motion, which is a head motion towards the left.
The media player 10 may detect that the cache is becoming empty and new segments are needed. However, if the estimated head motion does not yet trigger the downloading of new tile segments, e.g. segments 150a, 150b, then segments need to be cached for tile segments 150c-150h in the current FOV 160.
To effect this, the media player 10 prioritizes the download order for the current tile segments 150c-150h based on the motion, for example by first downloading segments for tiles in the direction of movement. In the shown example, the first segments that are cached will be those on the left-hand side, i.e. 150c, 150d, followed by the next set 150e, 150f and finally 150g, 150h. The justification is that, because motion is to the left, the tiles 150g, 150h on the right-hand side are not expected to be visible first.
If no motion is detected, however, we may prioritise the central tile segments, i.e. 150e, 150f.
As motion is predicted iteratively, e.g. 30 times per second, if the motion changes, the media player 10 may fine-tune the prioritization for tile segments that have not yet started downloading. For example, if motion continues in the predicted direction, it may be that we do not need to download segments for tiles 150g and 150h at all.
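A small Python sketch of this prioritisation, with tiles modelled as column indices (an assumption for illustration): tiles nearest the predicted direction of motion download first, falling back to a centre-out order when no motion is detected.

    def download_order(tiles, motion_direction):
        """Order tile columns for download; motion_direction is 'left',
        'right' or None (no detected motion)."""
        if motion_direction == "left":
            return sorted(tiles)                    # leftmost columns first
        if motion_direction == "right":
            return sorted(tiles, reverse=True)      # rightmost columns first
        centre = (min(tiles) + max(tiles)) / 2.0
        return sorted(tiles, key=lambda t: abs(t - centre))  # centre outwards

    # Tiles 150c-150h modelled as columns 2..7, with leftward motion predicted:
    print(download_order([2, 3, 4, 5, 6, 7], "left"))  # [2, 3, 4, 5, 6, 7]
    print(download_order([2, 3, 4, 5, 6, 7], None))    # centre-out order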
Estimation of future movement may be determined using various methods. For example, the VR user device 20 may be tracked, e.g. 30 to 60 times per second, by measuring one or more of the yaw, pitch and roll parameters. Each time a new position is measured, the following steps may be performed.
First, the displacement between the current and previously measured position can be determined. It may be referred to as a motion vector, e.g. in the yaw-pitch domain. The previous N motion vectors may be stored in a First In First Out (FIFO) array. N may be ten, for example. Next, the total motion for the previous N samples may be determined as the sum of the stored motion vectors. If the total motion vector is towards a particular direction, e.g. the left or right, and has a length greater than a predetermined threshold T1, we may predict that the head will continue moving in that direction. Hence we may predict the future FOV, and determine which segments to download next.
In some embodiments, if the total motion vector is greater than a different, higher threshold T2, for example indicating that the user is turning their head fast, the media player 10 may switch to a medium data rate mode. This is because we can expect the future FOV to be unsteady but motion may continue. This is also because the user is unlikely to perceive fine details if moving quickly.
In some embodiments, if the motion vectors in the FIFO indicate inconsistent motion, then the length of the total motion vector may be smaller than T1, but with a variance of individual vectors greater than a further threshold T3 (where the vectors partially eliminate each other by pointing in opposite directions). In this case, we may predict that the user is randomly turning their head back and forth, and generally it may be preferred not to use this data for predicting future movement. We may in this case continue with downloading the current segments at a medium data rate.
In some embodiments, if the total motion vector length is below the threshold T1, and the variance of the individual vector lengths is below T3 (indicating relatively little and/or steady motion, for which the user will expect to see high quality video), the media player may select the stream with the highest possible data rate for the future segments, subject to network throughput.
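Pulling the preceding paragraphs together, here is a hedged Python sketch of the heuristic: the FIFO length N and the values of T1, T2 and T3 are invented for illustration, and "variance of individual vectors" is read as the variance of vector lengths, one plausible interpretation of the text.

    from collections import deque

    N, T1, T2, T3 = 10, 5.0, 20.0, 4.0  # assumed values; the text names the
                                        # thresholds but gives no numbers

    class MotionPredictor:
        """FIFO of the last N yaw/pitch motion vectors, as described above."""

        def __init__(self):
            self.vectors = deque(maxlen=N)

        def push(self, dyaw, dpitch):
            self.vectors.append((dyaw, dpitch))

        def decide(self):
            if not self.vectors:
                return "no data"
            ty = sum(v[0] for v in self.vectors)
            tp = sum(v[1] for v in self.vectors)
            total_len = (ty * ty + tp * tp) ** 0.5
            lens = [(y * y + p * p) ** 0.5 for y, p in self.vectors]
            mean = sum(lens) / len(lens)
            var = sum((l - mean) ** 2 for l in lens) / len(lens)
            if total_len > T2:   # fast, sustained turn
                return "medium rate for current and future segments"
            if total_len > T1:   # consistent motion: prefetch ahead of it
                return "prefetch FOV shifted by (%+.1f, %+.1f) degrees" % (ty, tp)
            if var > T3:         # vectors cancel out: head shaking back and forth
                return "ignore for prediction; medium rate for current segments"
            return "steady view: highest data rate the network throughput allows"

    predictor = MotionPredictor()
    for _ in range(10):
        predictor.push(-2.0, 0.0)  # steady leftward drift
    print(predictor.decide())      # consistent motion -> prefetch to the left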
In some embodiments, machine learning models, e.g. neural networks, may be employed to predict future movement, taking previous movements into account.
In some embodiments, the steps of determining the future movement and/or the future FOV may be performed remotely from the VR display system 1, for example at the content provider 30 or even a separate device, e.g. a cloud device.
For example, the position / movement data generated by the VR display system 1 responsive to movement of the VR headset 20 may be transmitted over the network 40 to such a remote device which estimates future movement and therefore the second video segments to transmit to the VR display system for buffering. Said remote device may also determine the bit-rate at which the respective first and second video segments are to be transmitted to the VR display system 1.
In summary, it will be appreciated that the methods and systems disclosed herein provide a way of reducing latency in providing new video segments to a VR display system, responsive to a change of user view. Rather than downloading background video at a very low bandwidth as a fallback to possible movement, the methods and systems proactively download (or cause downloading of) future video segments based on an estimated future movement, and may adjust the bandwidths of current and future segments appropriately.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof. During the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims (18)

  1. Claims
    1. A method comprising:
    determining a current field-of-view of a user device;
    5 displaying to the user device one or more video segments representing respective subareas of a larger video image for showing at least part of the current field-of-view;
    estimating future movement of the user device to determine a future field-of-view; requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and
    10 receiving and storing the further video segments for display to the user device.
  2. 2. The method of claim 1, wherein the current field-of-view is represented by one or more first video segments, and the future field-of-view is represented by one or more second video segments, one or more of which corresponds to a different sub-area of the larger video
    15 image.
  3. 3. The method of claim 2, further comprising displaying at least some of the stored second video segments to the user device if subsequent movement corresponds to the estimated future movement.
  4. 4. The method of claim 2 or claim 3, wherein the estimated future movement is estimated based on prior movement of the user device measured over a finite time period.
  5. 5. The method of claim 4, wherein the estimated future movement is estimated based on 25 one or both of direction of movement and speed of movement.
  6. 6. The method of any preceding claim, wherein the respective data-rate(s) used for receiving the first and/or second video segments is/are adjusted based on movement.
    30
  7. 7. The method of claim 6, wherein the respective data-rate(s) is/are adjusted to maintain the overall bandwidth within a predetermined maximum level.
  8. 8. The method of claim 6 or claim 7, wherein a first data-rate is used for receiving the first video segments if movement is below a first threshold, and a second, lower data-rate is
    35 used for said first video segments if movement is above said first threshold.
  9. 9. The method of claim 8, wherein the respective data-rates for receiving the first and second video segments is substantially the same if movement is above said first threshold.
10. The method of any of claims 6 to 9, wherein the user device selects one of a plurality of bit-streams, each for providing the first and second video segments at different data rates, for causing the first and second video segments to be received at said selected data rate.
11. The method of claim 1, wherein, in the event that the one or more further video segments represent the same respective sub-areas as those of the current field-of-view, the requesting of the one or more further video segments is prioritised based on the estimated movement.
12. The method of claim 11, wherein the requesting of the one or more further video segments is prioritised such that the segment(s) in the direction of estimated movement is or are requested first.
13. The method of any preceding claim, wherein the user device is a wearable user device.
14. The method of claim 13, wherein the user device is a headset, e.g. a VR headset.
15. The method of any preceding claim, wherein the video segments are encoded using mono, stereo or multi-view coding.
16. A computer program comprising instructions that, when executed by a computer, control it to perform the method of any preceding claim.
17. A non-transitory computer-readable storage medium having stored thereon computer-readable code which, when executed by at least one processor, causes the at least one processor to perform a method comprising:
    determining a current field-of-view of a user device;
    displaying to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view;
    estimating future movement of the user device to determine a future field-of-view;
    requesting from a remote device one or more further video segments representing at least part of the future field-of-view; and
    receiving and storing the further video segments for display to the user device.
18. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which, when executed, controls the at least one processor:
    to determine a current field-of-view of a user device;
    to display to the user device one or more video segments representing respective sub-areas of a larger video image for showing at least part of the current field-of-view;
    to estimate future movement of the user device to determine a future field-of-view;
    to request from a remote device one or more further video segments representing at least part of the future field-of-view; and
    to receive and store the further video segments for display to the user device.
19. An apparatus configured to perform the method of any of claims 1 to 15.
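As an illustration only of the request prioritisation recited in claims 11 and 12 above, one possible ordering is sketched below; the function name, the tile geometry and the signed-offset scheme are assumptions for this sketch, not the claimed method.

def prioritise_requests(tile_ids, current_yaw, yaw_velocity, tile_deg=30.0):
    """Order segment requests so tiles lying in the direction of the
    estimated movement are fetched first, nearest tiles leading."""
    direction = 1.0 if yaw_velocity >= 0.0 else -1.0

    def key(tile_id):
        centre = tile_id * tile_deg + tile_deg / 2.0
        # Signed angular offset of the tile centre from the current
        # view, wrapped into [-180, 180).
        offset = (centre - current_yaw + 180.0) % 360.0 - 180.0
        ahead = direction * offset >= 0.0
        # Tiles ahead of the movement sort before tiles behind it.
        return (0 if ahead else 1, abs(offset))

    return sorted(tile_ids, key=key)

# Example: moving right (positive yaw velocity) from a view at 90 degrees.
print(prioritise_requests([1, 2, 3, 4, 5], current_yaw=90.0,
                          yaw_velocity=20.0))  # -> [3, 4, 5, 2, 1]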
    Application No: GB1704887.7 Examiner: Dr Fabio Noviello
GB1704887.7A 2017-03-28 2017-03-28 Video streaming Withdrawn GB2560923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1704887.7A GB2560923A (en) 2017-03-28 2017-03-28 Video streaming

Publications (2)

Publication Number Publication Date
GB201704887D0 (en) 2017-05-10
GB2560923A (en) 2018-10-03

Family

ID=58688160

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1704887.7A Withdrawn GB2560923A (en) 2017-03-28 2017-03-28 Video streaming

Country Status (1)

Country Link
GB (1) GB2560923A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3230221A1 (en) * 2017-10-12 2019-04-18 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Optimizing audio delivery for virtual reality applications

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7411594B2 (en) * 2002-01-15 2008-08-12 Canon Kabushiki Kaisha Information processing apparatus and method
US20100232770A1 (en) * 2009-03-13 2010-09-16 Disney Enterprises, Inc. System and method for interactive environments presented by video playback devices
US20100250120A1 (en) * 2009-03-31 2010-09-30 Microsoft Corporation Managing storage and delivery of navigation images
US20160086306A1 (en) * 2014-09-19 2016-03-24 Sony Computer Entertainment Inc. Image generating device, image generating method, and program
EP3065406A1 (en) * 2015-03-05 2016-09-07 Nokia Technologies Oy Video streaming method
EP3112985A1 (en) * 2015-06-30 2017-01-04 Nokia Technologies Oy An apparatus for video output and associated methods

Also Published As

Publication number Publication date
GB201704887D0 (en) 2017-05-10

Similar Documents

Publication Publication Date Title
US10712555B2 (en) Streaming virtual reality video
US10469820B2 (en) Streaming volumetric video for six degrees of freedom virtual reality
US9699437B2 (en) Methods and apparatus for streaming content
EP3522542B1 (en) Switching between multidirectional and limited viewport video content
US10499066B2 (en) Method and apparatus for improving efficiency of content delivery based on consumption data relative to spatial data
US11523144B2 (en) Communication apparatus, communication method, and computer-readable storage medium
US11050991B2 (en) Image processing using a plurality of images for a three dimension scene, having a different viewing positions and/or directions
Shi et al. Freedom: Fast recovery enhanced VR delivery over mobile networks
TWI824016B (en) Apparatus and method for generating and rendering a video stream
KR20190063590A (en) Operating Method for Live Streaming Service of Virtual Contents based on Tiled Encoding image and electronic device supporting the same
JP6334644B2 (en) Method and apparatus for superior streaming of immersive content
JP2023171661A (en) Encoder and method for encoding tile-based immersive video
GB2560923A (en) Video streaming
JP2020522935A (en) Image processing apparatus and system
US11134236B2 (en) Image processing device and system
WO2018178510A2 (en) Video streaming
GB2568726A (en) Object prioritisation of virtual content
WO2023103875A1 (en) Viewpoint switching method, apparatus and system for free viewpoint video
EP4013059A1 (en) Changing video tracks in immersive videos

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)