WO2019034804A2 - Three-dimensional video processing - Google Patents

Three-dimensional video processing

Info

Publication number
WO2019034804A2
Authority
WO
WIPO (PCT)
Prior art keywords
video content
data
regions
foreground
background
Prior art date
Application number
PCT/FI2018/050435
Other languages
French (fr)
Other versions
WO2019034804A3 (en)
Inventor
Kimmo Roimela
Mika Pesonen
Johannes Rajala
Johannes Pystynen
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2019034804A2 publication Critical patent/WO2019034804A2/en
Publication of WO2019034804A3 publication Critical patent/WO2019034804A3/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/10: Geometric effects
    • G06T 15/20: Perspective computation
    • G06T 15/205: Image-based rendering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10: Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106: Processing image signals
    • H04N 13/111: Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117: Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30: Image reproducers
    • H04N 13/332: Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N 13/344: Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/30: Image reproducers
    • H04N 13/366: Image reproducers using viewer tracking
    • H04N 13/376: Image reproducers using viewer tracking for tracking left-right translational head movements, i.e. lateral movements

Definitions

  • This invention relates to methods and systems for three-dimensional video processing, for example in virtual reality applications.
  • a VR display system may be provided with a live or stored feed from a video content source, the feed representing a VR space or world for immersive output through the display system.
  • audio is provided, which may be spatial audio.
  • a virtual space or virtual world is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed through a display system such as a VR headset.
  • a VR headset may be configured to provide VR video and audio content to the user, e.g. through the use of a pair of video screens and headphones incorporated within the headset.
  • the VR feed may comprise data representing a plurality of frames of three-dimensional (3D) video content which provides a 3D representation of a space and/or objects which appear to have depth when rendered to the VR headset.
  • the 3D video content may comprise a colour (e.g. RGB) stream and a corresponding depth stream indicating the depth information for different parts of the colour stream.
  • Position and/or movement of the user device can enhance the immersive experience.
  • VR headsets use so-called three degrees of freedom (3DoF), which means that head movement in the yaw, pitch and roll axes is measured and determines what the user sees.
  • a next stage may be referred to as 3DoF+, which may facilitate limited translational movement in Euclidean space in the range of, e.g. tens of centimetres, around a location.
  • a yet further stage is a six degrees of freedom (6DoF) VR system, where the user is able to freely move in the Euclidean space and rotate their head in the yaw, pitch and roll axes.
  • Volumetric VR content comprises data representing spaces and/or objects in three dimensions from all angles, enabling the user to move fully around the spaces and/or objects to view them from any angle. For example, a person or object may be fully scanned and reproduced within a real-world space. When rendered to a VR headset, the user may 'walk around' the person or object and view them from the front, the sides and from behind.
  • a first aspect of the invention provides a method comprising: receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; storing the foreground video content in a first memory; receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
  • the rendered foreground and background video content may be stored in an output buffer for outputting to a user viewing device as a buffered sequence of frames.
  • the rendering of the background video content may be based on background rendering instructions provided in the control data.
  • the method may further comprise receiving positional data indicative of user movement and wherein the rendering of the background video content comprises identifying one or more newly-visible regions based on the positional data and rendering the background video content corresponding to the one or more newly- visible regions.
  • the positional data may be received from a user viewing device and wherein the newly-visible regions correspond to regions occluded from the user's viewing perspective.
  • Identifying the one or more newly-visible regions may comprise identifying regions having no video content.
  • the one or more selected regions may comprise one or more predetermined polygons.
  • the method may further comprise applying or associating a modification to the one or more selected regions.
  • the modification may comprise one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions.
  • the control data may further indicate the modification to be applied or associated to each of the one or more regions.
  • the first and second data streams may be received simultaneously.
  • the second data stream may comprise control data associated with each frame of foreground video content.
  • the first data stream may comprise a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
  • the second memory may be a persistent colour plus depth buffer that is managed over multiple frames according to the control data.
  • the second memory may use double buffering.
  • the method may be performed at a media playing device associated with a user viewing device.
  • a second aspect of the invention provides a method comprising: providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; providing a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected one or more regions of the foreground video content which are instructed to be copied.
  • the one or more instructions for providing the background video content may comprise rendering instructions for the background video content for output to a user viewing device.
  • the rendering instructions may further indicate how the background video content is to be rendered based on positional data indicative of user movement which reveals one or more newly visible regions.
  • the one or more instructions for copying one or more selected regions of the provided foreground video content may comprise identifying one or more predetermined polygons.
  • the control data may further comprise one or more instructions for applying or associating a modification to the one or more selected regions of the foreground video content.
  • the modification instructions may comprise one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions of the foreground video content.
  • the first and second data streams may be transmitted simultaneously.
  • the second data stream may comprise control data associated with each frame of video content.
  • the first data stream may comprise a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
  • the method may be performed at a content provider system configured to send the first and second data streams to one or more remote media playing devices.
  • a third aspect of the invention provides a computer program comprising instructions that when executed by a computer control it to perform the method of any preceding definition.
  • a fourth aspect of the invention provides an apparatus configured to perform the method steps of any preceding method definition.
  • a fifth aspect of the invention provides a non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; storing the foreground video content in a first memory; receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
  • a sixth aspect of the invention provides a non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; providing a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
  • a seventh aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; to store the foreground video content in a first memory; to receive a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; to store a copy of the one or more selected regions in a second memory; and to provide background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
  • An eighth aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to provide a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; to provide a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
  • Figure 1 is a perspective view of a VR display system;
  • Figure 2 is a block diagram of a computer network including the Figure 1 VR display system, according to embodiments of the invention.
  • Figure 4 is a schematic diagram of an example VR capture scenario and a content provider system, according to embodiments of the invention.
  • Figure 5 is a representational view of one frame of panoramic video resulting from the Figure 4 scenario
  • Figure 6 is a representational view of a subsequent frame of panoramic video, indicative of a user moving their head to one side;
  • Figure 7 is a schematic diagram of components of a content provider system shown in Figure 4.
  • Figure 8 is a flow diagram showing processing steps performed at the content provider system of Figure 4, according to embodiments of the invention.
  • Figure 9 is a flow diagram showing processing steps performed at the content provider system of Figure 4 for generating control data for the method of Figure 8;
  • Figure 10 is a schematic diagram showing part of the Figure 4 content provider system and different streams of data, according to embodiments of the invention.
  • Figure 11 is a schematic diagram of components of a media player shown in Figure 2;
  • Figure 12 is a flow diagram showing processing steps performed at the media player of Figure 11;
  • Figure 13 is a flow diagram showing further processing steps performed at the media player of Figure 11, according to embodiments of the invention.
  • Figure 14 is a block diagram showing functional modules and processes involved in the Figure 12 or Figure 13 methods, according to embodiments of the invention.
  • Figures 15a - 15e are graphical representations which are useful for understanding different stages of the Figure 12 or Figure 13 methods, according to embodiments of the invention.
Detailed Description of Preferred Embodiments
  • Embodiments herein relate to processing video content, particularly three-dimensional (3D) video content.
  • the 3D video content may be virtual reality (VR) video content, representing a plurality of frames of VR data for output to a VR headset or a similar display system.
  • the 3D video content may represent panoramic video content.
  • Such methods and systems are applicable to related technologies, including Augmented Reality (AR) technology and panoramic video technology.
  • Video content is represented by video data in any format.
  • the video data may be captured and provided from any image sensing apparatus, for example a single camera or a multi-camera device, e.g. Nokia's OZO camera.
  • the methods and systems described herein are applicable to video content captured by, for example, monoscopic cameras, stereoscopic cameras, 360 degree panoramic cameras and other forms of VR or AR camera.
  • the captured video data may be stored remotely from the one or more users, and streamed to users over a network.
  • the network may be an IP network such as the Internet.
  • the video data may be stored local to the one or more users on a memory device, such as a hard disk drive (HDD) or removable media such as a CD-ROM, DVD or memory stick.
  • the video data may be stored remotely on a cloud-based system.
  • the video data is stored remotely from one or more users at a content server.
  • the video data is streamed over an IP network to a display system associated with one or more users.
  • the data stream of the video data may represent one or more VR spaces or worlds for immersive output through the display system.
  • audio may also be provided, which may be spatial audio.
  • FIG. 1 is a schematic illustration of a VR display system 1 which represents user-end equipment.
  • the VR display system 1 includes a user device in the form of a VR headset 20 for displaying video data representing a VR space, and a VR media player 10 for rendering the video data on the VR headset 20.
  • a separate user control (not shown) may be associated with the VR display system 1, e.g. a hand-held controller.
  • a virtual space or world is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed. It may comprise one or more objects.
  • the VR headset 20 may be of any suitable type.
  • the VR headset 20 may be configured to provide VR video and audio content data to a user. As such, the user may be immersed in virtual space.
  • the VR headset 20 receives the VR video data from a VR media player 10.
  • the VR media player 10 may be part of a separate device which is connected to the VR headset 20 by a wired or wireless connection.
  • the VR media player 10 may include a games console, or a PC configured to communicate visual data to the VR headset 20.
  • the VR media player 10 may form part of the VR headset 20.
  • the VR media player 10 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display.
  • the VR media player 10 may be a touchscreen device having a display over a major surface of the device, through which video content can be displayed.
  • the VR media player 10 may be inserted into a holder of a VR headset 20.
  • a smart phone or tablet computer may display the video data which is provided to a user's eyes via respective lenses in the VR headset 20.
  • the VR display system 1 may also include hardware configured to convert the device to operate as part of VR display system 1.
  • the VR media player 10 may be integrated into the VR headset 20.
  • the VR media player 10 may be implemented in software, hardware, firmware or a combination thereof.
  • a device comprising VR media player software is referred to as the VR media player 10.
  • the VR display system 1 may include means for determining the spatial position of the user and/or orientation of the user's head. This may be by means of determining the spatial position and/or orientation of the VR headset 20. Over successive time frames, a measure of movement may therefore be calculated and stored. Such means may comprise part of the VR media player 10. Alternatively, the means may comprise part of the VR headset 20.
  • the VR headset 20 may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These motion tracking sensors generate position data from which a current visual field-of-view (FOV) is determined and updated as the user, and so the VR headset 20, changes position and/or orientation.
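  • As an illustration of how such orientation data might feed the rendering, the following sketch builds a head-orientation rotation matrix from yaw, pitch and roll angles; the axis conventions and function name are assumptions for illustration only, not part of this disclosure.

```python
import numpy as np

# Minimal sketch (assumed conventions): yaw about Y, pitch about X, roll about Z,
# applied in that order. A renderer could use the resulting matrix to update the
# visual field-of-view as the headset orientation changes.
def head_rotation(yaw, pitch, roll):
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    rz = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])
    return ry @ rx @ rz

# Example: a slight head turn rotates the forward viewing direction accordingly.
forward = head_rotation(np.radians(10), 0.0, 0.0) @ np.array([0.0, 0.0, -1.0])
```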
  • the VR headset 20 will typically comprise two digital screens for displaying stereoscopic video images of the virtual world in front of respective eyes of the user, and also two speakers for delivering audio, if provided from the VR media player 10.
  • the embodiments herein, which primarily relate to the delivery of VR content, are not limited to a particular type of VR headset 20.
  • the VR display system 1 may include means for determining the gaze direction of the user.
  • gaze direction may be determined using eye tracking sensors provided in the VR headset 20.
  • the eye tracking sensors may, for example, be miniature cameras installed proximate the video screens which identify in real-time the pupil position of each eye.
  • the identified positions may be used to determine which part of the current visual FOV is of interest to the user.
  • This information can be used for example to identify one or more sub-sets of content within the video data, e.g. objects or regions projected at a particular depth within the content. For example, the convergence point of both eyes may be used to identify a reference depth.
  • the VR display system 1 may be configured to display VR video data to the VR headset 20 based on spatial position and/or the orientation of the VR headset.
  • a detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual data to reflect a position or orientation transformation of the user with reference to the space into which the visual data is projected. This allows VR content data to be consumed with the user experiencing a stereoscopic or 3D VR environment.
  • Audio data may also be provided to headphones provided as part of the VR headset 20.
  • the audio data may represent spatial audio source content.
  • Spatial audio may refer to directional rendering of audio in the VR space or world such that a detected change in the user's spatial position or in the orientation of their head may result in a corresponding change in the spatial audio rendering to reflect a transformation with reference to the space in which the spatial audio data is rendered.
  • the angular extent of the environment observable through the VR headset 20 is called the visual field of view (FOV).
  • the actual FOV observed by a user depends on the inter-pupillary distance and on the distance between the lenses of the VR headset 20 and the user's eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the VR headset is being worn by the user.
  • a remote content provider 30 may store and transmit streaming VR content data for output to the VR headset 20. Responsive to receive or download requests sent by the VR media player 10, the content provider 30 streams the VR data over a data network 15, which may be any network, for example an IP network such as the Internet.
  • the remote content provider 30 may or may not be the location or system where the VR video is captured, created and/or processed. For illustration purposes, we may assume that the content provider 30 also captures, encodes and stores the VR content.
  • an example VR capturing device is in the form of a multi-camera system 31.
  • the multi-camera system 31 comprises a generally spherical body 32 around which are distributed a plurality of video cameras 33.
  • eight video cameras 33 may be provided, each having an approximate 195° field-of-view.
  • FIG 4 shows in plan-view a real world space 36 which may be an indoors scene, an outdoors scene, a concert, a conference or indeed any real-world situation.
  • the multi-camera system 31 may be supported on a floor 37 of the real-world space 36 in front of first to fourth objects 38, 39, 40, 41.
  • the first to fourth objects 38, 39, 40, 41 may be static objects or they may move over time.
  • One or more of the first to fourth objects 38, 39, 40, 41 may be a person, an animal, a natural or geographic feature, an inanimate object, a celestial body etc.
  • One or more of the first to fourth objects 38, 39, 40, 41 may generate audio, e.g. if the object is a singer, a performer or a musical instrument. A greater or lesser number of objects may be present.
  • the position of the multi-camera system 31 may be known, e.g. through predetermined positional data or signals derived from a positioning tag on the VR capture device.
  • a positioning tag may be any module capable of indicating through data its respective spatial position to the post-processing module 35.
  • a positioning tag may be a high accuracy indoor positioning (HAIP) tag which works in association with one or more HAIP locators within the space 36.
  • HAIP systems use Bluetooth Low Energy (BLE) communication between the tags and the one or more locators.
  • a respective HAIP locator may be to the front, left, back and right of the multi-camera system 31.
  • Each tag sends BLE signals from which the HAIP locators derive the tag position and, therefore, the audio source location.
  • such direction of arrival (DoA) positioning systems are based on (i) a known location and orientation of the or each locator, and (ii) measurement of the DoA angle of the signal from the respective tag towards the locators in the locators' local co-ordinate system. Based on the location and angle information from one or more locators, the position of the tag may be calculated using geometry.
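  • Purely as a hedged illustration of that geometry (the two-locator setup and names below are assumptions, not taken from this disclosure), a tag position can be estimated by intersecting the bearing rays implied by two DoA measurements:

```python
import math

def intersect_doa_rays(p1, a1, p2, a2):
    """Estimate a tag position from two direction-of-arrival (DoA) bearings.

    p1, p2 -- (x, y) positions of two locators in a shared coordinate frame
    a1, a2 -- absolute bearing angles (radians) towards the tag, i.e. each
              locator's own orientation plus its measured DoA angle
    Returns the (x, y) intersection of the two bearing rays, or None if the
    rays are (near-)parallel.
    """
    d1 = (math.cos(a1), math.sin(a1))
    d2 = (math.cos(a2), math.sin(a2))
    # Solve p1 + t*d1 = p2 + s*d2 for t using the 2D cross product.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None  # bearings are parallel; no reliable intersection
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# Example: two locators 4 m apart, each reporting a bearing towards the tag.
print(intersect_doa_rays((0.0, 0.0), math.radians(45), (4.0, 0.0), math.radians(135)))
# -> approximately (2.0, 2.0)
```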
  • the position of the first to fourth objects 38, 39, 40, 41 may be determined using a separate camera system.
  • the post-processing module 35 is a processing system, possibly having an associated user interface (UI) 44 which may be used for example by an engineer or mixer to monitor, modify and/or control any aspect of the captured video and/or audio.
  • Embodiments herein also enable provision and editing of control data for association with captured video data to enable one or more occluded regions to be represented, as will be explained later on.
  • the post-processing module 35 receives as input from the multi-camera system 31 spatial video data (and possibly audio data) and positioning data, through a signal line 42.
  • the positioning data may be received from a HAIP locator.
  • the post- processing module 35 may also receive as input from one or more of the first to fourth objects 38, 39, 40, 41 audio data and positioning data from respective positioning tags through separate signal lines.
  • the post-processing module 35 generates and stores the VR video and audio data for output to a user device, such as the VR system 1 shown in Figures 1 and 2, via a signal line 47.
  • the input audio data may be multichannel audio in loudspeaker format, e.g. stereo signals, 4.0 signals, 5.1 signals, Dolby Atmos (RTM) signals or the like.
  • the input may be in a multi-microphone signal format, such as the raw eight-signal input from the Nokia OZO (RTM) VR camera, if used for the multi-camera system 31.
  • the microphone signals can then be rendered to loudspeaker or binaural format for playback.
  • the VR video and audio data may be provided to a streaming system 43, for example a streaming server.
  • the streaming system 43 may be part of, or an entirely separate system from, the post-processing module 35.
  • Signal line 45 indicates an input received over the network 15 from the VR system 1.
  • the VR system 1 indicates through such signalling the data to be streamed dependent on position and/or orientation of the VR display device 20.
  • the video data captured by the multi-camera system 31 may represent objects positioned at different respective distances from the multi-camera system.
  • the first, second and third objects 38, 39, 40 are located at different respective distances d1, d2, d3 from the multi-camera system 31.
  • the fourth object 41 is located behind the first object 38 and hence is occluded from the multi-camera system 31. Therefore, the captured video data will not include any representation of the fourth object 41.
  • the captured video data may subsequently be processed so that the rendered video data, when output to the VR display device 20, simulates the captured content at the respective depth planes. That is, when processed into stereoscopic video data (with slightly differing images being provided to the respective screens of the VR display device 20) the first to third objects 38, 39, 40 will appear to be at their respective distances d1, d2, d3 from the user's perspective. This is illustrated graphically in Figure 5, which shows a single frame 50 of panoramic content based on the Figure 4 scenario 34.
  • the captured video data is referred to hereafter as "foreground video content.”
  • the post-processing module 35 is arranged to generate first and second data sets respectively comprising the colour information of the foreground video content, and associated depth information indicative of the respective depths of pixels or pixel regions of said content. Using this information, the VR display system 1 can render and display the foreground video content in 3D.
  • the first data set may comprise RGB video data.
  • the second data set may comprise depth information D in any conventional format.
  • the depth information D may be generated using any known method, for example using a LiDAR sensor to generate a two-dimensional depth map.
  • a depth map may be generated using a stereo-pair of images.
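  • For illustration, one common stereo relation is depth = focal length × baseline / disparity; the following sketch (with purely illustrative values) applies it to a disparity map:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (in pixels) to metric depth; zero disparity maps to infinity."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return focal_length_px * baseline_m / disparity_px

# Example: 1000 px focal length, 6.5 cm baseline, 20 px disparity -> 3.25 m.
print(disparity_to_depth(20.0, 1000.0, 0.065))
```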
  • the depth information D may include data representative of d1, d2, d3 shown in Figure 4, as well as other depth information for other pixels of the foreground video content.
  • the first and second data sets RGB and D are provided in a form suitable for streaming on a frame-by-frame basis by the streaming system 43.
  • the first and second data sets may be termed RGB-D data sets, a term used in some known systems.
  • the first and second data sets may be streamed simultaneously.
  • the foreground video content will not include any representation of the fourth object 41.
  • the occluded region behind the foreground video content may be revealed. Without any video data for the revealed region, a space or invalid pixel data will appear unless further processing is performed.
  • Figure 6 shows a subsequent frame seen by a user wearing the VR headset 20 if they move their head leftwards. Three regions 52, 53, 54 are revealed, which regions are not represented in the foreground video content.
  • one or more regions not represented in the foreground video content is or are referred to as background regions.
  • Embodiments herein provide methods and systems for generating one or more background regions without the need to separately transmit multiple layers, including layers for the background behind the foreground video content, which would require additional decoding and rendering processing at playback time.
  • embodiments comprise transmitting control data with the foreground video content data.
  • the control data may be metadata.
  • the control data may comprise a relatively small amount of additional or augmenting data, requiring far less bandwidth than would be required for transmitting multiple layers.
  • control data may be associated with each frame of foreground video content data.
  • control data may be provided at the post-processing module 35 of the content provider 30 and streamed by means of the streaming system 43 to the VR display system 1.
  • the received control data may be used to create background video content data corresponding to the background regions.
  • the foreground and background regions may then be rendered and combined for output to the VR display device 20.
  • the control data may be authored, and subsequently edited prior to transmitting the foreground video content.
  • the VR display system 1 receiving the control data may then dynamically generate the background regions to fill-in the occluded regions.
  • components of the post-processing module 35 are shown.
  • the post-processing module 35 may comprise a controller 61, RAM 63, a memory 65, and, optionally, hardware keys 67 and a display 69.
  • the post-processing module 35 may comprise a network interface 71, which may be a data port for connecting the system to the network 15 or the streaming module 43.
  • the network interface 71 may additionally or alternatively comprise a radiofrequency wireless interface for transmitting and/or receiving the post-processed data using a wireless communications protocol, e.g. WiFi or Bluetooth.
  • An antenna 73 may be provided for this purpose.
  • the controller 61 may receive captured RGB video data from the multi-camera system 31 which represents the foreground video data for successive frames.
  • the controller may also receive depth information, e.g. a depth map.
  • a depth map may be associated with successive frames of the foreground video data.
  • the memory 65 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 65 stores, amongst other things, an operating system 74 and may store software applications 75.
  • the RAM 63 is used by the controller 61 for the temporary storage of data.
  • the operating system 74 may contain code which, when executed by the controller 61 in conjunction with the RAM 63, controls operation of each of the hardware components of the post-processing system 35.
  • the controller 61 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors.
  • the post-processing system 35 may also be associated with external software applications not stored on the camera. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications.
  • the post-processing system 35 may be in communication with the remote server device in order to utilize the software application stored there.
  • Figure 8 is a flow diagram indicating steps performed by one or more software applications in accordance with embodiments. For example, Figure 8 may represent steps performed by the software application 75 in Figure 7, which is a post-processing application. It should be appreciated that certain ones of the steps shown in Figure 8 can be re-ordered. The numbering of steps is not necessarily indicative of their required order of processing.
  • foreground video content data which may be in the form of RGB data and corresponding depth data D for successive frames is provided, for example received from the multi-camera system 31.
  • control data for the foreground video content data is provided, for example received from user input or from another data source. Steps 8.1 and 8.2 may be performed in reverse order or at the same time.
  • the foreground video content data and the control data are sent to one or more user- end systems, for example the VR display system 1. This may be by means of streaming, which may be performed for both data sets simultaneously.
  • the step 8.2 of providing the control data may comprise receiving control data from, or generated using input from, the user interface 44.
  • the control data may be authored by a director or editor using the user interface 44.
  • Figure 9 is a flow diagram indicating steps performed by one or more software applications for generating the control data provided in step 8.2, in accordance with embodiments.
  • a first step 9.1 comprises receiving foreground copy control data.
  • a subsequent step 9.2 comprises receiving modification control data.
  • a subsequent step 9.3 comprises receiving background render control data.
  • a subsequent step 9.4 comprises generating the control data.
  • Certain steps may be omitted and/or re-ordered.
  • steps 9.1, 9.2 and 9.3 may be performed in any order.
  • step 9.2 may be considered optional.
  • the control data, for example each of the foreground copy control data, modification control data and background render control data, is decodable by the VR display system 1 to update one or more background regions on a frame-by-frame basis.
  • the foreground copy control data may indicate one or more regions of the foreground video content, e.g. the RGB video content and depth information D, to copy into a first memory of the VR display system 1, such as a local cache.
  • the foreground copy control data may comprise one or more copy commands, and multiple regions/commands may be issued per frame.
  • the foreground copy control data may copy different shaped regions of the foreground video content, for example one or more of rectangles, triangles, squares, circles and arbitrary polygons.
  • the foreground copy control data may refer to frames other than the current frame.
  • the modification control data may indicate one or more modifications to be applied by the VR display system 1 to one or more of the foreground regions copied to the first memory of the VR display system 1.
  • the modification control data may comprise one or more commands for moving and/or reshaping the copied foreground regions, and/or changing the depth information to account for 3D motion of background objects.
  • Example modification control data commands include, but are not limited to: modifying a source rectangle (x_src, y_src, w_src, h_src) into a destination rectangle (x_dst, y_dst, w_dst, h_dst) with an optional constant depth offset d, which enables both scaling and movement in 3D; and modifying a source triangle (x0_src, y0_src, x1_src, y1_src, x2_src, y2_src) into a destination triangle (x0_dst, y0_dst, x1_dst, y1_dst, x2_dst, y2_dst) with optional per-vertex depth offsets d0, d1, d2, which enables motion of more refined regions as well as approximation of rotating 3D objects.
  • the background render control data may indicate one or more ways in which the VR display system 1 may fill-in or combine the copied and, where applicable, modified foreground regions into the background regions.
  • any revealed background regions, e.g. due to user movement, may be filled with parts of the copied and modified foreground regions according to the background render control data.
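  • The disclosure does not prescribe a concrete wire format for this control data; purely as an illustrative sketch, the three kinds of commands described above could be carried per frame along the following lines (all type and field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative, hypothetical encoding of per-frame control data; the patent
# does not define these structures or field names.

@dataclass
class CopyCommand:
    """Copy a region of the foreground RGB+depth frame into the background cache."""
    region_id: int
    shape: str                            # e.g. "rect", "triangle", "polygon"
    vertices: List[Tuple[int, int]]       # pixel coordinates defining the region
    source_frame: Optional[int] = None    # a previous frame index, or None for the current frame

@dataclass
class ModifyCommand:
    """Scale/move/reshape a cached region and optionally offset its depth."""
    region_id: int
    dst_vertices: List[Tuple[int, int]]
    depth_offset: float = 0.0             # constant offset; per-vertex offsets are also possible

@dataclass
class RenderCommand:
    """Fill a revealed background area from a cached region."""
    region_id: int
    target_vertices: List[Tuple[int, int]]

@dataclass
class FrameControlData:
    frame_index: int
    copy_commands: List[CopyCommand] = field(default_factory=list)
    modify_commands: List[ModifyCommand] = field(default_factory=list)
    render_commands: List[RenderCommand] = field(default_factory=list)
```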
  • the control data is preferably generated in a content authoring phase, e.g. using the postprocessing module 35, so that no run-time optimization logic is needed at the VR display system 1.
  • the rendering of said regions is performed in a similar manner as for the foreground layer, effectively projecting the RGB data and the depth information D into 3D geometry.
  • Figure 10 shows the post-processing module 35, the user interface 44 and the separate streams of RGB data 75, depth data 76 and control data 77 being transmitted to the VR display system 1 via the network 15 shown in Figure 1.
  • the media player 10 may comprise a controller 81, RAM 83, a memory 85, and, optionally, hardware keys 87 and a display 89.
  • the media player 10 may comprise a network interface 91, which may be a data port for connecting the system to the network 15 or the streaming module 43.
  • the network interface 91 may additionally or alternatively comprise a radiofrequency wireless interface for transmitting and/or receiving the post-processed data using a wireless communications protocol, e.g. WiFi or Bluetooth.
  • An antenna 93 may be provided for this purpose.
  • the controller 81 may receive via the network interface 91 the separate streams of RGB data 75, depth data 76 and control data 77 for successive frames.
  • the controller may also receive the depth information, e.g. a depth map.
  • the controller 81 may transmit and receive information with the VR headset 20.
  • the memory 85 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 85 stores, amongst other things, an operating system 94 and may store software applications 95.
  • the RAM 83 is used by the controller 81 for the temporary storage of data.
  • the operating system 94 may contain code which, when executed by the controller 81 in conjunction with the RAM 83, controls operation of each of the hardware components of the media player 10.
  • the controller 81 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors.
  • the media player 10 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications.
  • the media player 10 may be in communication with the remote server device in order to utilize the software application stored there.
  • Figure 12 is a flow diagram indicating steps performed by one or more software applications in accordance with embodiments.
  • Figure 12 may represent steps performed by the software application 95 in Figure 11, which is an application for decoding the above- mentioned control data 77 and for providing background data for filling- in the missing parts in accordance with the control data. It should be appreciated that certain ones of the steps shown in Figure 12 can be re-ordered and/or omitted. The numbering of steps is not necessarily indicative of their required order of processing.
  • in a first step 12.1, the foreground video content data is received.
  • the received foreground video content data is stored in a first memory.
  • the control data is received.
  • a copy of one or more regions of the foreground data, identified in the control data, is stored in a second memory.
  • the second memory may be a separate memory apparatus or the same memory apparatus, with appropriate partitioning or segmentation.
  • the memory may comprise any suitable form of storage apparatus, for example a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory may be cache memory.
  • the one or more copied foreground regions can be of any shape and size, comprising any number of pixels, as determined by the control data. Also, the one or more copied foreground regions need not be copied from the current frame; the control data may refer to one or more previous frames, for example.
  • background video content is provided for one or more regions not represented in the foreground data, using at least part of the foreground region(s) which have been stored in the second memory.
  • the foreground and background video content data may be rendered.
  • the rendered foreground and background video content may be output to a user viewing device, or buffered for such output.
  • Figure 13 is a flow diagram indicating additional steps that may be performed by one or more software applications, for example the software application 95 in Figure 11.
  • Steps 13.1 - 13.4 correspond with steps 12.1 - 12.4 respectively.
  • the one or more copied foreground regions may be modified in accordance with the control data.
  • one or more background regions not stored in the first memory is or are identified.
  • one or more occluded regions for which there is no foreground data is or are identified.
  • background video content data is provided for one or more regions not represented in the foreground data, using at least part of the foreground region data stored in the second memory.
  • foreground and background video content data is rendered based on the control data.
  • the rendered foreground and background video content may be output to a user viewing device, or buffered for such output.
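  • As a minimal sketch of the identification and fill steps above, assuming the foreground and cached background are held as H x W x 4 RGB-plus-depth arrays and that a zero depth marks pixels with no foreground data for the current viewing position (an assumed convention, not one defined in this disclosure):

```python
import numpy as np

def fill_background(foreground, background_cache):
    """Fill regions not represented in the foreground from cached background content."""
    revealed = foreground[..., 3] == 0             # identify regions having no video content
    output = foreground.copy()
    output[revealed] = background_cache[revealed]  # provide background video content there
    return output, revealed

# Toy example: a 2x2 hole in a 4x4 foreground frame is filled from the cache.
fg = np.zeros((4, 4, 4))
fg[..., 3] = 1.0
fg[1:3, 1:3, 3] = 0.0
bg = np.ones((4, 4, 4))
filled, mask = fill_background(fg, bg)
```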
  • the received foreground video content data may comprise sub-streams of RGB colour data and depth information D.
  • the foreground video content data may represent a single layer of a 360° panorama.
  • the foreground video content data may be backwards-compatible.
  • the depth information D may be a depth map generated using any known means, for example by use of a LiDAR sensor or processing on stereo-pairs of images.
  • Figure 14 is a functional block diagram which illustrates the process performed at the media player 10 in more detail.
  • Solid arrows refer to the transfer of graphical (RGB and depth) data and broken arrows refer to the transfer of control data.
  • Element 100 represents the video stream received from the content provider 30 over the network 15. This may comprise the control data as a separate stream.
  • Element 101 represents the received foreground video content (RGB and depth information D).
  • the foreground video content may be transferred to a foreground rendering module 103 which is used to render said foreground video content.
  • Element 102 represents an update background cache module 102, which responds to foreground copy commands extracted from the control data.
  • the update background cache module 102 decodes the foreground copy commands to identify which foreground regions are to be copied, and then fetches these from element 101 for storage in a background cache module (element 104).
  • the background cache module 104 may be a persistent colour plus depth buffer containing data that is managed over all frames according to the control data.
  • the background cache module 104 may be allocated as part of a Graphics Processing Unit (GPU) memory.
  • the copied foreground regions may be termed background regions or background video content.
  • the foreground copy control commands may copy different shaped regions of the foreground video content, for example one or more of rectangles, triangles, squares, circles and arbitrary polygons.
  • the foreground copy control data may refer to frames other than the current frame.
  • Element 105 represents a modification module 105 where one or more of scaling, transforming, animating and modifying the depth of the background video content in the background cache module 104 may be performed, in accordance with background move commands decoded from the video stream 100.
  • the modification control data may comprise one or more commands for moving and/or reshaping the background video content, and/or changing the depth information to account for 3D motion.
  • the background cache module 104 may employ double buffering for efficient implementation of the modification control data commands.
  • Example modification control data commands include, but are not limited to: modifying a source rectangle (x_src, y_src, w_src, h_src) into a destination rectangle (x_dst, y_dst, w_dst, h_dst) with an optional constant depth offset d, which enables both scaling and movement in 3D; and modifying a source triangle (x0_src, y0_src, x1_src, y1_src, x2_src, y2_src) into a destination triangle (x0_dst, y0_dst, x1_dst, y1_dst, x2_dst, y2_dst) with optional per-vertex depth offsets d0, d1, d2, which enables motion of more refined regions as well as approximation of rotating 3D objects.
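  • As a minimal sketch of the rectangle command above, assuming the background cache is held as an H x W x 4 RGB-plus-depth array (the array layout, sampling method and function name are illustrative assumptions, not the disclosed implementation):

```python
import numpy as np

def apply_rect_modification(cache, src, dst, depth_offset=0.0):
    """Copy a source rectangle of the cache into a destination rectangle,
    rescaling with nearest-neighbour sampling and offsetting the depth channel."""
    xs, ys, ws, hs = src            # source rectangle (x, y, width, height)
    xd, yd, wd, hd = dst            # destination rectangle
    patch = cache[ys:ys + hs, xs:xs + ws].copy()
    # Nearest-neighbour resample of the patch to the destination size.
    row_idx = np.arange(hd) * hs // hd
    col_idx = np.arange(wd) * ws // wd
    resized = patch[row_idx][:, col_idx]
    resized[..., 3] += depth_offset             # adjust the depth channel only
    cache[yd:yd + hd, xd:xd + wd] = resized
    return cache

# Example: scale a 64x64 region up to 128x128 and push it 10 units further away.
cache = np.zeros((1080, 1920, 4), dtype=np.float32)
apply_rect_modification(cache, (100, 200, 64, 64), (300, 400, 128, 128), depth_offset=10.0)
```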
  • the modified background video content may be copied to a background rendering module 106.
  • the background rendering module 106 is configured to receive another set of control data, namely one or more background region render commands, for filling-in the occluded foreground regions identified in the foreground rendering memory 103.
  • the one or more background region render commands may dictate which background video content stored in the background cache 104 is inserted into the occluded foreground regions.
  • the selected background video content is rendered at the appropriate pixel positions. This enables performance optimizations by limiting the background rendering workload to only the necessary content in each frame.
  • the control data authoring requires no runtime optimization logic.
  • Rendering comprises projecting the colour and depth information D of the selected regions into 3D geometry.
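  • For illustration, projecting colour and depth into 3D geometry can be sketched with a simple pinhole model; the intrinsics below are assumed values, and a panoramic (e.g. equirectangular) projection would use a different mapping:

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Return an H x W x 3 array of 3D points for an H x W depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack([x, y, depth])

# Each resulting 3D point carries the colour of its source pixel when rendered.
points = unproject(np.full((480, 640), 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```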
  • Element 107 represents an output framebuffer which stores the combined foreground and background video content.
  • the output framebuffer 107 may store data representing a plurality of frames for output to the VR headset 20 responsive to positional signals received therefrom.
  • Figures 15a - e are graphical examples which are useful for explaining the Figure 12 method.
  • Figure 15a shows the first object 38 shown in Figure 5.
  • Figure 15b shows the occluded region 110 not present in the foreground video content.
  • Figure 15c shows the potential result of a user moving their head towards the left-hand side, revealing the occluded region.
  • Figure 15d, which results from methods disclosed herein, shows a copied foreground region 112 which includes the first object 38.
  • the grid pattern indicates that the copied foreground region 112 may be considered an arrangement of squares (or other geometric shapes) which would allow more complex foreground regions to be copied.
  • the foreground region 112 in this case is scaled and has corresponding depth information D added to reflect that the author wishes to represent the fourth object 41 in background video content.
  • the foreground region 112 to be copied, the modification to be performed, and the method of rendering are scripted in the control data and transmitted with the foreground data to the media player 10 for rendering at said media player in accordance with steps outlined above.
  • the positional data from the VR headset 20 is received and the generated background video content becomes visible.
  • the control data provides a controllable and editable way of representing the background content whilst reducing streaming bandwidth and rendering workload. No additional video data is needed for the background content and the rendering complexity for the background content can be optimized at the authoring time.
  • the control data may be edited independently of the foreground video content data, and so the background video content can be changed simply by changing the sequence of background control commands. This can enable savings in post- production where depth extraction and video compression are time-consuming operations.
  • the received video stream may be backwards-compatible, for example for use with legacy applications and applications not having head tracking.

Abstract

A method and system for three-dimensional video content processing is disclosed comprising an operation of receiving a first data stream representing a plurality of frames of three- dimensional foreground video content captured from a first location. Another operation involves storing the foreground video content in a first memory. Another operation involves receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content. Another operation involves storing a copy of the one or more selected regions in a second memory. Another operation involves identifying one or more background regions not represented in the first memory. Another operation involves providing background video content for the one or more background regions using at least part of the selected video content stored in the second memory. A method and system is also disclosed for providing the first and second data streams.

Description

Three-Dimensional Video Processing
Field of the Invention
This invention relates to methods and systems for three-dimensional video processing, for example in virtual reality applications.
Background of the Invention
Virtual reality (VR) is a rapidly developing area of technology in which video content is provided, e.g. streamed, to a VR display system. A VR display system may be provided with a live or stored feed from a video content source, the feed representing a VR space or world for immersive output through the display system. In some embodiments, audio is provided, which may be spatial audio. A virtual space or virtual world is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed through a display system such as a VR headset. A VR headset may be configured to provide VR video and audio content to the user, e.g. through the use of a pair of video screens and headphones incorporated within the headset.
The VR feed may comprise data representing a plurality of frames of three-dimensional (3D) video content which provides a 3D representation of a space and/or objects which appear to have depth when rendered to the VR headset. The 3D video content may comprise a colour (e.g. RGB) stream and a corresponding depth stream indicating the depth information for different parts of the colour stream.
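Purely as an illustrative sketch of such a colour-plus-depth representation (the container and field names below are assumptions, not part of this disclosure), one frame might be held in memory as follows:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDFrame:
    """One frame of 3D video content: a colour image plus an aligned per-pixel depth map."""
    rgb: np.ndarray     # H x W x 3, sample of the colour stream
    depth: np.ndarray   # H x W, sample of the depth stream aligned with the colour image

    def __post_init__(self):
        if self.rgb.shape[:2] != self.depth.shape:
            raise ValueError("colour and depth must share the same resolution")
```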
Position and/or movement of the user device can enhance the immersive experience. VR headsets use so-called three degrees of freedom (3DoF), which means that head movement in the yaw, pitch and roll axes is measured and determines what the user sees.
This facilitates the scene remaining largely static in a single location as the user rotates their head. A next stage may be referred to as 3DoF+, which may facilitate limited translational movement in Euclidean space in the range of, e.g. tens of centimetres, around a location. A yet further stage is a six degrees of freedom (6DoF) VR system, where the user is able to freely move in the Euclidean space and rotate their head in the yaw, pitch and roll axes.
6DoF VR systems and methods will enable the provision and consumption of volumetric VR content. Volumetric VR content comprises data representing spaces and/or objects in three dimensions from all angles, enabling the user to move fully around the spaces and/or objects to view them from any angle. For example, a person or object may be fully scanned and reproduced within a real-world space. When rendered to a VR headset, the user may 'walk around' the person or object and view them from the front, the sides and from behind.
Summary of the Invention
A first aspect of the invention provides a method comprising: receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; storing the foreground video content in a first memory; receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory. The rendered foreground and background video content may be stored in an output buffer for outputting to a user viewing device as a buffered sequence of frames.
The rendering of the background video content may be based on background rendering instructions provided in the control data.
The method may further comprise receiving positional data indicative of user movement and wherein the rendering of the background video content comprises identifying one or more newly-visible regions based on the positional data and rendering the background video content corresponding to the one or more newly- visible regions.
The positional data may be received from a user viewing device and wherein the newly-visible regions correspond to regions occluded from the user's viewing perspective.
Identifying the one or more newly-visible regions may comprise identifying regions having no video content.
The one or more selected regions may comprise one or more predetermined polygons.
The method may further comprise applying or associating a modification to the one or more selected regions. The modification may comprise one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions.
The control data may further indicate the modification to be applied or associated to each of the one or more regions.
The first and second data streams may be received simultaneously.
The second data stream may comprise control data associated with each frame of foreground video content.
The first data stream may comprise a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
The second memory may be a persistent colour plus depth buffer that is managed over multiple frames according to the control data.
The second memory may use double buffering.
The method may be performed at a media playing device associated with a user viewing device.
A second aspect of the invention provides a method comprising: providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; providing a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected one or more regions of the foreground video content which are instructed to be copied.
The one or more instructions for providing the background video content may comprise rendering instructions for the background video content for output to a user viewing device. The rendering instructions may further indicate how the background video content is to be rendered based on positional data indicative of user movement which reveals one or more newly visible regions. The one or more instructions for copying one or more selected regions of the provided foreground video content may comprise identifying one or more predetermined polygons.
The control data may further comprise one or more instructions for applying or associating a modification to the one or more selected regions of the foreground video content.
The modification instructions may comprise one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions of the foreground video content.
The first and second data streams may be transmitted simultaneously.
The second data stream may comprise control data associated with each frame of video content.
The first data stream may comprise a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
The method may be performed at a content provider system configured to send the first and second data streams to one or more remote media playing devices.
A third aspect of the invention provides a computer program comprising instructions that when executed by a computer control it to perform the method of any preceding definition.
A fourth aspect of the invention provides an apparatus configured to perform the method steps of any preceding method definition.
A fifth aspect of the invention provides a non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; storing the foreground video content in a first memory; receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
A sixth aspect of the invention provides a non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; providing a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
A seventh aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; to store the foreground video content in a first memory; to receive a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions; to store a copy of the one or more selected regions in a second memory; and to provide background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
An eighth aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to provide a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location; to provide a second data stream representing control data associated with the first data stream, the control data including: one or more instructions for copying one or more selected regions of the provided foreground video content; and one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
Brief Description of the Drawings
The invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 is a perspective view of a VR display system;
Figure 2 is a block diagram of a computer network including the Figure 1 VR display system, according to embodiments of the invention;
Figure 3 is a side view of a multi-camera device for capturing panoramic images;
Figure 4 is a schematic diagram of an example VR capture scenario and a content provider system, according to embodiments of the invention;
Figure 5 is a representational view of one frame of panoramic video resulting from the Figure 4 scenario;
Figure 6 is a representational view of a subsequent frame of panoramic video, indicative of a user moving their head to one side;
Figure 7 is a schematic diagram of components of a content provider system shown in Figure 4;
Figure 8 is a flow diagram showing processing steps performed at the content provider system of Figure 4, according to embodiments of the invention;
Figure 9 is a flow diagram showing processing steps performed at the content provider system of Figure 4 for generating control data for the method of Figure 8;
Figure 10 is a schematic diagram showing part of the Figure 4 content provider system and different streams of data, according to embodiments of the invention;
Figure 11 is a schematic diagram of components of a media player shown in Figure 2;
Figure 12 is a flow diagram showing processing steps performed at the media player of Figure 11, according to embodiments of the invention;
Figure 13 is a flow diagram showing further processing steps performed at the media player of Figure 11, according to embodiments of the invention;
Figure 14 is a block diagram showing functional modules and processes involved in the Figure 12 or Figure 13 methods, according to embodiments of the invention; and
Figures 15a - 15e are graphical representations which are useful for understanding different stages of the Figure 12 or Figure 13 methods, according to embodiments of the invention.
Detailed Description of Preferred Embodiments
Embodiments herein relate to processing video content, particularly three-dimensional (3D) video content. For example, the 3D video content may be virtual reality (VR) video content, representing a plurality of frames of VR data for output to a VR headset or a similar display system. The 3D video content may represent panoramic video content.
Such methods and systems are applicable to related technologies, including Augmented Reality (AR) technology and panoramic video technology.
Video content is represented by video data in any format. The video data may be captured and provided from any image sensing apparatus, for example a single camera or a multi-camera device, e.g. Nokia's OZO camera. The methods and systems described herein are applicable to video content captured by, for example, monoscopic cameras, stereoscopic cameras, 360 degree panoramic cameras and other forms of VR or AR camera.
In some embodiments, the captured video data may be stored remotely from the one or more users, and streamed to users over a network. The network may be an IP network such as the Internet. The video data may be stored local to the one or more users on a memory device, such as a hard disk drive (HDD) or removable media such as a CD-ROM, DVD or memory stick. Alternatively, the video data may be stored remotely on a cloud-based system.
In embodiments described herein, the video data is stored remotely from one or more users at a content server. The video data is streamed over an IP network to a display system associated with one or more users. The data stream of the video data may represent one or more VR spaces or worlds for immersive output through the display system. In some embodiments, audio may also be provided, which may be spatial audio.
Figure 1 is a schematic illustration of a VR display system 1 which represents user-end equipment. The VR display system 1 includes a user device in the form of a VR headset 20 for displaying video data representing a VR space, and a VR media player 10 for rendering the video data on the VR headset 20. In some embodiments, a separate user control (not shown) may be associated with the VR display system 1, e.g. a hand-held controller.
In the context of this specification, a virtual space or world is any computer-generated version of a space, for example a captured real world space, in which a user can be immersed. It may comprise one or more objects. The VR headset 20 may be of any suitable type. The VR headset 20 may be configured to provide VR video and audio content data to a user. As such, the user may be immersed in virtual space.
The VR headset 20 receives the VR video data from a VR media player 10. The VR media player 10 may be part of a separate device which is connected to the VR headset 20 by a wired or wireless connection. For example, the VR media player 10 may include a games console, or a PC configured to communicate visual data to the VR headset 20.
Alternatively, the VR media player 10 may form part of the VR headset 20.
Here, the VR media player 10 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display. For example, the VR media player 10 may be a touchscreen device having a display over a major surface of the device, through which video content can be displayed. The VR media player 10 may be inserted into a holder of a VR headset 20. With such VR headsets 20, a smart phone or tablet computer may display the video data which is provided to a user's eyes via respective lenses in the VR headset 20. The VR display system 1 may also include hardware configured to convert the device to operate as part of VR display system 1. Alternatively, the VR media player 10 may be integrated into the VR headset 20. The VR media player 10 may be implemented in software, hardware, firmware or a combination thereof. In some embodiments, a device comprising VR media player software is referred to as the VR media player 10.
The VR display system 1 may include means for determining the spatial position of the user and/or orientation of the user's head. This may be by means of determining the spatial position and/or orientation of the VR headset 20. Over successive time frames, a measure of movement may therefore be calculated and stored. Such means may comprise part of the VR media player 10. Alternatively, the means may comprise part of the VR headset 20. For example, the VR headset 20 may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These motion tracking sensors generate position data from which a current visual field-of-view (FOV) is determined and updated as the user, and so the VR headset 20, changes position and/or orientation.
The VR headset 20 will typically comprise two digital screens for displaying stereoscopic video images of the virtual world in front of respective eyes of the user, and also two speakers for delivering audio, if provided from the VR media player 10. The embodiments herein, which primarily relate to the delivery of VR content, are not limited to a particular type of VR headset 20.
In some embodiments, the VR display system 1 may include means for determining the gaze direction of the user. In some embodiments, gaze direction may be determined using eye tracking sensors provided in the VR headset 20. The eye tracking sensors may, for example, be miniature cameras installed proximate the video screens which identify in real-time the pupil position of each eye. The identified positions may be used to determine which part of the current visual FOV is of interest to the user. This information can be used for example to identify one or more sub-sets of content within the video data, e.g. objects or regions projected at a particular depth within the content. For example, the convergence point of both eyes may be used to identify a reference depth.
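By way of non-limiting illustration only, the following sketch shows one way a reference depth might be estimated from the convergence of the two gaze rays. The function name, the closest-point-between-rays method and the example values are assumptions made for this sketch and are not prescribed by the embodiments described herein.

```python
import numpy as np

def convergence_depth(left_eye, left_dir, right_eye, right_dir):
    """Estimate a reference depth from the convergence of two gaze rays.

    Each ray is given by an eye position and a gaze direction. The point of
    closest approach between the two rays is taken as the convergence point,
    and its distance from the midpoint of the eyes is returned as the depth.
    """
    d1 = left_dir / np.linalg.norm(left_dir)
    d2 = right_dir / np.linalg.norm(right_dir)
    w0 = left_eye - right_eye
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:              # near-parallel gaze: no usable convergence
        return float("inf")
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = left_eye + t1 * d1
    p2 = right_eye + t2 * d2
    convergence = 0.5 * (p1 + p2)      # midpoint of the closest approach
    eye_centre = 0.5 * (left_eye + right_eye)
    return float(np.linalg.norm(convergence - eye_centre))

# Example: eyes 64 mm apart, both gazing at a point about 2 m ahead.
left = np.array([-0.032, 0.0, 0.0])
right = np.array([0.032, 0.0, 0.0])
target = np.array([0.0, 0.0, 2.0])
print(convergence_depth(left, target - left, right, target - right))  # ~2.0
```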
The VR display system 1 may be configured to display VR video data to the VR headset 20 based on spatial position and/or the orientation of the VR headset. A detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual data to reflect a position or orientation transformation of the user with reference to the space into which the visual data is projected. This allows VR content data to be consumed with the user experiencing a stereoscopic or 3D VR environment.
Audio data may also be provided to headphones provided as part of the VR headset 20. The audio data may represent spatial audio source content. Spatial audio may refer to directional rendering of audio in the VR space or world such that a detected change in the user's spatial position or in the orientation of their head may result in a corresponding change in the spatial audio rendering to reflect a transformation with reference to the space in which the spatial audio data is rendered.
The angular extent of the environment observable through the VR headset 20 is called the visual field of view (FOV). The actual FOV observed by a user depends on the inter-pupillary distance and on the distance between the lenses of the VR headset 20 and the user's eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the VR headset is being worn by the user.
Referring to Figure 2, a remote content provider 30 may store and transmit streaming VR content data for output to the VR headset 20. Responsive to receive or download requests sent by the VR media player 10, the content provider 30 streams the VR data over a data network 15, which may be any network, for example an IP network such as the Internet.
The remote content provider 30 may or may not be the location or system where the VR video is captured, created and/or processed. For illustration purposes, we may assume that the content provider 30 also captures, encodes and stores the VR content.
Referring to Figure 3, an example VR capturing device is in the form of a multi-camera system 31. The multi-camera system 31 comprises a generally spherical body 32 around which are distributed a plurality of video cameras 33. For example, eight video cameras 33 may be provided, each having an approximately 195° field-of-view. The multi-camera system 31 may therefore capture 360° panoramic images by stitching images from the individual cameras 33 together, taking into account overlapping regions. Nokia's OZO camera is one such example. Multiple microphones (not shown) may also be distributed around the body 32 for capturing spatial audio.
Referring to Figure 4, an overview of a VR capture scenario 34 is shown. The VR capture scenario 34 is shown together with a post-processing module 35 and an associated user interface 44. Figure 4 shows in plan-view a real world space 36 which may be an indoors scene, an outdoors scene, a concert, a conference or indeed any real-world situation. The multi-camera system 31 may be supported on a floor 37 of the real-world space 36 in front of first to fourth objects 38, 39, 40, 41. The first to fourth objects 38, 39, 40, 41 may be static objects or they may move over time. One or more of the first to fourth objects 38, 39, 40, 41 may be a person, an animal, a natural or geographic feature, an inanimate object, a celestial body etc. One or more of the first to fourth objects 38, 39, 40, 41 may generate audio, e.g. if the object is a singer, a performer or a musical instrument. A greater or lesser number of objects may be present. The position of the multi-camera system 31 may be known, e.g. through predetermined positional data or signals derived from a positioning tag on the VR capture device.
One or more of the first to fourth objects 38, 39, 40, 41 may carry a positioning tag. A positioning tag may be any module capable of indicating through data its respective spatial position to the post-processing module 35. For example a positioning tag may be a high accuracy indoor positioning (HAIP) tag which works in association with one or more HAIP locators within the space 36. HAIP systems use Bluetooth Low Energy (BLE) communication between the tags and the one or more locators. For example, there may be four HAIP locators mounted on, or placed relative to, the multi-camera system 31. A respective HAIP locator may be to the front, left, back and right of the multi-camera system 31. Each tag sends BLE signals from which the HAIP locators derive the tag location and, therefore, the audio source location. In general, such direction of arrival (DoA) positioning systems are based on (i) a known location and orientation of the or each locator, and (ii) measurement of the DoA angle of the signal from the respective tag towards the locators in the locators' local co-ordinate system. Based on the location and angle information from one or more locators, the position of the tag may be calculated using geometry.
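By way of non-limiting illustration, the sketch below shows the underlying geometry in a simplified two-dimensional case: two locators at known positions each report a bearing angle to the tag, and the two bearing lines are intersected. Real HAIP positioning operates in three dimensions and in each locator's local co-ordinate frame; the function and the example values are assumptions made for this sketch only.

```python
import numpy as np

def position_from_doa(locator_a, angle_a, locator_b, angle_b):
    """Intersect two bearing lines to estimate a tag position in 2D.

    Each locator reports the direction-of-arrival angle of the tag's BLE
    signal in a shared world frame (radians, measured from the x-axis).
    """
    da = np.array([np.cos(angle_a), np.sin(angle_a)])
    db = np.array([np.cos(angle_b), np.sin(angle_b)])
    # Solve locator_a + t*da = locator_b + s*db for t (and s).
    A = np.column_stack((da, -db))
    rhs = np.array(locator_b) - np.array(locator_a)
    t, _ = np.linalg.solve(A, rhs)
    return np.array(locator_a) + t * da

# Example: locators 2 m apart, tag actually at (1.0, 3.0).
print(position_from_doa((0.0, 0.0), np.arctan2(3.0, 1.0),
                        (2.0, 0.0), np.arctan2(3.0, -1.0)))  # ~[1. 3.]
```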
The position of the first to fourth objects 38, 39, 40, 41 may be determined using a separate camera system.
The post-processing module 35 is a processing system, possibly having an associated user interface (UI) 44 which may be used for example by an engineer or mixer to monitor, modify and/or control any aspect of the captured video and/or audio. Embodiments herein also enable provision and editing of control data for association with captured video data to enable one or more occluded regions to be represented, as will be explained later on.
As shown in Figure 4, the post-processing module 35 receives as input from the multi-camera system 31 spatial video data (and possibly audio data) and positioning data, through a signal line 42. Alternatively, the positioning data may be received from a HAIP locator. The post-processing module 35 may also receive as input from one or more of the first to fourth objects 38, 39, 40, 41 audio data and positioning data from respective positioning tags through separate signal lines. The post-processing module 35 generates and stores the VR video and audio data for output to a user device, such as the VR system 1 shown in Figures 1 and 2, via a signal line 47.
The input audio data may be multichannel audio in loudspeaker format, e.g. stereo signals, 4.0 signals, 5.1 signals, Dolby Atmos (RTM) signals or the like. Instead of loudspeaker format audio, the input may be in the multi microphone signal format, such as the raw eight signal input from the Nokia OZO (RTM) VR camera, if used for the multi-camera system 31. The microphone signals can then be rendered to loudspeaker or binaural format for playback.
Associated with the post-processing module 35 is a streaming system 43, for example a streaming server. The streaming system 43 may be part of, or an entirely separate system from, the post-processing module 35. Signal line 45 indicates an input received over the network 15 from the VR system 1. The VR system 1 indicates through such signalling the data to be streamed dependent on position and/or orientation of the VR display device 20. It will be appreciated that the video data captured by the multi-camera system 31 may represent objects positioned at different respective distances from the multi-camera system. For example, the first, second and third objects 38, 39, 40 are located at different respective distances d1, d2, d3 from the multi-camera system 31.
However, the fourth object 41 is located behind the first object 38 and hence is occluded from the multi-camera system 31. Therefore, the captured video data will not include any representation of the fourth object 41. The captured video data may subsequently be processed so that the rendered video data, when output to the VR display device 20, simulates the captured content at the respective depth planes. That is, when processed into stereoscopic video data (with slightly differing images being provided to the respective screens of the VR display device 20) the first to third objects 38, 39, 40 will appear to be at their respective distances d1, d2, d3 from the user's perspective. This is illustrated graphically in Figure 5, which shows a single frame 50 of panoramic content based on the Figure 4 scenario 34.
The captured video data is referred to hereafter as "foreground video content." In providing the foreground video content to one or more users, the post-processing module 35 is arranged to generate first and second data sets respectively comprising the colour information of the foreground video content, and associated depth information indicative of the respective depths of pixels or pixel regions of said content. Using this information, the VR display system 1 can render and display the foreground video content in 3D.
For example, the first data set may comprise RGB video data. The second data set may comprise depth information D in any conventional format. The depth information D may be generated using any known method, for example using a LiDAR sensor to generate a two-dimensional depth map. Alternatively, a depth map may be generated using a stereo-pair of images.
In the shown example, the depth information D may include data representative of d1, d2, d3 shown in Figure 4, as well as other depth information for other pixels of the foreground video content.
The first and second data sets RGB and D are provided in a form suitable for streaming on a frame-by-frame basis by the streaming system 43. For example, the first and second data sets may be termed RGB-D data sets, a term used in some known systems. The first and second data sets may be streamed simultaneously.
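Purely as a non-limiting sketch, the pairing of the colour data set and the depth data set on a frame-by-frame basis might be modelled as follows. The class and function names are illustrative assumptions and do not reflect any particular transport or compression format used by the streaming system 43.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDFrame:
    """One frame of foreground video content: colour plus per-pixel depth."""
    frame_index: int
    rgb: np.ndarray    # H x W x 3, 8-bit colour
    depth: np.ndarray  # H x W, depth values (e.g. metres or quantised units)

def make_stream(rgb_frames, depth_frames):
    """Pair the colour sub-stream with the depth sub-stream frame by frame."""
    for i, (rgb, depth) in enumerate(zip(rgb_frames, depth_frames)):
        assert rgb.shape[:2] == depth.shape, "depth map must match colour resolution"
        yield RGBDFrame(frame_index=i, rgb=rgb, depth=depth)
```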
The foreground video content will not include any representation of the fourth object 41. During consumption of the foreground video content at the VR display system 1, if a user moves to some degree, for example by moving their head to one side, the occluded region behind the foreground video content may be revealed. Without any video data for the revealed region, a space or invalid pixel data will appear unless further processing is performed. Figure 6, for example, shows a subsequent frame seen by a user wearing the VR headset 20 if they move their head leftwards. Three regions 52, 53, 54 are revealed, which regions are not represented in the foreground video content.
For a given frame, one or more regions not represented in the foreground video content is or are referred to as background regions.
Embodiments herein provide methods and systems for generating one or more background regions without the need to separately transmit multiple layers, including layers for the background behind the foreground video content, which would require additional decoding and rendering processing at playback time.
Rather, embodiments comprise transmitting control data with the foreground video content data. The control data may be metadata. The control data may comprise a relatively small amount of additional or augmenting data, requiring far less bandwidth than would be required for transmitting multiple layers.
For example, the control data may be associated with each frame of foreground video content data. In some embodiments, the control data may be provided at the post-processing module 35 of the content provider 30 and streamed by means of the streaming system 43 to the VR display system 1. At the VR display system 1, the received control data may be used to create background video content data corresponding to the background regions. The foreground and background regions may then be rendered and combined for output to the VR display device 20.
The control data may be authored, and subsequently edited prior to transmitting the foreground video content. The VR display system 1 receiving the control data may then dynamically generate the background regions to fill-in the occluded regions.
Referring to Figure 7, components of the post-processing module 35 are shown. The post-processing module 35 may comprise a controller 61, RAM 63, a memory 65, and, optionally, hardware keys 67 and a display 69. The post-processing module 35 may comprise a network interface 71, which may be a data port for connecting the system to the network 15 or the streaming module 43.
The network interface 71 may additionally or alternatively comprise a radiofrequency wireless interface for transmitting and/or receiving the post-processed data using a wireless communications protocol, e.g. WiFi or Bluetooth. An antenna 73 may be provided for this purpose.
The controller 61 may receive captured RGB video data from the multi-camera system 31 which represents the foreground video data for successive frames. The controller may also receive depth information, e.g. a depth map. A depth map may be associated with successive frames of the foreground video data.
One or more control signals may be provided from the controller 61 to the multi-camera system 31. The memory 65 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 65 stores, amongst other things, an operating system 74 and may store software applications 75. The RAM 63 is used by the controller 61 for the temporary storage of data. The operating system 74 may contain code which, when executed by the controller 61 in conjunction with the RAM 63, controls operation of each of the hardware components of the post-processing system 35.
The controller 61 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors. In some embodiments, the post-processing system 35 may also be associated with external software applications not stored on the camera. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The post-processing system 35 may be in communication with the remote server device in order to utilize the software application stored there.
Figure 8 is a flow diagram indicating steps performed by one or more software applications in accordance with embodiments. For example, Figure 8 may represent steps performed by the software application 75 in Figure 7, which is a post-processing application. It should be appreciated that certain ones of the steps shown in Figure 8 can be re-ordered. The numbering of steps is not necessarily indicative of their required order of processing.
In a first step 8.1, foreground video content data, which may be in the form of RGB data and corresponding depth data D for successive frames, is provided, for example received from the multi-camera system 31. In a subsequent step 8.2, control data for the foreground video content data is provided, for example received from user input or from another data source. Steps 8.1 and 8.2 may be performed in reverse order or at the same time. In a subsequent step 8.3, the foreground video content data and the control data are sent to one or more user-end systems, for example the VR display system 1. This may be by means of streaming, which may be performed for both data sets simultaneously.
The step 8.2 of providing the control data may comprise receiving control data from, or generated using input from, the user interface 44. For example, the control data may be authored by a director or editor using the user interface 44.
Figure 9 is a flow diagram indicating steps performed by one or more software applications for generating the control data provided in step 8.2, in accordance with embodiments. A first step 9.1 comprises receiving foreground copy control data. A subsequent step 9.2 comprises receiving modification control data. A subsequent step 9.3 comprises receiving background render control data. A subsequent step 9.4 comprises generating the control data. Certain steps may be omitted and/or re-ordered. For example, steps 9.1, 9.2 and 9.3 may be performed in any order. For example, step 9.2 may be considered optional.
The control data, for example each of the foreground copy control data, modification control data and background render control data, are decodable by the VR display system 1 to update one or more background regions on a frame-by-frame basis.
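By way of non-limiting illustration only, a per-frame control data record carrying these three kinds of command might be modelled as follows. All class and field names are assumptions made for this sketch; the embodiments do not prescribe any particular encoding, and the regions are not limited to rectangles.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CopyCommand:
    """Copy a region of the current (or an earlier) foreground frame to the cache."""
    source_frame: int                      # frame to copy from (may be a previous frame)
    src_rect: Tuple[int, int, int, int]    # x, y, w, h; arbitrary polygons could also be used
    cache_slot: int                        # where to store the copy in the background cache

@dataclass
class ModifyCommand:
    """Scale, move and/or re-depth a cached region."""
    cache_slot: int
    src_rect: Tuple[int, int, int, int]
    dst_rect: Tuple[int, int, int, int]
    depth_offset: float = 0.0

@dataclass
class RenderCommand:
    """Fill an occluded background region from a cached region."""
    cache_slot: int
    dst_rect: Tuple[int, int, int, int]

@dataclass
class FrameControlData:
    """Control data accompanying one frame of foreground video content."""
    frame_index: int
    copy_cmds: List[CopyCommand] = field(default_factory=list)
    modify_cmds: List[ModifyCommand] = field(default_factory=list)
    render_cmds: List[RenderCommand] = field(default_factory=list)
```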
In overview, the foreground copy control data may indicate one or more regions of the foreground video content, e.g. the RGB video content and depth information D, to copy into a first memory of the VR display system 1, such as a local cache. The foreground copy control data may comprise one or more copy commands, and multiple regions/commands may be issued per frame. The foreground copy control data may copy different shaped regions of the foreground video content, for example one or more of rectangles, triangles, squares, circles and arbitrary polygons. The foreground copy control data may refer to frames other than the current frame.
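As a further non-limiting sketch, copying a rectangular region of the colour and depth data into a cache slot might look as follows. The dictionary-based cache and the function name are assumptions made for readability; an implementation would typically hold this data in GPU memory, and non-rectangular polygons could be handled in the same way with an additional mask.

```python
import numpy as np

def copy_region_to_cache(rgb, depth, rect, cache, slot):
    """Copy a rectangular region of the foreground frame (colour plus depth)
    into the background cache under the given slot identifier."""
    x, y, w, h = rect
    cache[slot] = {
        "rgb": rgb[y:y + h, x:x + w].copy(),
        "depth": depth[y:y + h, x:x + w].copy(),
    }

# Example usage on a dummy 1080p RGB-D frame.
rgb = np.zeros((1080, 1920, 3), dtype=np.uint8)
depth = np.full((1080, 1920), 2.5, dtype=np.float32)
cache = {}
copy_region_to_cache(rgb, depth, (600, 300, 256, 512), cache, slot=0)
```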
The modification control data may indicate one or more modifications to be applied by the VR display system 1 to one or more of the foreground regions copied to the first memory of the VR display system 1. For example, the modification control data may comprise one or more commands for moving and/or reshaping the copied foreground regions, and/or changing the depth information to account for 3D motion of background objects. Example modification control data commands include, but are not limited to: modifying a source rectangle (x_src, y_src, w_src, h_src) into a destination rectangle (x_dst, y_dst, w_dst, h_dst) with an optional constant depth offset d, which enables both scaling and movement in 3D; and modifying a source triangle (x0_src, y0_src, x1_src, y1_src, x2_src, y2_src) into a destination triangle (x0_dst, y0_dst, x1_dst, y1_dst, x2_dst, y2_dst) with optional per-vertex depth offsets d0, d1, d2, which enables motion of more refined regions as well as approximation of rotating 3D objects.
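The rectangle-to-rectangle command above could, for example, be realised by resampling the cached region to the destination size and offsetting its depth, as in the following non-limiting sketch. The nearest-neighbour resize and the function names are illustrative assumptions; any resampling scheme could be used.

```python
import numpy as np

def nearest_resize(img, new_h, new_w):
    """Nearest-neighbour resize, sufficient for this illustration."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def apply_rect_modification(region, dst_w, dst_h, depth_offset=0.0):
    """Scale a cached colour+depth region to the destination rectangle size
    and apply a constant depth offset, as a rectangle-to-rectangle
    modification command might require."""
    return {
        "rgb": nearest_resize(region["rgb"], dst_h, dst_w),
        "depth": nearest_resize(region["depth"], dst_h, dst_w) + depth_offset,
    }
```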
The background render control data may indicate one or more ways in which the VR display system 1 may fill-in or combine the copied and, where applicable, modified foreground regions into the background regions. In other words, any revealed background regions, e.g. due to user movement, may be filled with parts of the copied and modified foreground regions according to the background render control data.
This enables performance optimisations by limiting the background rendering workload to only the content necessary for each frame.
The control data is preferably generated in a content authoring phase, e.g. using the post-processing module 35, so that no run-time optimization logic is needed at the VR display system 1. There may be multiple background rendering commands for each frame, each using, for example, one or more of rectangles, triangles, squares, circles and arbitrary polygons stored in the VR display system cache. The rendering of said regions is performed in a similar manner as for the foreground layer, effectively projecting the RGB data and the depth information D into 3D geometry.
For completeness, Figure 10 shows the post-processing module 35, the user interface 44 and the separate streams of RGB data 75, depth data 76 and control data 77 being transmitted to the VR display system 1 via the network 15 shown in Figure 1.
Referring to Figure 11, components of the media player 10 at the VR display system 1 are shown. The media player 10 may comprise a controller 81, RAM 83, a memory 85, and, optionally, hardware keys 87 and a display 89. The media player 10 may comprise a network interface 91, which may be a data port for connecting the system to the network 15 or the streaming module 43.
The network interface 91 may additionally or alternatively comprise a radiofrequency wireless interface for transmitting and/or receiving the post-processed data using a wireless communications protocol, e.g. WiFi or Bluetooth. An antenna 93 may be provided for this purpose.
The controller 81 may receive via the network interface 91 the separate streams of RGB data 75, depth data 76 and control data 77 for successive frames. The controller may also receive the depth information, e.g. a depth map.
The controller 81 may transmit and receive information with the VR headset 20. The memory 85 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 85 stores, amongst other things, an operating system 94 and may store software applications 95. The RAM 83 is used by the controller 81 for the temporary storage of data. The operating system 94 may contain code which, when executed by the controller 81 in conjunction with the RAM 83, controls operation of each of the hardware components of the media player 10.
The controller 81 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors. In some embodiments, the media player 10 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The media player 10 may be in communication with the remote server device in order to utilize the software application stored there.
Figure 12 is a flow diagram indicating steps performed by one or more software applications in accordance with embodiments. For example, Figure 12 may represent steps performed by the software application 95 in Figure 11, which is an application for decoding the above-mentioned control data 77 and for providing background data for filling-in the missing parts in accordance with the control data. It should be appreciated that certain ones of the steps shown in Figure 12 can be re-ordered and/or omitted. The numbering of steps is not necessarily indicative of their required order of processing.
In a first step 12.1, the foreground video content data is received. In another step 12.2, the received foreground video content data is stored in a first memory.
In another step 12.3, the control data is received. In another step 12.4, a copy of one or more regions of the foreground data, identified in the control data, is stored in a second memory. In physical terms, the second memory may be a separate memory apparatus or the same memory apparatus, with appropriate partitioning or segmentation. The memory may comprise any suitable form of storage apparatus, for example a hard disk drive (HDD) or a solid state drive (SSD). The memory may be cache memory. The one or more copied foreground regions can be of any shape and size, comprising any number of pixels, as determined by the control data. Also, the one or more copied foreground regions need not be copied from the current frame; the control data may refer to one or more previous frames, for example. In another step 12.5, background video content is provided for one or more regions not represented in the foreground data, using at least part of the foreground region(s) which have been stored in the second memory.
In another step 12.6, the foreground and background video content data may be rendered. In another step 12.7, the rendered foreground and background video content may be output to a user viewing device, or buffered for such output.
Figure 13 is a flow diagram indicating additional steps that may be performed by one or more software applications, for example the software application 95 in Figure 11.
Again, certain ones of the steps shown in Figure 13 can be re-ordered and/or omitted. The numbering of steps is not necessarily indicative of their required order of processing.
Steps 13.1 - 13.4 correspond with steps 12.1 - 12.4 respectively.
In another step 13.5, which may be considered optional, the one or more copied foreground regions may be modified in accordance with the control data.
In another step 13.6, one or more background regions not stored in the first memory is or are identified. Put another way, one or more occluded regions for which there is no foreground data is or are identified.
In another step 13.7, background video content data is provided for one or more regions not represented in the foreground data, using at least part of the foreground region data stored in the second memory. In another step 13.8, the foreground and background video content data is rendered based on the control data.
In another step 13.9, the rendered foreground and background video content may be output to a user viewing device, or buffered for such output.
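To tie these steps together, the following non-limiting sketch outlines a single player-side iteration covering steps 13.1 to 13.5, reusing the illustrative RGBDFrame, command dataclasses and helper functions sketched earlier; the identification of newly-visible regions and the final composition are sketched separately further below. All names are assumptions made for illustration.

```python
def process_frame(frame, control, foreground_buffer, background_cache):
    """One player-side iteration for steps 13.1 to 13.5.

    frame             -- the received RGBDFrame (first data stream)
    control           -- the FrameControlData for this frame (second data stream)
    foreground_buffer -- first memory, holding the foreground video content
    background_cache  -- second memory, persistent across frames
    """
    # 13.1 / 13.2: receive the foreground content and store it in the first memory.
    foreground_buffer[frame.frame_index] = frame

    # 13.3 / 13.4: decode the copy commands and store the selected regions in the
    # second memory; the source may be an earlier frame, not just the current one.
    for cmd in control.copy_cmds:
        src = foreground_buffer[cmd.source_frame]
        copy_region_to_cache(src.rgb, src.depth, cmd.src_rect,
                             background_cache, cmd.cache_slot)

    # 13.5: apply any modifications to the cached regions.
    for cmd in control.modify_cmds:
        _, _, w, h = cmd.dst_rect
        background_cache[cmd.cache_slot] = apply_rect_modification(
            background_cache[cmd.cache_slot], w, h, cmd.depth_offset)
```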
The received foreground video content data may comprise sub-streams of RGB colour data and depth information D. The foreground video content data may represent a single layer of a 360° panorama. The foreground video content data may be backwards-compatible. The depth information D may be a depth map generated using any known means, for example by use of a LiDAR sensor or processing on stereo-pairs of images.
For completeness, Figure 14 is a functional block diagram which illustrates the process performed at the media player 10 in more detail. Solid arrows refer to the transfer of graphical (RGB and depth) data and broken arrows refer to the transfer of control data.
Element 100 represents the video stream received from the content provider 30 over the network 15. This may comprise the control data as separate stream. Element 101 represents the received foreground video content (RGB and depth information D). The foreground video content may be transferred to a foreground rendering module 103 which is used to render said foreground video content.
Element 102 represents an update background cache module 102, which responds to foreground copy commands extracted from the control data. The update background cache module 102 decodes the foreground copy commands to identify which foreground regions are to be copied, and then fetches these from element 101 for storage in a background cache module (element 104). The background cache module 104 may be a persistent colour plus depth buffer containing data that is managed over all frames according to the control data. The background cache module 104 may be allocated as part of a Graphics Processing Unit (GPU) memory. At this stage, the copied foreground regions may be termed background regions or background video content. The foreground copy control commands may copy different shaped regions of the foreground video content, for example one or more of rectangles, triangles, squares, circles and arbitrary polygons. The foreground copy control data may refer to frames other than the current frame.
Element 105 represents a modification module 105 where one or more of scaling, transforming, animating and modifying the depth of the background video content in the background cache module 104 may be performed, in accordance with background move commands decoded from the video stream 100. For example, the modification control data may comprise one or more commands for moving and/or reshaping the background video content, and/or changing the depth information to account for 3D motion. For this reason, the background cache module 104 may employ double buffering for efficient implementation of the modification control data commands.
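As a non-limiting sketch of why double buffering helps here: the modification commands for a frame can read from an unmodified "front" copy of the cache while writing their results into a "back" copy, so commands within the same frame cannot corrupt source data that another command still needs. The class below reuses the illustrative helper sketched earlier; its structure is an assumption, not the described implementation.

```python
class DoubleBufferedCache:
    """Minimal double-buffered background cache for applying modification commands."""

    def __init__(self):
        self.front = {}   # read side for the current frame's commands
        self.back = {}    # write side for the current frame's commands

    def apply_modifications(self, modify_cmds):
        self.back = dict(self.front)          # start from the previous frame's state
        for cmd in modify_cmds:
            _, _, w, h = cmd.dst_rect
            src = self.front[cmd.cache_slot]  # always read the unmodified copy
            self.back[cmd.cache_slot] = apply_rect_modification(src, w, h, cmd.depth_offset)
        self.front, self.back = self.back, self.front  # swap at the end of the frame
```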
Example modification control data commands include, but are not limited to: modifying a source rectangle (x_src, y_src, w_src, h_src) into a destination rectangle (x_dst, y_dst, w_dst, h_dst) with an optional constant depth offset d, which enables both scaling and movement in 3D; and modifying a source triangle (x0_src, y0_src, x1_src, y1_src, x2_src, y2_src) into a destination triangle (x0_dst, y0_dst, x1_dst, y1_dst, x2_dst, y2_dst) with optional per-vertex depth offsets d0, d1, d2, which enables motion of more refined regions as well as approximation of rotating 3D objects.
The modified background video content may be copied to a background rendering module 106. The background rendering module 106 is configured to receive another set of control data, namely one or more background region render commands, for filling-in the occluded foreground regions identified in the foreground rendering module 103. The one or more background region render commands may dictate which background video content stored in the background cache 104 is inserted into the occluded foreground regions. The selected background video content is rendered at the appropriate pixel positions. This enables performance optimizations by limiting the background rendering workload to only the necessary content in each frame. The control data authoring requires no runtime optimization logic. Rendering comprises projecting the colour and depth information D of the selected regions into 3D geometry.
Element 107 represents an output framebuffer which stores the combined foreground and background video content. The output framebuffer 107 may store data representing a plurality of frames for output to the VR headset 20 responsive to positional signals received therefrom.
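As a final non-limiting sketch, combining the rendered foreground with the generated background content into an output frame can, at its simplest, amount to filling the pixels revealed by user movement from the background data while keeping the foreground everywhere else. The mask-based approach and names below are assumptions for illustration; an implementation would typically perform the equivalent operation on the GPU when writing into the output framebuffer 107.

```python
import numpy as np

def compose_frame(fg_rgb, fg_depth, bg_rgb, bg_depth, hole_mask):
    """Combine foreground and generated background content into one output frame.

    hole_mask is a boolean H x W array marking pixels revealed by user movement
    for which there is no valid foreground data; those pixels are taken from the
    background content, while all other pixels keep the foreground values.
    """
    out_rgb = np.where(hole_mask[..., None], bg_rgb, fg_rgb)
    out_depth = np.where(hole_mask, bg_depth, fg_depth)
    return out_rgb, out_depth
```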
Figures 15a - 15e are graphical examples which are useful for explaining the Figure 12 method. Figure 15a shows the first object 38 shown in Figure 5. Figure 15b shows the occluded region 110 not present in the foreground video content. Figure 15c shows the potential result of a user moving their head towards the left-hand side, revealing the occluded region.
Figure 15d, which results from methods disclosed herein, shows a copied foreground region 112 which includes the first object 38. This is because the fourth object 41 is similar in shape to the first object 38, other than scale and/or depth. The grid pattern indicates that the copied foreground region 112 may be considered an arrangement of squares (or other geometric shapes) which would allow more complex foreground regions to be copied. The foreground region 112 in this case is scaled and has corresponding depth information D added to reflect that the author wishes to represent the fourth object 41 in background video content. The foreground region 112 to be copied, the modification to be performed, and the method of rendering are scripted in the control data and transmitted with the foreground data to the media player 10 for rendering at said media player in accordance with steps outlined above.
Referring to Figure 15e, if the user moves their head towards the left-hand side, the positional data from the VR headset 20 is received and the generated background video content becomes visible.
It will be appreciated from the above that, when a user moves their head during consumption of the VR content, one or more occluded or hidden regions not visible to the camera may be revealed to the user. This is achieved in a computationally efficient way using received control data, which also requires relatively low bandwidth. The control data provides a controllable and editable way of representing the background content whilst reducing streaming bandwidth and rendering workload. No additional video data is needed for the background content and the rendering complexity for the background content can be optimized at the authoring time. The control data may be edited independently of the foreground video content data, and so the background video content can be changed simply by changing the sequence of background control commands. This can enable savings in post-production where depth extraction and video compression are time-consuming operations.
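Purely as a non-limiting illustration of how such scripting might look using the illustrative data structures sketched earlier, the control data for the frame of Figures 15a - 15e could resemble the following. Every coordinate, slot number and depth value below is invented for the example and is not taken from the embodiments.

```python
# Hypothetical control data for the illustrated frame: copy the region around
# the first object 38, scale it down and push it deeper so that it stands in
# for the occluded fourth object 41, and render it only into the region
# revealed when the user moves their head towards the left-hand side.
example_control = FrameControlData(
    frame_index=120,
    copy_cmds=[CopyCommand(source_frame=120,
                           src_rect=(820, 310, 200, 420),   # around object 38
                           cache_slot=0)],
    modify_cmds=[ModifyCommand(cache_slot=0,
                               src_rect=(0, 0, 200, 420),
                               dst_rect=(0, 0, 120, 250),    # smaller: appears further away
                               depth_offset=1.8)],           # push behind object 38
    render_cmds=[RenderCommand(cache_slot=0,
                               dst_rect=(900, 360, 120, 250))],
)
```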
The received video stream may be backwards-compatible, for example for use with legacy applications and applications not having head tracking.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims
1. A method comprising:
receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
storing the foreground video content in a first memory;
receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions;
storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
2. The method of claim 1, wherein the rendered foreground and background video content is stored in an output buffer for outputting to a user viewing device as a buffered sequence of frames.
3. The method of claim 2, wherein the rendering of the background video content is based on background rendering instructions provided in the control data.
4. The method of claim 2 or claim 3, further comprising receiving positional data indicative of user movement and wherein the rendering of the background video content comprises identifying one or more newly-visible regions based on the positional data and rendering the background video content corresponding to the one or more newly-visible regions.
5. The method of claim 4, wherein the positional data is received from a user viewing device and wherein the newly-visible regions correspond to regions occluded from the user's viewing perspective.
6. The method of claim 4 or claim 5, wherein identifying the one or more newly-visible regions comprises identifying regions having no video content.
7. The method of any preceding claim, wherein the one or more selected regions comprise one or more predetermined polygons.
8. The method of any preceding claim, further comprising applying or associating a modification to the one or more selected regions.
9. The method of claim 8, wherein the modification comprises one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions.
10. The method of claim 8 or claim 9, wherein the control data further indicates the modification to be applied or associated to each of the one or more regions.
11. The method of any preceding claim, wherein the first and second data streams are received simultaneously.
12. The method of any preceding claim, wherein the second data stream comprises control data associated with each frame of foreground video content.
13. The method of any preceding claim, wherein the first data stream comprises a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
14. The method of any preceding claim, wherein the second memory is a persistent colour plus depth buffer that is managed over multiple frames according to the control data.
15. The method of any preceding claim, wherein the second memory uses double buffering.
16. The method of any preceding claim, performed at a media playing device associated with a user viewing device.
17. A method comprising:
providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
providing a second data stream representing control data associated with the first data stream, the control data including:
one or more instructions for copying one or more selected regions of the provided foreground video content; and
one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected one or more regions of the foreground video content which are instructed to be copied.
18. The method of claim 17, wherein the one or more instructions for providing the background video content comprise rendering instructions for the background video content for output to a user viewing device.
19. The method of claim 18, wherein the rendering instructions further indicate how the background video content is to be rendered based on positional data indicative of user movement which reveals one or more newly visible regions.
20. The method of any of claims 17 to 19, wherein the one or more instructions for copying one or more selected regions of the provided foreground video content comprises identifying one or more predetermined polygons.
21. The method of any of claims 17 to 20, wherein the control data further comprises one or more instructions for applying or associating a modification to the one or more selected regions of the foreground video content.
22. The method of claim 21, wherein the modification instructions comprise one or more of scaling, transforming, animating and modifying the depth of the one or more selected regions of the foreground video content.
23. The method of any of claims 17 to 22, wherein the first and second data streams are transmitted simultaneously.
24. The method of any of claims 17 to 23, wherein the second data stream comprises control data associated with each frame of video content.
25. The method of any of claims 17 to 24, wherein the first data stream comprises a first sub-stream representing colour foreground data and a second sub-stream representing depth information associated with the colour foreground data.
26. The method of any of claims 17 to 25, performed at a content provider system configured to send the first and second data streams to one or more remote media playing devices.
27. A computer program comprising instructions that when executed by a computer control it to perform the method of any preceding claim.
28. An apparatus configured to perform the method steps of any of claims 1 to 26.
29. A non-transitory computer-readable medium having stored thereon computer- readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
receiving a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
storing the foreground video content in a first memory;
receiving a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions;
storing a copy of the one or more selected regions in a second memory; and providing background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
30. A non-transitory computer-readable medium having stored thereon computer- readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
providing a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
providing a second data stream representing control data associated with the first data stream, the control data including:
one or more instructions for copying one or more selected regions of the provided foreground video content; and
one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
31. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
to store the foreground video content in a first memory;
to receive a second data stream representing control data associated with the first data stream, the control data indicating one or more selected regions of the received video content for providing one or more background regions;
to store a copy of the one or more selected regions in a second memory; and to provide background video content for one or more background regions not represented in the foreground video content using at least part of the selected video content stored in the second memory.
32. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
to provide a first data stream representing a plurality of frames of three-dimensional foreground video content captured from a first location;
to provide a second data stream representing control data associated with the first data stream, the control data including:
one or more instructions for copying one or more selected regions of the provided foreground video content; and
one or more instructions for providing background video content for one or more background regions not represented in the first data stream using at least part of the selected regions of the foreground video content which are instructed to be copied.
PCT/FI2018/050435 2017-08-14 2018-06-11 Three-dimensional video processing WO2019034804A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1712975.0A GB2566006A (en) 2017-08-14 2017-08-14 Three-dimensional video processing
GB1712975.0 2017-08-14

Publications (2)

Publication Number Publication Date
WO2019034804A2 true WO2019034804A2 (en) 2019-02-21
WO2019034804A3 WO2019034804A3 (en) 2019-04-04

Family

ID=59896094

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050435 WO2019034804A2 (en) 2017-08-14 2018-06-11 Three-dimensional video processing

Country Status (2)

Country Link
GB (1) GB2566006A (en)
WO (1) WO2019034804A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110753228A (en) * 2019-10-24 2020-02-04 山东浪潮人工智能研究院有限公司 Garage monitoring video compression method and system based on Yolov1 target detection algorithm
US11430178B2 (en) 2017-08-08 2022-08-30 Nokia Technologies Oy Three-dimensional video processing
WO2023246752A1 (en) * 2022-06-24 2023-12-28 华为技术有限公司 Communication method and communication apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7675540B2 (en) * 2003-08-19 2010-03-09 Kddi Corporation Concealed regions complementing system of free viewpoint video images
WO2006078250A1 (en) * 2005-01-21 2006-07-27 In-Three, Inc. Method of hidden surface reconstruction for creating accurate three-dimensional images converted from two-dimensional images
KR101367284B1 (en) * 2008-01-28 2014-02-26 삼성전자주식회사 Method and apparatus of inpainting image according to change of viewpoint
EP2596475B1 (en) * 2010-07-19 2019-01-16 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Filling disocclusions in a virtual view
TR201819457T4 (en) * 2011-06-22 2019-01-21 Koninklijke Philips Nv Method and device for generating a signal for a presentation screen.

Also Published As

Publication number Publication date
GB201712975D0 (en) 2017-09-27
WO2019034804A3 (en) 2019-04-04
GB2566006A (en) 2019-03-06

Similar Documents

Publication Publication Date Title
JP6643357B2 (en) Full spherical capture method
US10681276B2 (en) Virtual reality video processing to compensate for movement of a camera during capture
RU2665872C2 (en) Stereo image viewing
WO2020210213A1 (en) Multiuser asymmetric immersive teleconferencing
US11189057B2 (en) Provision of virtual reality content
JP2017532847A (en) 3D recording and playback
US20190130644A1 (en) Provision of Virtual Reality Content
CN110663067B (en) Method and system for generating virtualized projections of customized views of real world scenes for inclusion in virtual reality media content
US10437055B2 (en) Master device, slave device, and control method therefor
US10732706B2 (en) Provision of virtual reality content
WO2019034804A2 (en) Three-dimensional video processing
EP3665656B1 (en) Three-dimensional video processing
JP7457525B2 (en) Receiving device, content transmission system, and program
US20190295324A1 (en) Optimized content sharing interaction using a mixed reality environment
JP2018033107A (en) Video distribution device and distribution method
US11348252B1 (en) Method and apparatus for supporting augmented and/or virtual reality playback using tracked objects
JP6091850B2 (en) Telecommunications apparatus and telecommunications method
US20210160561A1 (en) Image arrangement determination apparatus, display controlling apparatus, image arrangement determination method, display controlling method, and program
US11287658B2 (en) Picture processing device, picture distribution system, and picture processing method
EP3623908A1 (en) A system for controlling audio-capable connected devices in mixed reality environments
JP7354186B2 (en) Display control device, display control method, and display control program
WO2022220306A1 (en) Video display system, information processing device, information processing method, and program
WO2022224964A1 (en) Information processing device and information processing method
US20240022688A1 (en) Multiuser teleconferencing with spotlight feature

Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18845918; Country of ref document: EP; Kind code of ref document: A2)
NENP: Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 18845918; Country of ref document: EP; Kind code of ref document: A2)