GB2572996A - Processing video patches for three-dimensional content - Google Patents

Processing video patches for three-dimensional content

Info

Publication number
GB2572996A
GB2572996A GB1806370.1A GB201806370A GB2572996A GB 2572996 A GB2572996 A GB 2572996A GB 201806370 A GB201806370 A GB 201806370A GB 2572996 A GB2572996 A GB 2572996A
Authority
GB
United Kingdom
Prior art keywords
patch
patches
visible
client device
visibility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1806370.1A
Other versions
GB201806370D0 (en)
Inventor
Pesonen Mika
Hannuksela Miska
Roimela Kimmo
Pystynen Johannes
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1806370.1A priority Critical patent/GB2572996A/en
Publication of GB201806370D0 publication Critical patent/GB201806370D0/en
Priority to EP19789175.7A priority patent/EP3782368A4/en
Priority to PCT/FI2019/050297 priority patent/WO2019202207A1/en
Publication of GB2572996A publication Critical patent/GB2572996A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/40Hidden part removal
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/172Processing image signals image signals comprising non-image signal components, e.g. headers or format information
    • H04N13/178Metadata, e.g. disparity information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/275Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30Image reproducers
    • H04N13/388Volumetric displays, i.e. systems where the image is built up from picture elements distributed through a volume
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/262Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
    • H04N21/26258Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/266Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
    • H04N21/2662Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4621Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4825End-user interface for program selection using a list of items to be played back in a given order, e.g. playlists
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/85406Content authoring involving a specific file format, e.g. MP4 format
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object

Abstract

A plurality of patches (61, 62) (e.g. volumetric pixels, points of a point cloud or mesh) representing part of a volumetric scene are provided, along with patch visibility information for each patch indicative of directions from which a forward surface of the patch is visible. One or more viewing positions (60) associated with a client device (V) are provided, and the patch(es) are processed dependent on whether the patch visibility information indicates that the forward surface of the patch(es) is visible from the viewing position(s). If rear-facing (e.g. patch 61), a patch may not be rendered, or may not be transmitted to a client. Providing the patches may comprise estimating surface normals from points of a volumetric scene, and grouping together points having similar surface normals to provide a given patch. Metadata (e.g. SEI messages, manifest or descriptor of media content) indicative of the patch visibility may be transmitted to the client device.

Description

Processing Video Patches for Three-Dimensional Content
Field
Embodiments relate to an apparatus and method for processing video patches for three-dimensional content.
Background
Three-dimensional (3D) video content comprises data which may be rendered for output to a display to provide an image or sequence of images representing one or more three-dimensional objects or scenes for user consumption.
For example, virtual reality (VR) is a rapidly developing area of technology in which video content, sometimes accompanied by audio, is provided to a user device, such as a user headset. As is known, the user device may be provided with a live or stored feed from a content source, the feed representing a virtual space for immersive output through the user device. Position and/or movement of the user device can enhance the immersive experience. Currently, most virtual reality user devices use so-called three degrees of freedom (3DoF), which means that head movement in the yaw, pitch and roll axes is measured and determines what the user sees. An enhancement is a six degrees-of-freedom (6DoF) virtual reality system, where the user is able to move freely in Euclidean space and rotate their head in the yaw, pitch and roll axes. Six degrees-of-freedom virtual reality systems enable the provision and consumption of volumetric content. Volumetric content comprises data representing spaces and/or objects in three dimensions from all angles, enabling the user to move fully around the spaces and/or objects to view them from any angle.
For the avoidance of doubt, references to virtual reality (VR) are also intended to cover related technologies such as augmented reality (AR) and mixed reality (MR).
Summary
An embodiment provides an apparatus comprising: means for providing a plurality of patches representing at least part of a volumetric scene; means for providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible; means for providing one or more viewing positions associated with a client device; means for processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Providing may comprise receiving, transmitting and/or generating. The apparatus may be a server of data representing part of the volumetric scene. Alternatively, the apparatus may be a client device for receiving the data from a server.
The means for providing the plurality of patches may be configured to estimate surface normals from points of the volumetric scene, and group together points having similar surface normals to provide a given patch.
The processing means may be configured to transmit one or more visible patches to the client device and not to transmit one or more non-visible patches to the client device.
The processing means may be configured to transmit identifiers of one or more visible patches to the client device and not to transmit identifiers of non-visible patches to the client device, the identifiers being usable by the client device to retrieve the patches from a patch server.
The processing means may be configured to generate a texture atlas comprising the plurality of patches, to cull from the texture atlas one or more non-visible patches, and to transmit the texture atlas to the client device for decoding and/or rendering thereat.
The processing means may be configured to generate first and second texture atlases by means of producing, for each patch, a colour image and a depth image, to provide the colour image for a given patch to a respective part of the first texture atlas and the depth image for the given patch to the respective part of the second texture atlas, to cull the colour and depth images corresponding to non-visible patches, and to transmit the culled first and second texture atlases to the client device for decoding and/or rendering thereat.
The processing means may be configured to project each patch to a two-dimensional geometry to provide the colour image and the depth image.
The visible patches / texture atlases may be transmitted to the client device as video frames.
The apparatus may further comprise means for transmitting metadata to the client device, the metadata being indicative of the patch visibility information for patches. The metadata may be further indicative of decoding parameters and/or requirements of the patches.
The metadata may be transmitted to the client device using one or more of:
- supplemental enhancement information (SEI) messages;
- a structure of a containing file format; and
- a manifest or descriptor of media content.
The metadata may be quantized. The metadata may indicate patch visibility information by means of a normal vector and an angle.
The patch visibility information for a patch may be determined by projecting an image of the patch onto a near plane and a far plane using depth information for the patch, and wherein the processing means may be configured to identify the patch as visible or non-visible based on which side of the near and far planes the one or more viewing positions are located.
The patch visibility information for a patch may be determined by determining a visibility cone derived from a normal cone direction vector and opening angle that includes approximately the minimum cone from which the forward surface of the patch can be seen and increasing the opening angle by a predetermined amount.
The visibility cone may be determined by increasing the opening angle by approximately 180 degrees.
The patch visibility information may be determined over a plurality of temporal frames of the volumetric scene.
The apparatus may be a server for transmitting patches to one or more client devices.
The means for providing the one or more viewing positions may further provide one or more predicted viewing positions associated with the client device, and wherein the processing means is configured to process the one or more patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches will be visible from the one or more predicted viewing positions.
The apparatus may be a client device, whereby the means for providing the plurality of patches is configured to receive the patches from a server, whereby the means for providing the patch visibility information may be configured to receive the patch visibility information from the, or a different, server, and wherein the processing means may be configured to decode and/or render one or more patches dependent on whether the received patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Another embodiment provides a method, comprising: providing a plurality of patches representing part of a volumetric scene; providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible; providing one or more viewing positions associated with a client device; and processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Preferred features of the method are defined in the appended claims.
Another embodiment provides a computer program comprising computer-readable instructions which, when executed by at least one processor, cause the at least one processor to perform the method of any preceding method definition.
Another embodiment provides a non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: providing a plurality of patches representing part of a volumetric scene; providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible; providing one or more viewing positions associated with a client device; processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Another embodiment provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to provide a plurality of patches representing part of a volumetric scene; to provide, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible; to provide one or more viewing positions associated with a client device; and to process one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Brief Description of Drawings
Embodiments will now be described, by way of non-limiting example, with reference to the drawings, in which:
FIG. 1 is an example of a capture system which may be used in order to capture video and/or audio signals for processing in accordance with various examples described herein;
FIG. 2 is a schematic diagram of a virtual reality processing apparatus in relation to one or more user devices and a communications network in accordance with various examples described herein;
FIG. 3 is a block diagram of an example pipeline comprising server and user devices in accordance with various examples described herein;
FIG. 4A is a perspective view of part of a point cloud relative to an underlying surface;
FIG. 4B is a perspective view of surface normals for respective points of the FIG. 4A point cloud, determined in accordance with various examples described herein;
FIG. 4C is a perspective view of a visibility cone, determined in accordance with various examples described herein;
FIG. 5 is a schematic representation of first and second atlases, determined in accordance with various examples described herein;
FIG. 6 is a schematic representation of first and second two-dimensional patches, relative to a view point, determined in accordance with various examples described herein;
FIG. 7 is a schematic representation indicating a culling operation performed on a patch, in accordance with various examples described herein;
FIG. 8 is a flow diagram showing processing operations of a method, in accordance with various examples described herein; and
FIG. 9 is a schematic diagram of components for implementing one or more of the FIG. 8 processing operations, in accordance with various examples described herein.
Detailed Description
In the description and drawings, like reference numerals may refer to like elements throughout.
Embodiments herein relate to the processing of video patches for three-dimensional content (hereafter “content”). The content may represent one or more objects or scenes. The content may be captured using one or more cameras, for example using a multi-camera module such as Nokia’s OZO camera, a laser scanner and/or a combination of video and dedicated depth sensors. The content may be generated using three-dimensional models, e.g. using computer-generated imagery (CGI). A combination of captured and CGI content is possible. The content may be panoramic. The content may or may not be accompanied by audio.
Embodiments may relate to coding and transport of content. Coding may comprise compression. Embodiments may also relate to decoding and rendering of content. Decoding may comprise decompression.
An example application of three-dimensional content is in virtual reality (VR). Virtual reality is a rapidly developing area of technology in which video content, sometimes accompanied by audio, is provided to a user device such as a user headset. As is known, the user device may be provided with a live or stored feed from a content source, the feed representing a virtual space for immersive output through the user device. The position of the user device can enhance the immersive experience. A change of position, i.e. movement, can also enhance the immersive experience. Currently, most virtual reality user devices use so-called three degrees of freedom (3DoF), which means that head movement in the yaw, pitch and roll axes is measured and determines what the user sees. An enhancement is a six degrees-of-freedom (6DoF) virtual reality system, where the user may move freely in Euclidean space as well as rotate their head in the yaw, pitch and roll axes. Six degrees-of-freedom virtual reality systems enable the provision and consumption of volumetric content. Volumetric content comprises data representing spaces and/or objects in three dimensions from all angles, enabling the user to move fully around the space and/or objects to view them from any angle.
Such content may be defined by data describing the geometry (e.g. shape, size, position in a three-dimensional space) and attributes such as colour, opacity and reflectance. The data may also define temporal changes in the geometry and attributes at given time instances, similar to frames in two-dimensional video.
Typical representation formats for volumetric content include triangle meshes, point clouds and voxels. Temporal information about the content may comprise individual capture instances, i.e. frames or the position of objects as a function of time.
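To make the data model concrete, the following is a minimal illustrative sketch (in Python, assuming NumPy; it is not taken from the patent) of a single point-cloud frame carrying per-point geometry and attributes together with a timestamp for temporal sequencing. The field names and attribute choices are assumptions for illustration only.

```python
# Illustrative sketch only: one temporal frame of a point-cloud representation,
# with per-point geometry (positions) and attributes (colour, opacity).
from dataclasses import dataclass
import numpy as np

@dataclass
class PointCloudFrame:
    timestamp: float       # capture instant, in seconds
    positions: np.ndarray  # (N, 3) float32 x/y/z coordinates
    colours: np.ndarray    # (N, 3) uint8 RGB values
    opacity: np.ndarray    # (N,) float32 values in [0, 1]

    def __post_init__(self):
        n = self.positions.shape[0]
        assert self.colours.shape[0] == n and self.opacity.shape[0] == n

# Example: a frame of 1000 random points at t = 0
frame = PointCloudFrame(
    timestamp=0.0,
    positions=np.random.rand(1000, 3).astype(np.float32),
    colours=np.random.randint(0, 256, size=(1000, 3), dtype=np.uint8),
    opacity=np.ones(1000, dtype=np.float32),
)
```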
Advances in computational resources and in three-dimensional acquisition devices enable reconstruction of highly-detailed volumetric representations. Infrared, laser, time-of-flight and structured light technologies are examples of how such content may be constructed. The representation of volumetric content may depend on how the data is to be used. For example, dense voxel arrays may be used to represent volumetric medical images. In three-dimensional graphics, polygon meshes are extensively used. Point clouds, on the other hand, are well suited to applications such as capturing real-world scenes where the topology of the scene is not necessarily a two-dimensional surface or manifold. Another method is to code three-dimensional data to a set of texture and depth maps. Closely related to this is the use of elevation and multi-level surface maps. For the avoidance of doubt, embodiments herein are applicable to any of the above technologies.
In technologies such as dense point clouds and voxel arrays, there may be tens or even hundreds of millions of points. In order to store and transport such content between entities, such as between a server and a client over an IP network, compression is usually required.
Embodiments herein relate to the processing of video patches representing volumetric three-dimensional content. Volumetric content may comprise data representing spaces and/or objects in three dimensions such that they can be viewed from different angles. Examples particularly relate to virtual reality content, e.g. for transmission to a virtual reality user device where the user may explore the volumetric content by movement when consuming the content. However, it should be appreciated that embodiments are not limited to such virtual reality applications.
For the avoidance of doubt, references to virtual reality are also intended to cover related technologies such as augmented reality (AR) and mixed reality (MR).
A video patch (hereafter “patch”) may refer to a group of volumetric pixels (voxels) or points of a point cloud or another kind of surface representation such as a polygon mesh or Bezier patch. In some embodiments, a patch may be determined by grouping together voxels or points having similar surface normals, e.g. within a predetermined range of angles. These voxels or points may be adjacent or relatively close together. Visibility of the voxels or points may be a grouping criterion, e.g. linked to surface normals. In some embodiments, multiple patches are determined based on respective surface normal similarities, and processing of said patches may depend on the visibility of said patches.
FIG. 1 is an example of a capture system 1 which may be used to capture video (and audio) signals for processing in accordance with various examples described herein. Although the capture of audio signals is described for completeness, embodiments relate mainly to video capture.
In this example, the capture system 1 comprises a spatial capture apparatus 10 configured to capture video and a spatial audio signal. The capture system 1 may also comprise one or more additional audio capture devices 12A, 12B, 12C.
The spatial capture apparatus 10 is configured to capture video content by way of a plurality of visual content capture devices 102A-G (e.g. cameras). The plurality of visual content capture devices 102A-G may be configured to capture visual content from different directions around the apparatus, thereby to provide volumetric content for consumption by users. In the example of FIG. 1, the spatial capture apparatus 10 is a presence-capture device, such as
Nokia’s OZO camera, being an array of cameras and microphones. However, as will be appreciated, the spatial capture apparatus 10 may be another type of device and/or may be made up of plural physically separate devices. As will also be appreciated, although the
content captured may be suitable for provision as immersive content, it may also be provided in a regular non-VR format for instance via a smart phone or tablet computer.
The spatial capture apparatus 10 may also comprise a plurality of audio capture devices 101A, B (e.g. directional or non-directional microphones) which are arranged to capture audio signals which may subsequently be spatially rendered into an audio stream in such a way that the reproduced sound is perceived by a listener as originating from at least one virtual spatial position. Typically, the sound captured by the spatial audio capture apparatus 10 is derived from plural different sound sources which may be at one or more different locations relative to the spatial capture apparatus 10. As the captured spatial audio signal includes components derived from plural different sound sources, it may be referred to as a composite audio signal. Although only two audio capture devices 101A, B are visible in FIG. 1, the spatial capture apparatus 10 may comprise more than two devices 101A, B. For instance, in some specific examples, the spatial capture apparatus 10 may comprise eight audio capture devices.
In the example of FIG. 1, the spatial capture system 1 may further comprise one or more additional audio capture devices 12A-C. Each of the additional audio capture devices 12A-C may comprise at least one microphone and, in the example of FIG. 1, the additional audio capture devices 12A-C are lavalier microphones configured for capture of audio signals derived from an associated user 13A-C. For instance, in FIG. 1, each of the additional audio capture devices 12A-C is associated with a different user by being affixed to the user in some way.
The locations of the additional audio capture devices 12A-C and/or the spatial capture apparatus 10 within the audio capture environment may be known by, or may be determinable by, the capture system 1 (for instance, a virtual reality processing apparatus 14). For instance, in the case of mobile capture devices, the devices may include a location determination component for enabling the location of the devices to be determined. In some specific examples, a radio frequency location determination system such as Nokia’s High Accuracy Indoor Positioning may be employed, whereby the additional audio capture devices 12A-C (and in some examples the spatial capture apparatus 10) transmit messages for enabling a location server to determine the location of the additional audio capture devices within the audio capture environment. In other examples, for instance when the additional audio capture devices 12A-C are static, the locations may be pre-stored by an entity which forms part of the capture system 1 (for instance, the virtual reality processing apparatus 14).
In some embodiments, the spatial capture system 1 may not include additional audio capture devices 12A-C.
In the example of FIG. 1, the capture system 1 further comprises the virtual reality processing apparatus 14. The virtual reality processing apparatus 14 may be a server, or it may be associated with another server. In embodiments herein, it is assumed that the virtual reality processing apparatus 14 also encodes and serves the content to one or more user devices, but this may be performed at a separate server system (not shown). This serving, or transmission, may be by streaming transmission or any other suitable transmission format.
The virtual reality processing apparatus 14 may be configured to receive and store signals captured by the spatial capture apparatus 10 and/or the one or more additional audio capture devices 12A-C. The signals may be received at the virtual reality processing apparatus 14 in real-time during capture of the audio and video signals or may be received subsequently, for instance via an intermediary storage device. In such examples, the virtual reality processing apparatus 14 may be local to the audio capture environment or may be geographically remote from the audio capture environment in which the capture apparatus 10 and devices 12A-C are provided. In some examples, the virtual reality processing apparatus 14 may even form part of the spatial capture apparatus 10.
FIG. 2 is a schematic view of the virtual reality processing apparatus 14 in relation to a network 16, which may be an Internet Protocol (IP) network such as the Internet, or any other form of data network, and a plurality of remote users 20A - 20C having respective user headsets 22A - 22C for consuming the content when rendered. The virtual reality processing apparatus 14 may stream the content over multiple transmission channels via the network 16. The remote users 20A - 20C may be co-located or located in separate real-world locations, possibly in different countries. What each remote user 20A - 20C sees through the video screens and/or headphones of their respective headsets 22A - 22C is part of a virtual space or scene.
In the context of this specification, a virtual space or scene is any computer-generated version of a space, for example the volumetric real world space captured using the capture system 1 shown in FIG. 1, in which one or more users 20A - 20C can be immersed. In some example embodiments, the virtual space may be entirely computer-generated, e.g. CGI. The headsets 22A - 22C may be of any suitable type. The headsets 22A - 22C may be configured to provide virtual reality video and/or audio content to the respective users 20A - 20C. As such, the users may be immersed in virtual space.
The headsets 22A - 22C may receive the content directly from the virtual reality processing apparatus 14, or, in some embodiments, from a separate media player 24 to which the headset is connected. For example, the media player 24 may include a games console, or a
personal computer (PC) configured to receive visual and/or audio data from the virtual reality processing apparatus, via the network 16, and communicate this to the headset 22A shown in FIG. 2. Alternatively, the media player 24 may form part of the headset 22A. In some examples, the media player 24 may comprise a mobile phone, smartphone or tablet computer configured to play content through its display.
The headsets 22A - 22C may include means for determining the spatial position of the respective users 20A - 20C and/or orientation of the respective user’s head. In some embodiments, therefore, the headsets 22A - 22C may track movement using six degrees-of-freedom. Over successive time frames, a measure of movement may therefore be calculated and stored. For example, the headsets 22A - 22C may incorporate motion tracking sensors which may include one or more of gyroscopes, accelerometers and structured light systems. These sensors generate position data from which a current position and visual field-of-view (FOV), in other words a viewport, is determined and updated as the one or more user headsets 22A - 22C change position and/or orientation. The headsets 22A - 22C may comprise gaze tracking means used to determine a direction of the user’s gaze, which can be used to determine an object of interest the user is looking at. The headsets 22A - 22C may comprise, or be associated with, other limb tracking means to determine the position or orientation of a limb of the user.
The headsets 22A - 22C may typically comprise two digital screens for displaying stereoscopic video images of the virtual space in front of respective eyes of the user, and also two speakers for delivering audio. The headsets 22A - 22C may comprise one or more cameras. Images from the one or more cameras may be presented to the user through the screens of the headsets 22A - 22C, such that the real world environment is displayed to the user in a “see-through mode”, or an augmented reality mode.
The example embodiments herein, which primarily relate to the delivery of virtual reality content, are not limited to a particular type of virtual reality headset 22A - 22C. Any form of user display device may be used.
In some example embodiments, the headsets 22A - 22C, or one or more systems connected to the headsets, such as the media player 24, may determine the spatial position and/or orientation of the respective users 20A - 20C within the virtual space. These may include measurements of pitch, roll and yaw and also translational movement in Euclidean space along side-to-side, front-to-back and up-and-down axes (i.e. six degrees-of-freedom).
The headsets 22A - 22C, or the media player 24, may be configured to display content based on the spatial position and/or the orientation of the respective headset. A detected change in spatial position and/or orientation, i.e. a form of movement, may result in a corresponding change in the visual and/or audio data to reflect a position or orientation transformation of the user 20A - 20C with reference to the space into which the visual and/or audio data is projected. This allows virtual reality content data to be consumed with the user 20A - 20C experiencing a three-dimensional (3D) virtual reality environment.
In the context of volumetric virtual reality spaces, this means that the user’s position can be detected relative to content provided within the volumetric virtual reality content, e.g. so that the user can move freely within a given virtual reality space, around individual objects or groups of objects, and can view the objects from different angles depending on the movement (e.g. rotation and location) of their head in the real world. In some examples, the user may also view and explore a plurality of different virtual reality spaces and move from one virtual reality space to another one.
The angular extent of the environment observable or hearable through the respective headsets 22A - 22C is called the visual field of view (FOV). The actual FOV observed or heard by a user depends on the inter-pupillary distance and on the distance between the lenses of the virtual reality headset 22A - 22C and the user’s eyes, but the FOV can be considered to be approximately the same for all users of a given display device when the virtual reality headset is being worn by the user. The portion of virtual reality content that is visible at a given time instant may be called a viewport.
When viewing volumetric content from a single viewpoint, a portion (often half) of the content may not be seen because it is facing away from the user. This portion is sometimes called “back facing content”. Given the limited processing power of user devices, such as the headsets 22A - 22C and/or the media player 24, limiting the amount of data required to be processed, whether in terms of the virtual reality processing apparatus 14 (or other server) encoding and transmitting the data and/or the headsets 22A - 22C and/or media player 24 rendering the data, offers technical advantages. This may be particularly important where, for example, the headsets 22A - 22C are using processing power to run cameras (e.g. in the case of augmented reality) or other applications.
For example, in a city street scene, there is generally no need to encode, transmit and/or render objects on the sides of the buildings facing away from the field of view, because they are completely occluded.
Embodiments may comprise providing a plurality of patches representing part of a volumetric scene, and providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible. Embodiments may further comprise providing one or more viewing positions associated with a client device, and processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
Advantages of such embodiments include increased processing efficiency, particularly at client devices, such as at the headsets 22A - 22C and/or the media player 24. This is because the amount of data that is encoded, transmitted, decoded and/or rendered is limited based on what parts of objects and/or scenes are visible and what are back facing.
The patches may be provided by estimating surface normals from points of the volumetric scene, and grouping together points having similar surface normals to provide a given patch.
A surface normal may be considered a vector which is perpendicular to the surface of the volumetric scene, or a part thereof. The vector may point outwards from the surface.
FIG. 3 is a block diagram of an example volumetric video processing pipeline (hereafter “pipeline”) 30 from content processing to content rendering. The pipeline 30 comprises a plurality of modules, some of which represent storage and some of which represent processing modules. In the shown embodiment, some modules are provided at the virtual reality processing apparatus 14 (or other server) and other modules are provided at a client end, such as the second headset 22B, although any client end device is applicable. At the second headset 22B, a client rendering module 38 is provided for receiving, decoding and rendering content to the display screens of the headset. A client view tracking module 39 is provided for transmitting position information indicative of the current or a predicted field-of-view of the second headset 22B which may be fed back to the virtual reality processing apparatus 14. The position information may be generated using previously-described systems and methods for spatial position determination.
In some embodiments, the patch transcoding and packing module 36 and the patch culling module 37 may be provided in a network edge node that performs real-time processing for content streaming. In this scenario, the remaining modules 31, 32, 33, 34 of the virtual reality processing apparatus 14 may provide off-line processing of content data.
In some embodiments, the patch transcoding and packing of module 36 may be performed at the client end, e.g. based on a prescription as described later.
In some embodiments, the patch culling may be performed by the client end. The client end may resolve addresses or identifiers, such as Uniform Resource Locators (URLs), for patches to be received and request the patches using those addresses or identifiers.
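As a hypothetical sketch of this client-side step, the snippet below maps patch identifiers that the client has already judged visible (for example using a cone test of the kind described later) to request URLs. The URL template and segment-based addressing are assumptions for illustration only; they are not defined by the patent or by any particular standard.

```python
# Hypothetical sketch: resolve request URLs for only the patches judged visible.
# The template below is an assumed addressing scheme, not a standardised one.
def resolve_patch_urls(visible_patch_ids, segment_index,
                       template="https://patch-server.example/seg{seg}/patch{pid}.bin"):
    return [template.format(seg=segment_index, pid=pid) for pid in visible_patch_ids]

# Example: request only patches 3 and 7 of segment 42
urls = resolve_patch_urls([3, 7], 42)
```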
Fewer or a greater number of modules may be provided, and it will be appreciated that such modules may be provided in hardware or software form, or a combination thereof.
In the FIG. 3 embodiment, data representing volumetric video content 31 is provided, all or some of which may be encoded and transmitted to one or more of the headsets 22A - 22C and/or the media player 24 for decoding and rendering. The content 31 may represent any three-dimensional object or scene.
The pipeline 30 may comprise a normal generation module 32 for determining or estimating surface normals from at least some of the volumetric content. Surface normals represent vectors perpendicular to a surface.
To illustrate, FIG. 4A shows part of a point cloud 40 comprised of a plurality of points 41. Methods and systems for generating the point cloud 40 from a captured or CGI object or scene are known. Further information is provided at http://pointclouds.org. The underlying object 42 which the point cloud 40 represents is shown in dotted lines for reference. In other embodiments, the points 41 may represent voxels. FIG. 4B shows estimated or determined normals 50 (particularly, surface normals) for each point 41 of the point cloud 40, i.e. vectors perpendicular to the surface of the object 42. The surface normals 50 vary in angle or orientation because the object 42 is non-planar.
The pipeline 30 may also comprise a patch optimisation module 33, which is configured to determine patches based on one or more rules. For example, a patch may comprise all points 41 having surface normals 50 in the same direction. Additionally, a patch may comprise all points 41 having an orientation or opening angle within a particular range, e.g. 15 degrees, which may be defined by a view cone 51. Other opening angles may be used. For example, the first and second shown surface normals 44, 45 may be determined to be similar, whereas the surface normal 46 may be considered non-similar because it is outside of the view cone 51. An example patch 47 is shown in shaded representation. The patch 47 is a volumetric patch in that the points within it represent three-dimensional content.
Multiple patches 47, 48, 49 may be determined in this manner until all points 41 of the point cloud 40 are allocated to a patch. Partial views of second and third patches 48, 49 are shown for illustration. It follows that each patch 47, 48, 49 may be of a different size and/or shape. The pipeline 30 may therefore provide and store a set of volumetric patches in a volumetric patches module 34 (see FIG. 3).
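A minimal sketch of this grouping step is shown below (Python with NumPy; it is not the patent's reference implementation): points are greedily assigned to a patch while their surface normals stay within the seed normal's view cone, here with a 15-degree opening angle. Spatial adjacency checks, which a practical implementation would also apply, are omitted.

```python
# Greedy grouping of points into patches by surface-normal similarity (sketch only).
# Normal estimation is assumed to have been done already; adjacency is ignored here.
import numpy as np

def group_points_by_normals(normals, opening_angle_deg=15.0):
    """normals: (N, 3) array of unit surface normals -> list of index arrays (patches)."""
    cos_thresh = np.cos(np.radians(opening_angle_deg))
    unassigned = set(range(len(normals)))
    patches = []
    while unassigned:
        seed = unassigned.pop()
        axis = normals[seed]
        members = [seed]
        for i in list(unassigned):
            # A point joins the patch if its normal lies within the seed's view cone.
            if float(np.dot(normals[i], axis)) >= cos_thresh:
                members.append(i)
                unassigned.remove(i)
        patches.append(np.array(members))
    return patches
```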
Another operation provided by the patch optimisation module 33 is to determine patch visibility information 35. An example embodiment for determining patch visibility information 35 is described, but it should be appreciated that other methods are possible. Patch visibility information is data indicative of where in the volumetric space the forward surface of the patch can be seen.
Patch visibility information 35 for each patch 47, 48, 49 may be calculated from patch normals of each respective patch. For example, a visibility cone may be determined, as shown in FIG. 4C, for the first patch 47, which comprises a visibility cone direction vector (X, Y, Z) and an opening angle (A). The opening angle (A) defines a set of spatial angles from which the forward surface of the first patch 47 can be seen. The visibility cone can be determined using the minimal cone that encloses all normal vector directions for the given patch 47. First, for the given patch 47, a normal cone angle is determined, for example by determining the largest angle between all normals for the patch and by minimizing the dot product between all normals in the patch. Once the two normals that have the largest angle between them have been determined, the normal cone direction vector can be calculated halfway between them, e.g. by summing these two vectors together and normalizing the result. The visibility cone may be determined from the normal cone by using the same direction vector, and adding 180 degrees to the opening angle to account for visibility of forward surfaces. For concave patches, the visibility cone can be additionally optimized by considering self-shadowing of the patch, by applying ray-tracing or other methods to detect the visibility.
To summarize, each patch 47, 48, 49 will have a normal vector (X, Y, Z) and angle (A) as additional parameters, which may be represented as a visibility cone for the patch. In a practical implementation, the data (X, Y, Z) and (A) may describe patch visibility over a group of frames rather than per frame.
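The following sketch (Python with NumPy) follows the derivation just described: find the two normals of a patch with the largest mutual angle, take the normal cone axis halfway between them, and widen the opening angle by 180 degrees to obtain the visibility cone. The brute-force pair search and the degree-based angle representation are illustrative choices, not the patent's reference implementation.

```python
# Sketch of the visibility-cone derivation: normal cone from the two most divergent
# normals, then the opening angle widened by 180 degrees. O(N^2) search, for clarity.
import numpy as np

def visibility_cone(normals):
    """normals: (N, 3) unit surface normals of one patch -> (axis, opening_angle_deg)."""
    n = len(normals)
    min_dot, best = 1.0, (0, 0)
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.dot(normals[i], normals[j]))
            if d < min_dot:          # smallest dot product = largest angle
                min_dot, best = d, (i, j)
    # Normal cone axis: halfway between the two extreme normals (sum and normalise).
    axis = normals[best[0]] + normals[best[1]]
    norm = np.linalg.norm(axis)
    axis = axis / norm if norm > 0 else normals[best[0]]
    normal_cone_angle = np.degrees(np.arccos(np.clip(min_dot, -1.0, 1.0)))
    # Visibility cone: same axis, opening angle widened by 180 degrees.
    return axis, normal_cone_angle + 180.0
```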
The patch visibility information may be provided as patch visibility metadata for each respective patch 47, 48, 49. The patch visibility metadata may be used for encoding the video bit stream containing each patch 47, 48, 49. The containing structures for patch visibility metadata may include but are not limited to the following:
- supplemental enhancement information (SEI) messages, e.g. as defined in H.264/AVC or HEVC (a.k.a. H.265);
- structures of a containing file format used in encapsulating and/or providing metadata for the video bit stream; for example, in files conforming to the ISO base media file format (ISOBMFF, ISO/IEC 14496-12), timed metadata structures, such as a timed metadata track, a sample group, sample auxiliary information, or SubSampleInformationBox may be used;
- structures within manifests or descriptions of media content, such as the session description protocol (SDP) or dynamic adaptive streaming over HTTP (DASH, ISO/IEC 23009-1) MPD; for example, a descriptor element could contain patch visibility metadata in a DASH MPD.
In one embodiment, the patch visibility metadata may comprise the visibility cone direction vector (X, Y, Z) and the angle (A). It should be noted that the visibility cone direction vector (X, Y, Z) and angle value (A) may be heavily quantized (e.g. 6 bits per X, Y, Z and A) and represented only using small numbers. Therefore, the additional per-patch storage in the video bit stream may be minimal. Such quantization may be performed conservatively so that the quantized cone encloses the original non-quantized cone.
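A hedged sketch of such conservative quantization is given below: each axis component is mapped from [-1, 1] to a 6-bit code and the opening angle is always rounded up, so the dequantized cone is never narrower than the original. The exact bit allocation and scaling are assumptions; a fully conservative scheme would additionally widen the angle to absorb the error introduced by quantizing the axis.

```python
# Illustrative 6-bit quantization of the visibility-cone metadata (X, Y, Z, A).
import numpy as np

def quantize_cone(axis, opening_angle_deg, bits=6):
    levels = (1 << bits) - 1                                   # 63 codes for 6 bits
    q_axis = np.round((np.asarray(axis, dtype=float) + 1.0) / 2.0 * levels).astype(int)
    # Round the angle up so the quantized cone encloses the original one.
    q_angle = min(int(np.ceil(opening_angle_deg / 360.0 * levels)), levels)
    return q_axis, q_angle

def dequantize_cone(q_axis, q_angle, bits=6):
    levels = (1 << bits) - 1
    axis = np.asarray(q_axis, dtype=float) / levels * 2.0 - 1.0
    norm = np.linalg.norm(axis)
    axis = axis / norm if norm > 0 else np.array([0.0, 0.0, 1.0])
    return axis, q_angle / levels * 360.0
```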
In another embodiment, the patch visibility metadata may comprise a definition of a bounding sphere surface and sphere region metadata, identical or similar to that specified by the omnidirectional media format (OMAF) standard (ISO/IEC 23090-2). The bounding sphere surface may for example be defined by a three-dimensional location of the centre of the sphere, and the radius of the sphere. When the viewing position collocates with the bounding sphere surface, the patch may be considered visible within the indicated sphere region. This embodiment may for example suit complex scenes with distinct objects that can be surrounded by a bounding sphere surface.
In general, the geometry of the bounding surface may also be something other than a sphere, such as a cylinder, cube, or cuboid. Multiple sets of patch visibility metadata may be defined for the same three-dimensional location of the centre of the bounding surface, but with different radii (or information indicative of the distance of the bounding surface from the three-dimensional location). Indicating several pieces of patch visibility metadata may be beneficial to handle occlusions.
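A minimal sketch of a bounding-sphere visibility test is shown below, assuming a simple centre-plus-radius encoding and an azimuth/elevation window for the sphere region; the axis convention, the tolerance on "collocating" with the sphere surface, and the lack of azimuth wrap-around handling are all simplifying assumptions rather than OMAF-conformant behaviour.

```python
# Simplified sphere-region visibility test (sketch only; not OMAF-conformant).
import numpy as np

def visible_in_sphere_region(view_pos, centre, radius,
                             centre_azimuth, centre_elevation,
                             azimuth_range, elevation_range, tol=1e-2):
    """All angles in degrees. True if the viewing position lies (approximately) on the
    bounding sphere surface and within the indicated sphere region."""
    offset = np.asarray(view_pos, dtype=float) - np.asarray(centre, dtype=float)
    dist = np.linalg.norm(offset)
    if abs(dist - radius) > tol:          # viewer does not collocate with the surface
        return False
    d = offset / dist
    azimuth = np.degrees(np.arctan2(d[1], d[0]))
    elevation = np.degrees(np.arcsin(np.clip(d[2], -1.0, 1.0)))
    return (abs(azimuth - centre_azimuth) <= azimuth_range / 2.0 and
            abs(elevation - centre_elevation) <= elevation_range / 2.0)
```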
In some cases the metadata structure may describe several visibility cones.
It should be understood that patch visibility metadata may be approximate spatially and/or temporally. For example, the patch visibility metadata may describe a visibility cone that may
contain locations from where the patch is not visible but encloses all locations from where the patch is visible.
In another example, the patch visibility metadata may describe a visibility cone that guarantees the visibility of the patch regardless of the location within the cone, while there may be additional locations outside the visibility cone from where the patch is also visible.
The type of the patch visibility metadata (e.g. its accuracy spatially and/or temporally) may be pre-defined, e.g. in a coding standard, or it may be indicated in or along the bit stream by an encoder, and/or it may be decoded from or along the bit stream by a decoder.
In some embodiments, instead of, or in addition to, binary visibility, the patch visibility metadata may be indicative of the absolute and/or relative picture quality of a texture patch. For example, quality ranking metadata similar to the OMAF and/or DASH standards may be provided with the patch visibility metadata.
The patch visibility metadata for each patch (e.g. the visibility cones as indicated in FIG. 4C, or other metadata generated using any of the other methods) may be provided to a patch culling module 37.
The patch culling module 37 may be configured to determine which patches 47, 48, 49 are transmitted to a user device, for example the client rendering module 38 of the headset 22B. The client rendering module 38 may be hardware, software or a combination thereof for receiving, decoding and rendering received patch data to the screens of the headset 22B based on view parameters generated by the client view tracking module 39.
In some embodiments, the patch culling module 37 is configured to cull (i.e. suppress or remove) patches 47, 48, 49 which are not visible to the user, as determined based on the view parameters and the patch visibility metadata. That is, if it is determined that one or more patches 47, 48, 49 cannot be seen based on the current field-of-view, or will not be seen based on a predicted future field-of-view, then those one or more patches are not encoded or transmitted.
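By way of a hedged illustration of this culling decision (not the exact logic of the patch culling module 37), the following sketch keeps a patch when the direction from the patch towards at least one current or predicted viewing position lies inside the patch's visibility cone; the dictionary keys and helper names are assumptions made for the example:

    import math

    def patch_visible(patch_pos, cone_axis, cone_angle_deg, view_pos):
        """Illustrative test: a patch is treated as visible if the direction
        from the patch towards the viewing position falls inside its
        visibility cone (cone_angle_deg is the full apex angle)."""
        to_viewer = [v - p for v, p in zip(view_pos, patch_pos)]
        norm = math.sqrt(sum(c * c for c in to_viewer)) or 1.0
        to_viewer = [c / norm for c in to_viewer]

        a_norm = math.sqrt(sum(c * c for c in cone_axis)) or 1.0
        axis = [c / a_norm for c in cone_axis]

        cos_angle = sum(a * b for a, b in zip(axis, to_viewer))
        return cos_angle >= math.cos(math.radians(cone_angle_deg / 2.0))

    def cull_patches(patches, view_positions):
        """Keep only patches visible from at least one current or predicted
        viewing position; the patch dictionaries and keys are illustrative."""
        kept = []
        for patch in patches:
            if any(patch_visible(patch["pos"], patch["cone_axis"],
                                 patch["cone_angle"], vp)
                   for vp in view_positions):
                kept.append(patch)
        return kept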
In other embodiments, all patches 47, 48, 49 may be encoded and transmitted, together with the patch visibility metadata, and the client rendering module 38 determines, based on the locally determined view parameters and the received patch visibility metadata, whether or not to decode and render one or more patches. That is, if it is determined by the client rendering module 38 that one or more patches 47, 48, 49 cannot be seen based on the current field-of-view, or will not be seen based on a predicted future field-of-view, then those one or more patches are not decoded and/or rendered.
In yet other embodiments, culling may happen in both stages, so that the patch culling module
37 does initial culling based on a predicted field-of-view, and the client rendering module 38 does more fine-grained culling based on the final view parameters in effect during rendering.
In any case, there is a saving in terms of the processing resources involved.
For example, referring back to FIG. 3, a patch transcoding and packing module 36 may be configured to receive each patch 47, 48, 49 stored in the volumetric patch module 34 and to encode and/or transmit patches based on feedback from the patch culling module 37.
An example operation of the patch transcoding and packing module 36 will now be described.
Each patch 47, 48, 49 may be projected to a two-dimensional colour (or other form of texture) image and to a corresponding depth image, also known as a depth map. This conversion enables each patch 47, 48, 49 to be converted back to volumetric form at a client rendering module 38 of the headset 22B using both images. It should be understood that embodiments are not limited to colour and depth patches, but can be realized additionally or alternatively for other types of patches, such as reflectance, opacity or transparency (e.g. alpha channel patches), surface normal, albedo, and/or other material or surface attribute patches.
In some embodiments, the two-dimensional form of each patch 47, 48, 49 may be packed into one or more atlases. Texture atlases are known in the art, comprising an image consisting of sub-images, the image being treated as a single unit by graphics hardware and which can be compressed and transmitted as a single image for subsequent identification and decompression. Referring to FIG. 5, first and second atlases 52, 53 are provided, one for the colour or texture images and one for the depth images. As shown, the first atlas 52 stores a two-dimensional colour image 54 of the first patch 47 as a first sub-image and the second atlas 53 stores a corresponding depth image 55 of the first patch 47 as a sub-image at the corresponding location. Images derived from the other patches 48, 49 may be stored at different respective regions of the first and second atlases 52, 53.
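The packing described above might be sketched as follows, assuming 8-bit colour and depth planes and NumPy arrays as the atlas representation; the function name, atlas sizes and the choice to keep depth in a single plane are illustrative assumptions:

    import numpy as np

    def pack_patch(colour_atlas, depth_atlas, colour_img, depth_img, x, y):
        """Place a patch's colour image and its depth map at the same (x, y)
        location of the first and second atlas respectively, so that a single
        sub-image layout describes both atlases."""
        h, w = colour_img.shape[:2]
        colour_atlas[y:y + h, x:x + w] = colour_img
        depth_atlas[y:y + h, x:x + w, 0] = depth_img   # depth kept in one plane
        return (x, y, w, h)                            # record the sub-image region

    # Example: two 1024x1024 atlases, an RGB colour atlas and a one-plane depth atlas.
    colour_atlas = np.zeros((1024, 1024, 3), dtype=np.uint8)
    depth_atlas = np.zeros((1024, 1024, 1), dtype=np.uint8)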
The first and second atlases 52, 53 may then be encoded, e.g. compressed, as video frames using any known method and transmitted, e.g. streamed, to the headset 22B for decoding and rendering.
The sub-image layout in the first and second atlases 52, 53 for the patches 47, 48, 49 may be optimized by placing patches that have similar (e.g. in terms of direction and/or angle) view cones 51 adjacent to one another. As will be explained, combined patch visibility metadata for the patches 47, 48, 49 may be generated additionally or alternatively to individual patch visibility metadata which applies to respective ones of the patches.
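One possible (greedy, non-optimal) way to group patches with similar view cones before packing is sketched below; the similarity thresholds and dictionary keys are example assumptions, and the cone axes are assumed to be unit length:

    import math

    def similar_cones(cone_a, cone_b, axis_thresh_deg=30.0, angle_thresh_deg=20.0):
        """Two visibility cones are treated as similar when their (unit) axes
        differ by at most axis_thresh_deg and their opening angles by at most
        angle_thresh_deg; the thresholds are arbitrary example values."""
        dot = sum(a * b for a, b in zip(cone_a["axis"], cone_b["axis"]))
        axis_diff = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
        angle_diff = abs(cone_a["angle"] - cone_b["angle"])
        return axis_diff <= axis_thresh_deg and angle_diff <= angle_thresh_deg

    def group_patches_by_cone(patches):
        """Greedily group patches with similar visibility cones so that each
        group can be packed into adjacent atlas regions (and, later, into one
        independently decodable spatiotemporal unit)."""
        groups = []
        for patch in patches:
            for group in groups:
                if similar_cones(group[0]["cone"], patch["cone"]):
                    group.append(patch)
                    break
            else:
                groups.append([patch])
        return groups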
The sub-image layout in the first and second atlases 52, 53 may also be organized such that it is possible to encode a patch 47, 48, 49 or a set of patches having similar visibility cones 51 into spatiotemporal units that can be decoded independently of other spatiotemporal units. For example, a tile grid, as understood in the context of High Efficiency Video Coding (HEVC), may be selected for encoding and the first and second atlases 52, 53 may be organized in a manner such that a patch 47, 48, 49 or a group of patches having similar visibility cones can be encoded as a motion-constrained tile set (MCTS), as understood in the context of HEVC.
Characteristics of an MCTS may be described as follows. An MCTS is such that the inter prediction process is constrained in encoding such that no sample value outside the motion-constrained tile set, and no sample value at a fractional sample position that is derived using one or more sample values outside the motion-constrained tile set, is used for inter prediction of any sample within the motion-constrained tile set. Additionally, the encoding of an MCTS is constrained in a manner that motion vector candidates are not derived from blocks outside the MCTS. In general, an MCTS may be defined to be a tile set that is independent of any sample values and coded data, such as motion vectors, that are outside the MCTS. In some cases, an MCTS may be required to form a rectangular area. It should be understood that depending on the context, an MCTS may refer to the tile set within a picture or to the respective tile set in a sequence of pictures. The respective tile set may be, but in general need not be, collocated in the sequence of pictures. It needs to be understood that even though some examples and embodiments are described with respect to MCTSs, they could be similarly realized with other similar concepts of independently decodable spatiotemporal units.
In some embodiments, one or more (but not the entire set of) spatiotemporal units may be provided and stored as a track, as is understood in the context of the ISO base media file format, or as any similar container file format structure. Such a track may be referred to as a
patch track. Patch tracks may for example be sub-picture tracks, as understood in the context of OMAF, or tile tracks, as understood in the context of ISO/IEC 14496-15.
In some embodiments, the entire coded video bit stream (containing the entire set of spatiotemporal units) may be stored as a track, as understood in the context of the ISO base media file format, or as any similar container file format structure.
In some embodiments, the atlas layouts may be organized in such a way that a patch or a set of patches 47, 48, 49 having similar visibility cones form a spatiotemporal unit that can be extracted and encoded individually as a video bit stream. This may typically mean that the width and height of the spatiotemporal unit stays unchanged over time, at least for several successive frames. Several sub-picture bit streams of the same atlas may be encoded, for example one per patch 47, 48, 49 or per set of patches having similar visibility cones. Encoding may be constrained in a manner such that merging of an encoded sub-picture bit stream into another bit stream that can be decoded with a standard-conforming decoder is enabled. For example, sub-picture bit streams may be encoded in a way such that dependencies on samples outside of the decoded picture boundaries are avoided in the encoding by selecting motion vectors in a manner that sample locations outside the picture are not referred to in the inter-prediction process. Sub-picture bit streams may be stored as patch tracks, as discussed above.
In some embodiments, several versions of the one or more atlases are encoded. Different versions may include, but are not limited to, one or more of the following:
different bitrate versions of the one or more atlases at the same resolution;
different spatial resolutions of the atlases; and
different versions for different random access intervals; these may include one or more intra-coded atlases (where every picture can be randomly accessed).
In some embodiments, combinations of patches 47, 48, 49 from different versions of the texture atlas may be prescribed and described as metadata, such as extractor tracks, as will be understood in the context of OMAF and/or ISO/IEC 14496-15. The prescriptions may be authored on the basis of one or more of the following, but are not limited to such.
When the total sample count of a texture atlas and, in some cases, of the respective geometry pictures and/or other auxiliary pictures (if any) exceeds a limit, such as a level limit of a video codec, a prescription may be authored in a manner so that the limit is obeyed. For example, patches may be selected from a lower-resolution texture atlas according to subjective importance. The selection may be performed in a manner that is not related to the viewing position. The prescription may be accompanied by metadata characterizing the obeyed limit(s), e.g. the codec Level that is obeyed.
A prescription may be made specific to a visibility cone and hence excludes the patches 47, 48, 49 not visible in the visibility cone. The selection of visibility cones for which the prescriptions are generated may be limited to a reasonable number, such that switching from one prescription to another is not expected to occur frequently. The visibility cones of prescriptions may overlap to avoid switching back and forth between two prescriptions. The prescription may be accompanied by metadata indicative of the visibility cone.
A prescription may use a specific grid or pattern of independent spatiotemporal units. For example, a prescription may use a certain tile grid, wherein tile boundaries are also MCTS boundaries. The prescription may be accompanied by metadata indicating potential sources (e.g. track groups, tracks, or representations) that are suitable as spatiotemporal units.
Selective Transmission of Patches
Patches 47, 48, 49 may be selectively transmitted by the patch transcoding and packing module 36 based on visibility. For example, for a group of frames, the metadata for all patches 47, 48, 49 can be accessed first, and only the patches visible based on the current or predicted viewing parameters received from the client view tracking module 39 may be streamed.
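A hedged sketch of this per-group-of-frames selection is given below; the metadata keys are assumptions, and the visibility test and delivery path are passed in as stand-in callables rather than implying any particular module interface:

    def stream_group_of_frames(patch_metadata, view_positions, is_visible, send):
        """For one group of frames: inspect the visibility metadata of all
        patches first, then stream only those judged visible for the current
        or predicted viewing positions. 'is_visible' and 'send' stand in for
        the real visibility test and delivery path."""
        visible_ids = [
            meta["patch_id"] for meta in patch_metadata
            if any(is_visible(meta, vp) for vp in view_positions)
        ]
        for patch_id in visible_ids:
            send(patch_id)   # e.g. transmit the corresponding sub-picture track
        return visible_ids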
In one example, selective transmission may be controlled by the patch transcoding and packing module 36, i.e. at the server end. However, selective streaming may also be controlled by the client end, i.e. hardware and/or software in the headset 22B. This may be accomplished by having separate streams mapped to individual URIs, e.g. based on a manifest or description of media content, by a cloud component (e.g. transcoding/packing in an edge cloud node), or by a streaming server.
Some concepts, structures, and specifications of DASH are described below as an example, based on which the embodiments may be implemented. The aspects of the invention are not limited to DASH, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized.
In DASH, the multimedia content may be stored on an HTTP server and may be delivered using HTTP. The content may be stored on the server in two parts: the Media Presentation Description (MPD), which describes a manifest of the available content, its various alternatives, their URL addresses, and other characteristics; and segments, which contain the actual multimedia bit streams in the form of chunks, in a single or multiple files. The MPD provides the necessary information for clients to establish dynamic adaptive streaming over HTTP. The MPD contains information describing the media presentation, such as an HTTP uniform resource locator (URL) of each segment for making GET segment requests. To play the content, the DASH client may obtain the MPD e.g. by using HTTP, email, thumb drive, broadcast, or other transport methods. By parsing the MPD, the DASH client may become aware of the program timing, media-content availability, media types, resolutions, minimum and maximum bandwidths, and the existence of various encoded alternatives of multimedia components, accessibility features and required digital rights management (DRM), media-component locations on the network, and other content characteristics. Using this information, the DASH client may select the appropriate encoded alternative and start streaming the content by fetching the segments using e.g. HTTP GET requests. After appropriate buffering to allow for network throughput variations, the client may continue fetching the subsequent segments and also monitor the network bandwidth fluctuations. The client may decide how to adapt to the available bandwidth by fetching segments of different alternatives (with lower or higher bitrates) to maintain an adequate buffer.
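For illustration only, a heavily simplified DASH client step might fetch and parse the MPD and pick a Representation that fits the measured throughput, as sketched below; real clients additionally handle BaseURLs, SegmentTemplates, availability times and much more, and the helper names are assumptions:

    import urllib.request
    import xml.etree.ElementTree as ET

    MPD_NS = "{urn:mpeg:dash:schema:mpd:2011}"

    def fetch_mpd(mpd_url):
        """Fetch and parse the MPD into an XML element tree."""
        with urllib.request.urlopen(mpd_url) as resp:
            return ET.fromstring(resp.read())

    def pick_representation(mpd, max_bandwidth):
        """Pick the highest-bandwidth Representation not exceeding the
        measured throughput (in bits per second)."""
        best = None
        for rep in mpd.iter(MPD_NS + "Representation"):
            bw = int(rep.get("bandwidth", 0))
            if bw <= max_bandwidth and (best is None or
                                        bw > int(best.get("bandwidth", 0))):
                best = rep
        return best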
In the context of DASH, the following definitions may be used: a media content component or a media component may be defined as one continuous component of the media content with an assigned media component type that can be encoded individually into a media stream. Media content may be defined as one media content period or a contiguous sequence of media content periods. Media content component type may be defined as a single type of media content such as audio, video, or text. A media stream may be defined as an encoded version of a media content component.
In DASH, a hierarchical data model is used to structure the media presentation as follows. A media presentation consists of a sequence of one or more Periods; each Period contains one or more Groups; each Group contains one or more Adaptation Sets; each Adaptation Set contains one or more Representations; and each Representation consists of one or more Segments. A Group may be defined as a collection of Adaptation Sets that are not expected to be presented simultaneously. An Adaptation Set may be defined as a set of interchangeable encoded versions of one or several media content components. A Representation is one of the alternative choices of the media content or a subset thereof, typically differing by the encoding choice, e.g. by bitrate, resolution, language, codec, etc. A Segment contains a certain duration of media data, and metadata to decode and present the included media content. A Segment is identified by a URI and can typically be requested by an HTTP GET request. A Segment may be defined as a unit of data associated with an HTTP-URL and optionally a byte range that are specified by an MPD.
The DASH MPD complies with Extensible Markup Language (XML) and is therefore specified through elements and attributes as defined in XML. Attributes in an XML document may be identified by a lower-case first letter and may be preceded by a '@' sign, e.g. @attribute. To point to a specific attribute @attribute contained in an element Element, one may write Element@attribute.
In DASH, all descriptor elements are structured in the same way, namely they contain a @schemeIdUri attribute that provides a URI to identify the scheme, an optional attribute @value, and an optional attribute @id. The semantics of the element are specific to the scheme employed. The URI identifying the scheme may be a URN or a URL. Some descriptors are specified in MPEG-DASH (ISO/IEC 23009-1), while descriptors can additionally or alternatively be specified in other specifications. When specified in specifications other than MPEG-DASH, the MPD does not provide any specific information on how to use descriptor elements. It is up to the application or specification that employs DASH formats to instantiate the description elements with appropriate scheme information. Applications or specifications that use one of these elements define a Scheme Identifier in the form of a URI and the value space for the element when that Scheme Identifier is used. The Scheme Identifier appears in the @schemeIdUri attribute. In the case that a simple set of enumerated values is required, a text string may be defined for each value and this string may be included in the @value attribute. If structured data is required, then any extension element or attribute may be defined in a separate namespace. The @id value may be used to refer to a unique descriptor or to a group of descriptors. In the latter case, descriptors with identical values for the attribute @id may be required to be synonymous, i.e. the processing of one of the descriptors with an identical value for @id is sufficient.
An Initialization Segment may be defined as a Segment containing metadata that is necessary to present the media streams encapsulated in Media Segments. In ISOBMFF based segment formats, an Initialization Segment may comprise the Movie Box ('moov') which might not include metadata for any samples, i.e. any metadata for samples is provided in Movie Fragment ('moof') boxes.
A Media Segment contains a certain duration of media data for playback at a normal speed; such duration is referred to as the Media Segment duration or Segment duration. The content producer or service provider may select the Segment duration according to the desired characteristics of the service. For example, a relatively short Segment duration may be used in a live service to achieve a short end-to-end latency. The reason is that Segment duration is typically a lower bound on the end-to-end latency perceived by a DASH client, since a Segment is a discrete unit of generating media data for DASH. Content generation is typically done in such a manner that a whole Segment of media data is made available for a server.
Furthermore, many client implementations use a Segment as the unit for GET requests. Thus, in typical arrangements for live services a Segment can be requested by a DASH client only when the whole duration of Media Segment is available as well as encoded and encapsulated into a Segment. For on-demand service, different strategies of selecting Segment duration may be used.
A Segment may be further partitioned into Subsegments, e.g. to enable downloading segments in multiple parts. Subsegments may be required to contain complete access units. Subsegments may be indexed by a Segment Index box, which contains information to map the presentation time range and byte range for each Subsegment. The Segment Index box may also describe subsegments and stream access points in the segment by signaling their durations and byte offsets. A DASH client may use the information obtained from Segment Index box(es) to make an HTTP GET request for a specific Subsegment using a byte-range HTTP request. If a relatively long Segment duration is used, then Subsegments may be used to keep the size of HTTP responses reasonable and flexible for bitrate adaptation. The indexing information of a segment may be put in a single box at the beginning of that segment, or spread among many indexing boxes in the segment. Different methods of spreading are possible, such as hierarchical, daisy chain, and hybrid. This technique may avoid adding a large box at the beginning of the segment and therefore may prevent a possible initial download delay.
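A byte-range request for one Subsegment might look like the following sketch, assuming the byte offset and length have already been read from the Segment Index box; the function name is illustrative:

    import urllib.request

    def fetch_subsegment(segment_url, byte_offset, byte_length):
        """Issue an HTTP GET with a byte range for one Subsegment, as indexed
        by the Segment Index ('sidx') box."""
        req = urllib.request.Request(segment_url)
        req.add_header("Range", "bytes=%d-%d" % (byte_offset,
                                                 byte_offset + byte_length - 1))
        with urllib.request.urlopen(req) as resp:
            return resp.read()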
A DASH Preselection defines a subset of media components of an MPD that are expected to be consumed jointly by a single decoder instance, wherein consuming may comprise decoding and rendering. The Adaptation Set that contains the main media component for a Preselection is referred to as the main Adaptation Set. In addition, each Preselection may include one or multiple partial Adaptation Sets. Partial Adaptation Sets may need to be processed in combination with the main Adaptation Set. A main Adaptation Set and partial Adaptation Sets may be indicated by one of two means: a preselection descriptor or a Preselection element.
Examples of client-driven streaming operations are described below.
In some embodiments, a patch track forms a Representation in the context of DASH. Consequently, the Representation element in DASH MPD may provide metadata on the patch, such as patch visibility metadata, related to the patch track.
In some embodiments, the client rendering module 38 obtains the patch visibility metadata. Such “obtaining” may include but is not limited to one or more of the following.
If the patch visibility metadata is provided in structures within manifests or descriptions of media content, the client rendering module 38 may obtain the metadata together with other necessary metadata to request patches 47, 48, 49 from the patch transcoding and packing module 36.
If the patch visibility metadata is provided as timed metadata track(s), the tracks may be announced as timed metadata representation(s) in a manifest and may be indicated as associated with related media representation(s). For example, the @associationId attribute may be used in DASH MPD to indicate an association between metadata and media Representations and the @associationType attribute may be used to indicate the type of the association, which may be indicative of patch visibility metadata.
The patch visibility metadata may be provided within an Initialization Segment of the media representation(s). For example, the SampleGroupDescriptionBox may contain a possible set of patch visibility metadata, which may be considered a superset of dynamic time-varying patch visibility metadata. In this case, the client rendering module 38 may obtain the patch visibility metadata as a part of the Initialization Segment of the media Representation(s) that are considered as sources for fetching subsegments.
The patch visibility metadata may be provided in an initial part of each patch 47, 48, 49, i.e. as a part of the MovieFragmentBox. For example, the SampleToGroupBox may indicate which entries of the respective SampleGroupDescriptionBox apply in the movie fragment. In this case, the client rendering module 38 may issue one or more HTTP GET requests to obtain an initial part of a subsegment to obtain the patch visibility metadata.
In some embodiments, the client rendering module 38 may obtain information on available prescriptions of patches 47, 48, 49, wherein the information may include but is not limited to one or more of the following:
metadata characterizing the limit(s) obeyed by the prescription, e.g. the codec level that is obeyed by the prescription;
metadata indicative of the visibility cone provided by the prescription;
information of potential sources (e.g. track groups, tracks, or representations) that are suitable as spatiotemporal units to be included, by reference, into the prescription.
Having received and decoded the information on the available prescription, the client rendering module 38 may use the information for selecting a suitable prescription, e.g. on the basis of one or more of the following:
if the client rendering module’s decoding capacity meets or exceeds that required by the prescription;
if the viewing position and orientation is within the visibility cone provided by the prescription;
In some embodiments, a prescription may be provided within a main Adaptation Set of a
DASH preselection. For example, the main Adaptation Set may contain a Representation that comprises Segments of an extractor track. Partial Adaptation Sets of a DASH preselection may comprise Representations, wherein a Representation may comprise segments of a patch track.
In some embodiments, the client rendering module 38 may select between several versions of equivalent patches 47, 48, 49. The client rendering module's operation may for example comprise one or more of the following steps.
If the patch visibility metadata is indicative of absolute or relative picture quality, the client may obtain such absolute or relative picture quality information. Otherwise, the client rendering module 38 may conclude such absolute or relative picture quality information e.g.
based on bitrate indicated for patches 47, 48, 49 e.g. within DASH MPD.
The client rendering module 38 may select the highest quality or bitrate patches 47, 48, 49 that are determined to be visible, and in some cases (e.g. when the network throughput is not high enough) in the centre of the viewport. Patches 47, 48, 49 that are not visible but which may become visible if the viewing position and/or orientation changes moderately may be selected to be streamed at a lower quality. The client rendering module 38 may determine that patches 47, 48, 49 which are not visible and not expected to become visible within a reasonable time period are not requested to be streamed.
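This three-way policy might be sketched as follows; the visibility predicates, dictionary keys and quality labels are assumptions made for the example rather than any defined interface:

    def select_patch_versions(patches, is_visible, may_become_visible):
        """Illustrative policy: request the highest quality for visible
        patches, a lower quality for patches that may soon become visible,
        and nothing for patches not expected to become visible in time."""
        requests = {}
        for patch in patches:
            if is_visible(patch):
                requests[patch["id"]] = "high"  # or the highest bitrate the throughput allows
            elif may_become_visible(patch):
                requests[patch["id"]] = "low"   # prefetch at reduced quality
            # otherwise: the patch is not requested at all
        return requests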
In some cases, video decoding by the client rendering module 38 may be optimized so that only a partial image block of the texture needs to be decoded.
FIG. 6 illustrates a viewing point 60, a first patch 61 and a second patch 62. A decision on whether or not one or both of the first and second patches 61, 62 are culled may be performed by comparing view vectors 63, 64 (vectors from the viewing point 60 to a patch point on the respective patches) to view cones.
For this purpose, a first point on each of the first and second patches 61, 62 is needed in order to calculate the respective view vectors 63, 64. Typically, patch bounding box corners (comprising eight points) can be used as patch reference points. Each bounding box corner point is tested to determine whether the respective view vector 63, 64 is inside or outside the resulting view cone 61A, 61B. Also, as part of the visibility check, a patch 61, 62 may only be rendered if any of the view vectors are forward-facing. If rear-facing, a patch 61, 62 may not be rendered.
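A rough sketch of such a per-corner test is given below; it assumes unit-length cone axes and treats the cone opening angle as the full apex angle, both of which are assumptions made for the example:

    import math

    def corner_visible(view_pos, corner, cone_axis, cone_angle_deg):
        """Test one bounding-box corner: the view vector from the viewing
        position to the corner is compared against the visibility cone
        anchored at that corner."""
        view_vec = [c - v for c, v in zip(corner, view_pos)]   # viewer -> corner
        norm = math.sqrt(sum(x * x for x in view_vec)) or 1.0
        view_vec = [x / norm for x in view_vec]

        # Forward-facing check: the patch faces the viewer when the view
        # vector points against the cone axis (the averaged surface normal).
        facing = sum(a * b for a, b in zip(view_vec, cone_axis)) < 0.0

        # Cone check: the reversed view vector must lie inside the cone.
        cos_a = sum(-a * b for a, b in zip(view_vec, cone_axis))
        inside = cos_a >= math.cos(math.radians(cone_angle_deg / 2.0))
        return facing and inside

    def patch_culled(view_pos, bbox_corners, cone_axis, cone_angle_deg):
        """Cull the patch only if none of its bounding-box corners pass."""
        return not any(corner_visible(view_pos, c, cone_axis, cone_angle_deg)
                       for c in bbox_corners)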
FIG. 7 illustrates the above-mentioned visibility check. A patch 70 is defined by the black lines describing its surface. Normals 71 are illustrated for the surface. Normals 71 that have the most angular difference are illustrated by a normal cone 73. For example, the uppermost and lowermost normals 71 have the most angular difference and hence the normal cone 73 is derived therefrom. A tight bounding box 74 is illustrated by the rectangle, and normal cones 73 are placed at each bounding box corner for testing visibility. An example viewing position 78 is indicated to the right-hand side. Before testing visibility with the normal cones 73, some early checks can be made. The patch 70 is projected to a first plane 76 of the bounding box 74, which is the near plane in terms of the viewing position 78. The depth values at the first plane 76 will be approximately zero. The projection on a second plane 77 (the far plane in this case) will have depth values, e.g. below 255 if 8-bit video is used. If the viewing position 78 is to the right-hand side of the far plane 77, nothing may be culled because the user may be inside the content. If the viewing position 78 is to the right-hand side of the near plane 76, then nothing may be culled as the forward surface will be seen.
If the user is behind the far plane 77, each bounding box corner is tested with the normal cone 73. In FIG. 7, the patch 70 can be culled if the viewing position is in a space represented by the black area 75, as the patch cannot be seen from these viewing locations. The black area may be determined as follows, using in this case the two normal cones 73 located on the far plane 77 for ease of illustration. Each normal cone 73 is "opened" by 180 (90 + 90) degrees and straight lines drawn to define two culling zones 81, 82, referred to respectively as Acull and Bcull. The intersection of these two zones is the black zone 75, which may be culled.
It follows that the same operation can be performed using the other two normal cones 73 on the corners of the near plane 76, although in this case doing so should have no limiting effect on the black zone 75. Also, the above may in reality be performed in volumetric space rather than in two dimensions; the two-dimensional description is given for ease of illustration.
FIG. 8 is a flow diagram showing processing operations that may be performed by one or more modules shown in FIG. 3 or by individual modules shown in FIG. 3. For example, the processing operations in FIG. 8 may be performed by hardware, software or a combination thereof at a client-end system such as the virtual reality processing apparatus 14. The order of operations is not necessarily indicative of processing order.
An operation 8.1 comprises providing a plurality of patches. Providing may mean receiving or generating the patches.
Another operation 8.2 comprises providing patch visibility information. Providing may be receiving or generating the patch visibility information. As indicated above, the patch visibility information may be based on a determined visibility cone or bounding sphere or other bounding volume. The patch visibility information may be provided as metadata which may be quantised.
Another operation 8.3 comprises providing one or more viewing positions of a client device.
Providing may be receiving or generating the one or more viewing positions. For example, the one or more viewing positions may be based on field-of-view or predicted field-of-view information received from a user device such as one or more of the headsets 22A - 22C shown in FIG. 2. The one or more viewing positions may form a contiguous bounded viewing space. A predicted field-of-view may be determined using known methods, such as based on historical movement data for the user, machine-learning techniques, the nature of the scene, etc.
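As a simple example of the kind of prediction mentioned above (linear extrapolation from the two most recent position samples; real systems may use more elaborate machine-learning or scene-aware models), consider:

    def predict_viewing_position(history, lookahead_s):
        """Linearly extrapolate the latest velocity of the viewing position
        over the look-ahead interval. 'history' is assumed to hold at least
        two (timestamp_seconds, (x, y, z)) samples, most recent last."""
        (t0, p0), (t1, p1) = history[-2], history[-1]
        dt = (t1 - t0) or 1e-6                 # guard against equal timestamps
        velocity = [(b - a) / dt for a, b in zip(p0, p1)]
        return [p + v * lookahead_s for p, v in zip(p1, velocity)]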
Another operation 8.4 comprises processing the patches dependent on whether the patch visibility information indicates that the forward surface of the patch is visible. Said processing may comprise one or more of selectively encoding, transmitting, requesting for transmission, parsing from a container (such as a container file), decoding and rendering.
Further operations may form part of the FIG. 8 processing operations.
FIG. 9 is a schematic diagram of components of either of the virtual reality processing apparatus 14, or the client rendering module 38 of a client end device, such as any of the headsets 22A - 22C and/or the media player 24 shown in FIG. 2. For ease of explanation, we will assume that the components are those in the virtual reality processing apparatus 14, but it will be appreciated that the following is applicable to the client rendering module 38.
The virtual reality processing apparatus 14 may have a processor 100, a memory 104 closely coupled to the processor and comprised of a RAM 102 and ROM 103, and, optionally, hardware keys 106 and a display 108. The server 40 may comprise one or more network interfaces 110 for connection to a network, e.g. a modem which may be wired or wireless.
The processor 100 is connected to each of the other components in order to control operation thereof.
The memory 104 may comprise a non-volatile memory, a hard disk drive (HDD) or a solid state drive (SSD). The ROM 103 of the memory 104 stores, amongst other things, an operating system 112 and may store software applications 114. The RAM 102 of the memory 104 may be used by the processor 100 for the temporary storage of data. The operating system 112 may contain code which, when executed by the processor 100, controls operation of the hardware components of the server 40.
The processor 100 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors and it may comprise processor circuitry.
The virtual reality processing apparatus 14 may be a standalone computer, a server, a console, or a network thereof.
In some embodiments, the virtual reality processing apparatus 14 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The virtual reality processing apparatus 14 may be in communication with the remote server device in order to utilize the software application stored there.
In summary, embodiments provide the following advantageous features. Patch creation may be performed based on surface normal similarity, for example using a normal visibility cone. Patch visibility information, or metadata, may be provided in or along the bitstream containing the coded patches. Patches may be grouped having similar visibility information or metadata, for example determined using visibility cones. One or more texture atlases may be generated, with sub-images being arranged based on the grouping of patches. Encoding may be performed using independently decodable spatiotemporal units, such as motion-constrained tile sets, each enclosing one set of grouped patches. There may be provided an operation of prescribing, e.g. in extractor tracks of ISO/IEC 14496-15, how sets of independently decodable spatiotemporal units are merged in the encoded domain as a decodable bit stream.
In arranging the coded content for streaming and/or decoding, embodiments include selecting, for delivery and/or decoding, texture and geometry patches that are determined, based on the patch visibility information, to be visible under current or potential viewing conditions. On the basis of said selecting, one or more of the following may be performed. For example, delivery and/or decoding of the independently decodable spatiotemporal units may be performed on enclosed groups of patches determined to be visible. For example, independently decodable spatiotemporal units may be merged in the coded domain into a decodable bit stream. The merging may be performed using the prescriptions, such as extractor tracks of ISO/IEC 14496-15. For example, selected patches may be transcoded into a decodable bit stream.
For the avoidance of doubt, references to virtual reality (VR) are also intended to cover related technologies such as augmented reality (AR) and mixed reality (MR).
References to “providing” may include receiving, transmitting and/or generating.
References to “means” may comprise hardware, software, firmware or a combination thereof.
For example, the means may be a computer, computer controller or processor, or microcontroller, which may operate in association with software code.
References to “viewing position” may be extended to a predicted viewing position associated with the client device, i.e. a future viewing position.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and, during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims (18)

Claims
    1. Apparatus comprising:
means for providing a plurality of patches representing part of a volumetric scene;
    means for providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible;
means for providing one or more viewing positions associated with a client device; means for processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
2. The apparatus of claim 1, wherein the means for providing the plurality of patches is configured to estimate surface normals from points of the volumetric scene, and group together points having similar surface normals to provide a given patch.
3. The apparatus of claim 1 or claim 2, wherein the processing means is configured to transmit one or more visible patches to the client device and not to transmit one or more non-visible patches to the client device.
4. The apparatus of claim 1 or claim 2, wherein the processing means is configured to transmit identifiers of one or more visible patches to the client device and not to transmit identifiers of non-visible patches to the client device, the identifiers being usable by the client device to retrieve the patches from a patch server.
5. The apparatus of claim 3, wherein the processing means is configured to generate a texture atlas comprising the plurality of patches, to cull from the texture atlas one or more non-visible patches, and to transmit the texture atlas to the client device for decoding and/or rendering thereat.
6. The apparatus of claim 5, wherein the processing means is configured to generate first and second texture atlases by means of producing, for each patch, a colour image and a depth image, to provide the colour image for a given patch to a respective part of the first texture atlas and the depth image for the given patch to the respective part of the second texture atlas, to cull the colour and depth images corresponding to non-visible patches, and to transmit the culled first and second texture atlases to the client device for decoding and/or rendering thereat.
7. The apparatus of claim 6, wherein the processing means is configured to project each patch to a two-dimensional geometry to provide the colour image and the depth image.
8. The apparatus of any of claims 4 to 7, wherein the visible patches / texture atlases are transmitted to the client device as video frames.
9. The apparatus of any preceding claim, further comprising means for transmitting metadata to the client device, the metadata being indicative of the patch visibility information for patches.
10. The apparatus of any preceding claim, wherein the metadata is further indicative of decoding parameters and/or requirements of the patches.
11. The apparatus of claim 10, wherein the metadata is transmitted to the client device using one or more of:
    - supplemental enhancement information (SEI) messages;
    - a structure of a containing file format; and
    - a manifest or descriptor of media content.
12. The apparatus of claim 10 or claim 11, wherein the metadata is quantized.
13. The apparatus of any of claims 9 to 12, wherein the metadata indicates patch visibility information by means of a normal vector and an angle.
14. The apparatus of any preceding claim, wherein the patch visibility information for a patch is determined by projecting an image of the patch onto a near plane and a far plane using depth information for the patch, and wherein the processing means is configured to identify the patch as visible or non-visible based on which side of the near and far planes the one or more viewing positions are located.
15. The apparatus of any preceding claim, wherein the patch visibility information for a patch is determined by determining a visibility cone derived from a normal cone direction vector and opening angle that includes approximately the minimum cone from which the forward surface of the patch can be seen and increasing the opening angle by a predetermined amount.
16. The apparatus of claim 15, wherein the visibility cone is determined by increasing the opening angle by approximately 180 degrees.
17. The apparatus of any preceding claim, wherein the patch visibility information is determined over a plurality of temporal frames of the volumetric scene.
18. The apparatus of any preceding claim, wherein the apparatus is a server for transmitting patches to one or more client devices.
19. The apparatus of any preceding claim, wherein the means for providing one or more viewing positions further provides one or more predicted viewing positions associated with the client device, and wherein the processing means is configured to process the one or more patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches will be visible from the one or more predicted viewing positions.
20. The apparatus of claim 1, wherein the apparatus is a client device, whereby the means for providing the plurality of patches is configured to receive the patches from a server, whereby the means for providing the patch visibility information is configured to receive the patch visibility information from the, or a different, server, and wherein the processing means is configured to decode and/or render one or more patches dependent on whether the received patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
    21. A method, comprising:
    providing a plurality of patches representing part of a volumetric scene;
providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible;
providing one or more viewing positions associated with a client device; processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
    22. The method of claim 21, wherein providing the plurality of patches comprises estimating surface normals from points of the volumetric scene, and grouping together points having similar surface normals to provide a given patch.
23. The method of claim 21 or claim 22, wherein processing the one or more patches comprises transmitting one or more visible patches to the client device and not transmitting one or more non-visible patches to the client device.
24. The method of claim 21 or claim 22, wherein processing the one or more patches comprises transmitting identifiers of one or more visible patches to the client device and not transmitting identifiers of non-visible patches to the client device, the identifiers being usable by the client device to retrieve the patches from a patch server.
25. The method of claim 23, wherein processing the one or more patches comprises generating a texture atlas comprising the plurality of patches, culling from the texture atlas one or more non-visible patches, and transmitting the texture atlas to the client device for decoding and/or rendering thereat.
26. The method of claim 25, wherein processing the one or more patches comprises generating first and second texture atlases by means of producing, for each patch, a colour image and a depth image, providing the colour image for a given patch to a respective part of the first texture atlas and the depth image for the given patch to the respective part of the second texture atlas, culling the colour and depth images corresponding to non-visible patches, and transmitting the culled first and second texture atlases to the client device for decoding and/or rendering thereat.
27. The method of claim 26, wherein processing the one or more patches comprises projecting each patch to a two-dimensional geometry to provide the colour image and the depth image.
28. The method of any of claims 24 to 27, wherein the visible patches / texture atlases are transmitted to the client device as video frames.
    29. The method of any of claims 21 to 28, further comprising transmitting metadata to the client device, the metadata being indicative of the patch visibility information for patches.
    30. The method of any preceding claim, wherein the metadata is further indicative of decoding parameters and/or requirements of the patches.
31. The method of claim 30, wherein the metadata is transmitted to the client device using one or more of:
    - supplemental enhancement information (SEI) messages;
    - a structure of a containing file format; and
    - a manifest or descriptor of media content.
    32. The method of claim 30 or claim 31, wherein the metadata is quantized.
33. The method of any of claims 29 to 32, wherein the metadata indicates patch visibility information by means of a normal vector and an angle.
34. The method of any of claims 21 to 33, wherein the patch visibility information for a patch is determined by projecting an image of the patch onto a near plane and a far plane using depth information for the patch, and wherein processing the one or more patches comprises identifying the patch as visible or non-visible based on which side of the near and far planes the one or more viewing positions are located.
35. The method of any of claims 21 to 34, wherein the patch visibility information for a patch is determined by determining a visibility cone derived from a normal cone direction vector and opening angle that includes approximately the minimum cone from which the forward surface of the patch can be seen and increasing the opening angle by a predetermined amount.
36. The method of claim 35, wherein the visibility cone is determined by increasing the opening angle by approximately 180 degrees.
    37. The method of any of claims 21 to 36, wherein the patch visibility information is determined over a plurality of temporal frames of the volumetric scene.
38. The method of any of claims 21 to 37, performed at a server for transmitting patches to one or more client devices.
39. The method of any of claims 21 to 38, wherein providing the one or more viewing positions further provides one or more predicted viewing positions associated with the client device, and wherein processing the one or more patches comprises processing the one or more patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches will be visible from the one or more predicted viewing positions.
40. The method of claim 21, performed at a client device, whereby providing the plurality of patches comprises receiving the patches from a server, whereby providing the patch visibility information comprises receiving the patch visibility information from the, or a different, server, and wherein processing the one or more patches comprises decoding and/or rendering one or more patches dependent on whether the received patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
41. A computer program comprising computer-readable instructions which, when executed by at least one processor, cause the at least one processor to perform the method of any of claims 21 to 40.
42. A non-transitory computer-readable medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
    providing a plurality of patches representing part of a volumetric scene; providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible;
providing one or more viewing positions associated with a client device;
    processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
43. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
    to provide a plurality of patches representing part of a volumetric scene;
    to provide, for each patch, patch visibility information indicative of a set of directions
from which a forward surface of the patch is visible;
    to provide one or more viewing positions associated with a client device;
    to process one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions.
GB1806370.1A 2018-04-19 2018-04-19 Processing video patches for three-dimensional content Withdrawn GB2572996A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1806370.1A GB2572996A (en) 2018-04-19 2018-04-19 Processing video patches for three-dimensional content
EP19789175.7A EP3782368A4 (en) 2018-04-19 2019-04-12 Processing video patches for three-dimensional content
PCT/FI2019/050297 WO2019202207A1 (en) 2018-04-19 2019-04-12 Processing video patches for three-dimensional content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1806370.1A GB2572996A (en) 2018-04-19 2018-04-19 Processing video patches for three-dimensional content

Publications (2)

Publication Number Publication Date
GB201806370D0 GB201806370D0 (en) 2018-06-06
GB2572996A true GB2572996A (en) 2019-10-23

Family

ID=62236244

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1806370.1A Withdrawn GB2572996A (en) 2018-04-19 2018-04-19 Processing video patches for three-dimensional content

Country Status (3)

Country Link
EP (1) EP3782368A4 (en)
GB (1) GB2572996A (en)
WO (1) WO2019202207A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3829166A1 (en) * 2019-11-29 2021-06-02 InterDigital CE Patent Holdings A method and apparatus for decoding a 3d video
WO2021122881A1 (en) * 2019-12-19 2021-06-24 Interdigital Vc Holdings France Volumetric video with auxiliary patches
WO2021122983A1 (en) * 2019-12-20 2021-06-24 Interdigital Vc Holdings France A method and apparatus for coding and decoding volumetric video with view-driven specularity
EP3849189A1 (en) * 2020-01-10 2021-07-14 INTEL Corporation Multi-dimensional video transcoding
WO2021191500A1 (en) * 2020-03-26 2021-09-30 Nokia Technologies Oy An apparatus, a method and a computer program for volumetric video
WO2021224056A1 (en) * 2020-05-06 2021-11-11 Interdigital Ce Patent Holdings 3d scene transmission with alpha layers
WO2021249812A1 (en) * 2020-06-09 2021-12-16 Interdigital Ce Patent Holdings, Sas A method and apparatus for encoding and decoding volumetric video
WO2022063664A1 (en) * 2020-09-22 2022-03-31 Interdigital Ce Patent Holdings, Sas A method and apparatus for encoding mpi-based volumetric video
EP4088480A4 (en) * 2020-01-10 2023-12-06 Nokia Technologies Oy Storage of multiple atlases from one v-pcc elementary stream in isobmff
US11974026B2 (en) 2020-03-26 2024-04-30 Nokia Technologies Oy Apparatus, a method and a computer program for volumetric video

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3709659A1 (en) * 2019-03-11 2020-09-16 InterDigital VC Holdings, Inc. A method and apparatus for encoding and decoding volumetric video
US11240532B2 (en) 2020-03-03 2022-02-01 Nokia Technologies Oy Efficient culling of volumetric video atlas bitstreams
WO2021191495A1 (en) * 2020-03-25 2021-09-30 Nokia Technologies Oy A method, an apparatus and a computer program product for video encoding and video decoding
WO2021258324A1 (en) 2020-06-24 2021-12-30 Zte Corporation Volumetric media process methods and apparatus
US20230388542A1 (en) * 2020-10-08 2023-11-30 Interdigital Ce Patent Holdings, Sas A method and apparatus for adapting a volumetric video to client devices
US20220329857A1 (en) * 2021-04-13 2022-10-13 Samsung Electronics Co., Ltd. Mpeg media transport (mmt) signaling of visual volumetric video-based coding (v3c) content
US11659043B1 (en) * 2022-01-27 2023-05-23 Meta Platforms Technologies, Llc Systems and methods for predictively downloading volumetric data
CN116156157B (en) * 2023-04-24 2023-08-18 长沙海信智能系统研究院有限公司 Camera shielding abnormality detection method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002245485A (en) * 2001-02-19 2002-08-30 Nissan Motor Co Ltd Device and method for figure top/reverse setting
US20080094391A1 (en) * 2004-09-22 2008-04-24 Konami Digital Entertainment Co., Ltd. Image Processor, Image Processing Method, Information Recording Medium, and Program
US20120147005A1 (en) * 2010-12-14 2012-06-14 Via Technologies, Inc. Pre-Culling Processing Method, System and Computer Readable Medium for Hidden Surface Removal of Image Objects

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5357600A (en) * 1992-10-15 1994-10-18 Sun Microsystems, Inc. Method and apparatus for the rendering of curved surfaces using a cone of normals
US6583787B1 (en) * 2000-02-28 2003-06-24 Mitsubishi Electric Research Laboratories, Inc. Rendering pipeline for surface elements
US20150199383A1 (en) * 2014-01-16 2015-07-16 Nokia Corporation Systems and Methods for Indexing and Retrieving Images
KR20160051155A (en) * 2014-10-31 2016-05-11 삼성전자주식회사 Apparatus and method for rendering
US9412034B1 (en) * 2015-01-29 2016-08-09 Qualcomm Incorporated Occlusion handling for computer vision
FI20165114A (en) * 2016-02-17 2017-08-18 Nokia Technologies Oy Hardware, method and computer program for video encoding and decoding
GB2547689A (en) * 2016-02-26 2017-08-30 Nokia Technologies Oy A multi-camera device and a calibration method
EP3429210A1 (en) * 2017-07-13 2019-01-16 Thomson Licensing Methods, devices and stream for encoding and decoding volumetric video

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002245485A (en) * 2001-02-19 2002-08-30 Nissan Motor Co Ltd Device and method for figure top/reverse setting
US20080094391A1 (en) * 2004-09-22 2008-04-24 Konami Digital Entertainment Co., Ltd. Image Processor, Image Processing Method, Information Recording Medium, and Program
US20120147005A1 (en) * 2010-12-14 2012-06-14 Via Technologies, Inc. Pre-Culling Processing Method, System and Computer Readable Medium for Hidden Surface Removal of Image Objects

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3829166A1 (en) * 2019-11-29 2021-06-02 InterDigital CE Patent Holdings A method and apparatus for decoding a 3d video
WO2021105142A1 (en) * 2019-11-29 2021-06-03 Interdigital Ce Patent Holdings A method and apparatus for decoding a 3d video
US11910054B2 (en) 2019-11-29 2024-02-20 Interdigital Ce Patent Holdings Method and apparatus for decoding a 3D video
WO2021122881A1 (en) * 2019-12-19 2021-06-24 Interdigital Vc Holdings France Volumetric video with auxiliary patches
WO2021122983A1 (en) * 2019-12-20 2021-06-24 Interdigital Vc Holdings France A method and apparatus for coding and decoding volumetric video with view-driven specularity
EP3849189A1 (en) * 2020-01-10 2021-07-14 INTEL Corporation Multi-dimensional video transcoding
EP4088480A4 (en) * 2020-01-10 2023-12-06 Nokia Technologies Oy Storage of multiple atlases from one v-pcc elementary stream in isobmff
WO2021191500A1 (en) * 2020-03-26 2021-09-30 Nokia Technologies Oy An apparatus, a method and a computer program for volumetric video
US11974026B2 (en) 2020-03-26 2024-04-30 Nokia Technologies Oy Apparatus, a method and a computer program for volumetric video
WO2021224056A1 (en) * 2020-05-06 2021-11-11 Interdigital Ce Patent Holdings 3d scene transmission with alpha layers
WO2021249812A1 (en) * 2020-06-09 2021-12-16 Interdigital Ce Patent Holdings, Sas A method and apparatus for encoding and decoding volumetric video
WO2022063664A1 (en) * 2020-09-22 2022-03-31 Interdigital Ce Patent Holdings, Sas A method and apparatus for encoding mpi-based volumetric video

Also Published As

Publication number Publication date
EP3782368A1 (en) 2021-02-24
WO2019202207A1 (en) 2019-10-24
EP3782368A4 (en) 2021-12-08
GB201806370D0 (en) 2018-06-06

Similar Documents

Publication Publication Date Title
WO2019202207A1 (en) Processing video patches for three-dimensional content
KR102246002B1 (en) Method, device, and computer program to improve streaming of virtual reality media content
CN109565610B (en) Method, apparatus and storage medium for processing omnidirectional video
US11653065B2 (en) Content based stream splitting of video data
WO2017202699A1 (en) Method, device, and computer program for adaptive streaming of virtual reality media content
US11539983B2 (en) Virtual reality video transmission method, client device and server
CN110663067B (en) Method and system for generating virtualized projections of customized views of real world scenes for inclusion in virtual reality media content
WO2019229293A1 (en) An apparatus, a method and a computer program for volumetric video
US20230319328A1 (en) Reference of neural network model for adaptation of 2d video for streaming to heterogeneous client end-points
US20230026014A1 (en) Video processing device and manifest file for video streaming
JP2024512629A (en) Lightfield/Holographic Media Asset Reusability
US20220182597A1 (en) Adaptation of 2d video for streaming to heterogenous client end-points
US11943271B2 (en) Reference of neural network model by immersive media for adaptation of media for streaming to heterogenous client end-points
WO2023284487A1 (en) Volumetric media data processing method and apparatus, device, and storage medium
US20230034937A1 (en) Media file encapsulating method, media file decapsulating method, and related devices
WO2023169003A1 (en) Point cloud media decoding method and apparatus and point cloud media coding method and apparatus
CN116848840A (en) Multi-view video streaming
KR20230111260A (en) Bi-directional presentation data stream
KR20230118181A (en) Bidirectional presentation data stream using control and data plane channels

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)