WO2021260266A1

WO2021260266A1 - A method, an apparatus and a computer program product for volumetric video coding

Info

Publication number: WO2021260266A1
Application number: PCT/FI2021/050463
Authority: WO
Inventors: Jaakko Olli Taavetti KERÄNEN; Vinod Kumar Malamal Vadakital; Lauri Aleksi ILOLA; Kimmo Tapio Roimela
Original assignee: Nokia Technologies Oy
Priority date: 2020-06-23
Filing date: 2021-06-17
Publication date: 2021-12-30

Abstract

The embodiments relate to method for encoding/decoding, and a technical equipment for the same. The method for encoding comprises receiving a video presentation frame (710), where the video presentation represents a three-dimensional data; generating one or more patches from the video presentation frame (720), wherein the patches represent contours of an object; generating metadata to be associated with said one or more patches (730), wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; and encoding the generated metadata in or along a bitstream of a corresponding patch (740).

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VOLUMETRIC VIDEO CODING

Technical Field

The present solution generally relates to volumetric video encoding and decoding. Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three- dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel. Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method for encoding, the method comprising:

- receiving a video presentation frame, where the video presentation represents a three-dimensional data;

- generating one or more patches from the video presentation frame, wherein the patches represent contours of an object;

- generating metadata to be associated with said one or more patches, wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; and

- encoding the generated metadata in or along a bitstream of a corresponding patch.

According to a second aspect, there is provided a method for decoding, the method comprising

- receiving an encoded bitstream;

- decoding from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video;

- decoding from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; and

- synthesizing a novel view according to the one or more patches by using information obtained from the decoded metadata.

According to a third aspect, there is provided an apparatus for encoding, comprising: - means for receiving a video presentation frame, where the video presentation represents a three-dimensional data;

- means for generating one or more patches from the video presentation frame, wherein the patches represent contours of an object;

- means for generating metadata to be associated with said one or more patches, wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; and

- means for encoding the generated metadata in or along a bitstream of a corresponding patch.

According to a fourth aspect, there is provided an apparatus for decoding comprising

- means for receiving an encoded bitstream;

- means for decoding from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video;

- means for decoding from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; and

- means for synthesizing a novel view according to the one or more patches by using information obtained from the decoded metadata.

According to a fifth aspect, there is provided an apparatus for encoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a video presentation frame, where the video presentation represents a three-dimensional data;

- generate one or more patches from the video presentation frame, wherein the patches represent contours of an object;

- generate metadata to be associated with said one or more patches, wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; and

- encode the generated metadata in or along a bitstream of a corresponding patch.

According to a sixth aspect, there is provided an apparatus for decoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: - receive an encoded bitstream;

- decode from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video;

- decode from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; and

- synthesize a novel view according to the one or more patches by using information obtained from the decoded metadata.

According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to

According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to

- receive an encoded bitstream;

- synthesize a video presentation frame according to the one or more patches by using information obtained from the decoded metadata.

According to an embodiment, the metadata comprises a matrix of level-of-detail values. According to an embodiment, the metadata comprises information on motion vector offsets.

According to an embodiment, the metadata comprises residual motion vectors being determined based on motion vectors extracted from compressed two-dimensional video.

According to an embodiment, the metadata comprises information on a maximum depth threshold value defining how far apart in the depth dimension vertices of a triangle are allowed to be.

According to an embodiment, the metadata comprises information on at least one optional depth slicing value, being a value between the minimum and maximum depth values of the patch, falling between distinct depth layers of the patch.

According to an embodiment, the metadata comprises information on per-patch factor values being applied to adjusted vertices on a near layer and a far layer.

According to an embodiment, the metadata comprises values for a surface estimation function, describing rough geometry of a surface layer inside the patch.

According to an embodiment, the metadata comprises a matrix of flags, wherein the flags indicate a presence of depth contours, and wherein the matrix covers an area of a patch.

According to an embodiment, the metadata comprises one or more depth offsets that are applied to mesh vertices before beginning ray casting.

According to an embodiment, the computer program product is embodied on a non- transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a volumetric video compression process; Fig. 2 shows an example of a volumetric video decompression process;

Fig. 3 shows an example of a matrix of LOD values, and an example of a method of applying the LOD values to a triangle mesh, interpreting each LOD value as a subdivision count for mesh triangles;

Fig. 4 shows an example of triangles of a mesh affected by contours inside a patch;

Fig. 5 shows an example of depth curbing on undefined vertices;

Fig 6 shows an example of a mesh offset for raycasting;

Fig. 7A, 7B are flowcharts illustrating methods according to an embodiment; and

Fig. 8 shows an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments will be described in the context of volumetric video encoding and decoding, where dynamic three-dimensional (3D) objects or scenes are encoded into video streams. The encoded video streams are delivered for decoding, and the decoded video streams are provided playback.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un-compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).

Volumetric video refers to a visual content that may have been captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two- dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in three- dimensional space) and respective attributes (e.g. color, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Volumetric video is either generated from three-dimensional (3D) models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprises triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. volumetric video frame.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices has enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth map as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi level surface maps.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes, voxel suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D space is an ill-defined problem, as both the geometry and respective attributes may change. For example, temporal successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes may be inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which may be then encoded using standard 2D video compression techniques. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency may be increased greatly. Using geometry-projections instead of prior-art 2D-video based approaches, i.e. multiview and depth, provide a better coverage of the scene (or object). Thus, 6DOF capabilities may be improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/de compression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.

The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.

Patch info determined at patch generation 102 for the input point cloud frame 301 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The used simple packing strategy iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.

The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images, respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, y) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth DO. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [DO, D0+AJ, where A is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit,

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.

The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted, respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.

The padding process 107, for which the present embodiment are related, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.

The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the H.265 video codec according to the video codec configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding box, 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane o Index 0 for the planes (1 .0, 0.0, 0.0) and (-1 .0, 0.0, 0.0) o Index 1 for the planes (0.0, 1 .0, 0.0) and (0.0, -1 .0, 0.0) o Index 2 for the planes (0.0, 0.0, 1 .0) and (0.0, 0.0, -1 .0)

- 2D bounding box (uO, vO, ul, vl) - 3D location (xO, yO, zO) of the patch represented in terms of depth 50, tangential shift sO and bitangential shift rO. According to the chosen projection planes, (dq, sO, rO) may be calculated as follows: o Index o Index o Index

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let / be index of the patch, which the current TxT block belongs to, and let J be the position of / in L. Instead of explicitly coding the index /, its position J is arithmetically encoded instead, which leads to better compression efficiency.

An example of such patch auxiliary information is atlas data defined in ISO/IEC 23090-5.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 110 leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of a BOxBO blocks. B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1 . In practice B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map. The compression process may comprise one or more of the following example operations:

• Binary values may be associated with BOxBO sub-blocks belonging to the same TxT block. A value 1 associated with a sub-block, if it contains at least a non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1 ). The block is said to be full. Otherwise, the block is said to be non-full.

• A binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, an extra information indicating the location of the full/empty sub-blocks may be encoded as follows: o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from top right or top left corner o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream. o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.

^■ The binary value of the initial sub-block is encoded.

^■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

^■ The number of detected runs is encoded.

^■ The length of each run, except of the last one, is also encoded.

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits compressed occupancy map to occupancy map decompression 203. It may also transmit a compressed auxiliary patch information to auxiliary patch-info compression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images. The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by levering the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (SO, sO, rO) be the 3D location of the patch to which it belongs and (uO, vO, ul, vl) its 2D bounding box. P can be expressed in terms of depth S(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

S(u, v) = SO + g(u, v) s(u, v) = sO - uO + u r(u, v) = rO - vO + v where g(u, v) is the luma component of the geometry image.

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.

Visual volumetric video-based Coding (V3C, sometimes also 3VC) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 is expected to be renamed to V3C PCC, ISO/IEC 23090-12 renamed to V3C MIV.

In V3C, the 3D scene is segmented into a number of regions according to heuristics based on, for example, spatial proximity and/or similarity of the data in the region. The segmented regions are projected into 2D patches, where each patch contains at least a depth channel. The depth channel contains information based on which the 3D position of the surface pixels can be determined. The patches are further packed into an atlas that can be streamed as a regular 2D video.

While patches generally represent continuous surfaces, they also may contain two types of contours that represent discontinuities. An occupancy contour exists along the edges of a patch, and also along any undefined (i.e. blank) regions inside a patch. A depth contour exists where the depth values of adjacent patch pixels have a relatively large delta, for example, due to a foreground object appearing in front of a distant background.

When synthesizing views for an end-user, the view synthesis requires knowledge of a viewing device and viewer’s virtual 3D position inside the scene. The synthesizer translates the provided 2D patches into 3D space, thus compositing out of them a subset of the full scene for visualization.

At least two approaches for real-time view synthesis exist: 1 ) point clouds; and 2) triangle meshes:

• In point cloud rendering, each pixel of a patch becomes an independent point that gets projected and rendered in 3D space. It may be necessary to take special measures to ensure that the points form connected 3D surfaces - at a minimum, points must be scaled larger as the viewer gets closer to them. Rendering millions of individual points may be challenging for GPUs (Graphics Processing Unit) as this is a use case that bottlenecks the vertex processing part of the graphics pipeline. GPUs may also provide general- purpose compute functionality that enables writing the points manually into an image buffer.

• Triangle mesh rendering uses the traditional triangle rasterization pipeline that GPUs have been primarily designed for, which still remains the primary way that real-time computer graphics are rendered. GPUs are able to render millions of triangles efficiently. The downside is that geometric detail may be lost if the number of triangles in the mesh is too small.

A view synthesizer may additionally employ ray casting as a rendering technique. In this case, 3D rays are cast from the eye through each pixel of the output frame, and their intersection is calculated with the 3D surfaces of the scene. Modern GPUs at the time of this specification such as those in Nvidia’s RTX series are starting to provide hardware acceleration for this type of rendering, but it remains slower than triangle-based rasterization - its strengths lie in accurate modelling of light according to the laws of physics.

GPUs support instanced drawing, where a set of vertex data is copied to GPU memory, where it remains unmodified for longer periods of time. The same vertex data may then be used multiple times for drawing separate scene objects. The benefit is that the GPU can do more work independently without having to receive new draw commands and/or vertex data from the CPU (Central Processing Unit), which makes rendering more efficient.

V3C metadata can be carried in vpcc_unit, which comprises header and payload pairs. The unit header identifies the type of payload, whereas the payloads carry both video bitstreams and metadata bitstreams depending on the type of payload. An example of the syntax for vpcc_unit and vpcc_unit_header and vpcc_unit_payload structures is presented in below:

V3C metadata may be contained in atlas_sub_bitstream() which may contain a sequence of NAL units including header and payload data. nal_unit_header() is used to define how to process the payload data. NumByteslnNalllnit specifies the size of the NAL unit in bytes. This value is required for decoding of the NAL unit. Some form of demarcation of NAL unit boundaries is necessary to enable inference of NumByteslnNalUnit. One such demarcation method is specified in Annex C (23090-5) for the sample stream format. V3C atlas coding layer (ACL) is specified to efficiently represent the content of the patch data. The NAL is specified to format that data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet- oriented transport and sample streams is identical except that in the sample stream format specified in Annex C (23090-5), each NAL unit can by preceded by an additional element that specifies the size of the NAL unit.

In the nal_unit_header() syntax, nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 7.3 of 23090-5. naljayerjd specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of naljayerjd shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of the current version of 23090-5 shall ignore (i.e. remove from the bitstream and discard) all NAL units with values of naljayerjd not equal to 0. rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:

The RBSP contains a string of data bits (SODB) as follows:

- If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.

- Otherwise, the RBSP contains the SODB as follows:

• The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.

• The rbsp railing_bits( ) syntax structure is present after the SODB as follows: o The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any) o The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit). o When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e. instances of rbsp_alignment_zero_bit) are present to result in byte alignment. • One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.

Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example typical content:

- atlas_sequence_parameter_set_rbsp( ) is used to carry parameters related to a sequence of V3C frames.

- atlas_frame_parameter_set_rbsp( ) is which is used to carry parameters related to a specific frame. Can be applied fora sequence of frames as well.

- sei_rbsp( ) is used to carry SEI messages in NAL units.

- atlas_tile_group_layer_rbsp( ) is used to carry patch layout information for tile groups.

When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1 , and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP. atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of frame. An example of an atlas tile group layer RBSP syntax is presented below:

Each tile group may contain several patches for which the metadata syntax is described below:

Annex F of V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processed related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signalled in sei_rspb(), an example of which is given below:

Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.

Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in V3C V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted.

Essential SEI messages are an integral part of the V-PCC bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:

Type-A essential SEI messages: Type-A SEI messages contain information required to check bitstream conformance and for output timing decoder conformance. A V-PCC decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.

Type-B essential SEI messages: V-PCC decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.

VR and AR necessitate rendering the view at 60-90 Flz, even if the volumetric video itself is encoded at a lower frame rate. Otherwise, view synthesis cannot adequately respond to viewer’s movements (along 6DOF) causing well-known problems such as motion sickness and virtual objects not convincingly blending into the real world.

View synthesis of volumetric video is computationally expensive, especially when large and/or complex scenes are decoded and rendered with a large number of patches. Devices like high-end gaming PCs (Personal Computers) can handle view synthesis of such scenes. Flowever, devices with mobile-class GPUs (such as standalone VR head-mounted devices) have a limited amount of capability for processing graphics, which necessitates finding ways to synthesize views more efficiently. The need to decode one or more video streams in 4K resolution places further restrictions on how complex scenes and view synthesis can be.

Mobile devices rely on battery power, and therefore simple and efficient solutions are preferred as they can run longer and with less thermal throttling of the CPU and/or GPU. This remains true even as mobile processing units become more powerful and efficient in the future.

The 2D patches and their associated metadata contain sufficient information for synthesizing views, however one’s chosen rendering method may benefit from additional information that is not efficient to compute after decoding or during view synthesis. Particularly, when using triangle meshes to render the 2D patches, there is no information provided how such meshes should be structured and how many triangles should be used.

When it comes to mesh-based rendering, a drawback is that patches contain fine per-pixel details (particularly occupancy and depth contours) that are not accurately represented by a relatively low-resolution triangle mesh. Furthermore, a real-time renderer has limited time to determine how to adapt the triangle mesh to the contours. This leads to visual artefacts such as blocky contours or small objects that are completely missing.

The aim of the present embodiments is to provide metadata for 2D-video based volumetric video compression such as MIV. Thus the present embodiments propose a method for signalling metadata in such a manner that enables a real-time immersive video view synthesizer to operate more efficiently and with better visual fidelity. The present embodiments provide one or more LOD (Level Of Detail) values that influence how many vertices the view synthesizer should allocate for a certain region of a patch. These LOD values can be determined by the encoder according to the contents of the patch and the type of changes that occur in the patch over a GOP (Group Of Pictures). The present embodiments provide a) per-patch flags that determine which types of contours are present in a patch; b) per-patch depth thresholds and/or depth slicing parameters; c) per-patch simplified mathematical model for generating vertices for undefined patch pixels; d) per-patch contour vertex depth shifting parameters. e) per-patch motion parameters for shifting vertices along the surface. f) depth offset value(s) that are applied to the mesh to enable more efficient ray casting from the mesh to the actual patch surface.

Mesh-based rendering of patch-based volumetric video comprises rendering each patch as a rectangular triangle mesh. This can be made more efficient by preloading a small number of meshes to GPU memory and then using them repeatedly for all the patches via GPU geometry instancing. The Tenderer needs to ensure that that the overall level of detail remains consistent for all patches, for example by using the finer meshes (with a larger number of vertices/triangles) for the larger patches and the coarse meshes for smaller patches.

The following sections discuss specific features of the embodiments. The present embodiments can be applied separately or in any combination to improve the efficiency and visual quality of mesh-based rendering.

First embodiment: per-patch level-of-detail values:

Without information about the type of geometry that exists in a (depth) patch, the Tenderer cannot adapt the mesh representation accordingly. This causes either important details to be lost due to sparsity of vertices (not sampling enough depth pixels), or overuse of vertices where the geometry has few details (e.g. a flat floor).

Figure 3 illustrates an example of a matrix of LOD values 301 and an example on how the LOD values can be applied to a triangle mesh 302. In this example, the LOD value is directly interpreted as the level of subdivision. However, the LOD values can also be used as weights for controlling the number of triangles per region.

The encoder is configured to encode metadata in or along a bitstream, wherein the metadata comprises a matrix of LOD values. An example of a matrix of LOD values 301 partitioning the area of a patch is shown in Figure 3. The dimensions of the matrix can also be included in the metadata, or they can be inferred from the patch sizes (in pixels, or degrees in view space), for example:

• 1 LOD value covering the entire patch (e.g. a small patch) • 4 LOD values = 2 x 2 matrix

• 16 LOD values = 4 x 4 matrix

Each LOD value in the matrix is a small integer, for example represented with 3 bits for eight discrete values. The LOD matrices may be stored in patch_data_unit() and flagged behind a gate, which makes them optional to use. As an example, pdu_lod_matrix_present_flag may be used to signal that patch contains LOD values and pdu_lod_matrix_size may be used to signal the size of the LOD matrix. Alternatively, just pdu_lod_matrix_size may be used and the presence of LOD matrix per patch is derived by checking if pdu_lod_matrix_size is greater than zero. Instead of storing LOD matrices in patch_data_unit(), a more flexible model of defining a new SEI message may be used. This allows new storage possibilities also from file format encapsulation point of view (23090-10). As an example, patch_lod_matrix_information() may be used to indicate mapping of LOD matrices to patch indices for each tile group.

The Tenderer may utilize the LOD values to determine how many triangles to allocate to each region of a mesh. For example, the LOD values can be used as weights to influence the amount of triangle subdivision. This allows the Tenderer to retain control over the total amount of vertices being used to render a frame. Utilization of LOD values may be a dynamic/GPU-side operation where a low-resolution mesh is tessellated based on the provided values, or the Tenderer may select the most suitable mesh from a set of pre-generated static meshes. (In practice, dynamic tessellation may not be available when using mobile-class GPUs).

The metadata can be generated by the volumetric video encoder once the patches of a frame/GOP are available for analysis. Alternatively, an additional component (“edge server in the cloud”) intermediating transmission to the viewer can decode the patches, perform the analysis and generate the metadata. In this configuration, the metadata generation can adapt to the properties of the viewing device.

In addition to looking for fine geometric details (high-frequency changes in the depth values), high LOD values can be chosen for patches that contain moving objects. Representing moving objects with more vertices reduces artefacts caused by an object “moving under” a static mesh, leading to the vertices visibly shifting positions as they sample different parts of the moving object’s surface. Once the patches have been analyzed, the final LOD values are normalized to the available value range in the metadata, so that the areas with the most detail / most motion get the highest LOD values.

Second embodiment: Per-patch motion residual:

Surfaces projected onto patches may move during a GOP, while the triangle mesh representing the patch in a synthesized frame remains fixed to the patch bounds. This causes a visual artefact where object are “moving under” the mesh, leading to mesh vertices visibly shifting positions as they sample different parts of the moving object’s surface. To minimize this artefact, mesh vertices should move together with the surface. Because of the characteristics of the video codec, information about motion within 2D video frames is available. Once the 2D video frame motion vectors have been extracted, they can be converted to equivalent 3D motion vectors according to the projection used in each patch. Due to the noise inherent in coded video motion vectors, filtering such as spatial median and/or temporal smoothing may be applied to the motion vectors. This gives an offset that can be applied to the vertices of the mesh. However, the Tenderer must ensure that the offset is not greater than the distance between mesh vertices, and that the vertices are clamped within the bounds of the patch.

The encoder is configured to encode in or along a bitstream the following information as metadata per-patch: a flag for enabling motion vector offsets for mesh vertices; and/or a number value (e.g., normalized floating-point 0...1 ) for controlling how strongly the motion vector offsets influence the mesh; and/or residual motion vectors in a 2D matrix. Motion vectors extracted from a compressed 2D video may not accurately describe the motion of surfaces. To account for this, residual motion vectors are computed. These can be generated by the encoder or an intermediary by

1 ) determining motion vectors from each frame,

2) applying them to the mesh,

3) determining the amount of vertex depth volatility remaining for patch mesh vertices,

4) selecting additional vectors to further reduce the volatility. These additional vectors are provided as metadata. Third embodiment: Per-patch depth thresholds for triangle edges:

Rendering depth contours between foreground and background objects presents a problem when using triangle meshes. Figure 4 illustrates an example of how triangles of a mesh can be allocated for a patch. If the regions covered by the triangles 401 are rendered like any other region, it leads to malformed geometry as the vertices of a triangle 401 may be placed too far apart from each other - some on the near side of the contour and some on the far side. This also produces a significant visual artefact as the actual contour is replaced by triangle edges. To solve this problem without additional metadata, the Tenderer needs to sample the neighborhood of each vertex of a triangle 401 to see if a depth contours is present within the triangle. This leads to inefficiency due to unnecessary resampling of vertices and oversampling of their neighborhood. Also, the Tenderer may not have time to sample the patch accurately enough, forcing reliance on approximations that may reduce visual quality.

The encoder is configured to encode in or along the bitstream the following information as metadata. Per patch, the information used as metadata comprises a maximum depth threshold value that determines how far apart the vertices of a triangle are allowed to be in the “depth” dimension. This information may be embedded in patch_data_unit() as specified in 23090-5. Conditionally, it may be stored behind a flag indicating the presence of such information. As in the previous embodiment, a new SEI message may be defined to provide desired mapping of patches and depth thresholds.

The depth threshold value may be represented in different ways (as a single floating point or fixed-point value):

- maximum delta of normalized disparity values (easy to compare against values stored in the depth atlas)

- maximum delta of distances from origin (after projecting the vertices to 3D space)

- maximum ratio of near/far depths (allowing larger differences when the surface is far away)

The metadata generator may choose the most appropriate threshold representation depending on the contents of the patch. During rendering, if one of the vertices is detected to be an outlier in the depth dimension with regard to this threshold, the triangle is rendered twice: first so that the near depth layer is rendered, and a second time so that the far depth layer is rendered. In the fragment shading stage, pixels are discarded if they belong to the wrong depth layer. The result is that the depth contour is rendered accurately, with both near and far layers present, with a per-pixel contour separating them.

Fourth embodiment: Per-patch depth slicing parameters:

The fourth embodiment is addressed to depth contours (see triangles 401 (around the perimeter of the sphere) in Figure 4).

The encoder is configured to encode in or along a bitstream the following information as metadata. Per-patch, the information comprises at least one optional depth slicing value. The depth slicing value is a depth value between the minimum and maximum depth values of the patch. The slicing value may be represented in the same format as the depth pixels themselves (making it straightforward to apply it during rendering), or it can be a normalized value with a reduced number of bits. In practice, the slice value can be a normalized disparity (1.0/depth), because that provides more detail for the low end of the depth range (i.e., regions close to the predetermined end-user viewing volume).

More than one slicing value may be provided in case the patch has several distinct depth layers. Depth slicing values may also be composed into a matrix so that they specify the slicing for specific regions of a large patch. For example, if each depth slice is represented by 8 bits, one 16-bit unsigned integer can specify three slices: near-to-A, A-to-B, and B-to-far (with A and B being 8-bit normalized values). Another option is 1 +7+8 bits, where

MSB is set: one 7-bit value and one 8-bit value; selecting the most accurate representation of the slice values based on the available bits.

MSB is not set: one 15-bit value for a single slice.

This allows having a matrix where some regions have more slices than others without having to allocate a varying number of bits for the elements. Such a scheme is advantageous for rendering, as it can be copied to GPU memory without modifications and sampled directly using shaders. The same principle can be applied for a different number of bits. It is appreciated that since depth layers may be separated by a relatively long distances (being clearly distinct surface layers), slice values do not need to be very accurate - it is enough that they fall somewhere within the empty space between the layers.

The depth slicing parameters may be stored in patch_data_unit() and gated behind a flag signalling the presence of such information. As an example, pdu_depth_slicing_parameters_present_flag may be used to indicate the presence of depth slicing parameters. Additionally, pdu_depth_slicing_type indicator may be used to signal, which kind of depth slicing parameters should be used for given patch. This informs the rendered, if it should expect one or more depth slicing values or if the depth slicing values are signalled as matrices. Alternatively, a new SEI message may be considered for storing depth slicing information per patch.

When rendering a (region of a) patch that has depth slicing value(s), the affected triangles are rendered multiple times. First, the nearest depth range is rendered, while adjusting the triangle vertices so that they adhere to the sliced depth range. In the fragment shading stage, the depth values of individual pixels are compared against the current depth slice range, so the contour can be masked accurately. Then, the next depth range is rendered in similar fashion, continuing until all the slices are rendered. Compared to depth thresholds, this approach is more efficient as there is less GPU-side conditional behavior based on vertex depths.

Fifth embodiment: Depth contour curbing parameters:

This embodiment is addressed to depth contours, an example of which is shown in Figure 5 illustrating depth curbing on undefined vertices 501 . A and B represent the same triangle that is being rendered at two different depth layers. After determining a depth for the undefined vertices on depth layer A, the curbing factor is applied to move them farther away to account for the scene object having a curved contour. Similarly, undefined vertices 501 on a depth layer B are moved forward by applying another curbing factor.

When a triangle is found to have a depth contour and it is rendered as multiple layers, the Tenderer still needs to select plausible depth values for the vertices of the triangle that get adjusted to fit in the current depth slice. If these depth values are selected poorly, the contour may appear to have the wrong shape. In this embodiment, it is assumed that the Tenderer selects the depth values based on the other vertices 502 of the triangle (the ones that are already inside the correct depth slice).

The encoder is configured to encode in or along a bitstream the following information as metadata. The information comprises a per-patch factor value that is applied to adjusted vertices on the near layer, and another per-patch factor value that is applied to adjusted vertices on the far layer. The effect is realized from Figure 5. The result is that the surface curvature along the contour is better represented by the triangles.

The per-patch depth contour curbing parameters may be store inside patch_data_unit() and gated with a flag which indicates the presence of such values. As an example, pdu_depth_contour_parameters_present_flag may be used for such signalling. Behind the flag, near and far factors for depth contours may be signalled. Alternatively, these may be signaled with a new SEI message.

Sixth embodiment: Per-patch surface estimation function:

Occupancy contours can be considered as one of the drawbacks for a mesh-based rendered (see triangles 402 in Figure 4). In this situation, some vertices of a triangle have an undefined depth, so the Tenderer has to choose plausible depth values for them based on other available information. For example, it can look at the other vertices of the triangle, but that may result in mistakes especially when it comes to slanted surfaces. The problem can be alleviated by sampling additional pixels inside the patch, and extrapolating based on them a plausible position for missing vertices (for instance by estimating the local surface normal). Flowever, this may take too much time to do on-the-fly during rendering.

The encoder is configured to encode in or along a bitstream the following information as metadata. The information comprises a small number of parameters for a 2D function that determines the surface depth at a given XY coordinate inside the patch. For example:

- 3D plane: 4 floating-point values, comprising a 3D normal vector and a Z offset; - Pyramid (plane and central offset): 5 floating-point values, comprising a 3D normal vector and a Z offset, plus an additional Z offset for the center of the patch;

- Sphere: 4 floating-point values, comprising the sphere origin and a radius - can be used for approximating curved surfaces;

- 3D spline for more free-form surfaces.

The type of estimation function may be provided as an additional metadata value. The surface estimation function, and its parameters may be signalled inside patch_data_unit() and gated behind a flag signalling the existence of such values. As an example, pdu_surface_estimation_function_present_flag may be used to provide per patch signalling information for existence of such surface estimation functions. Additionally, pdu_surface_estimation_function_type may be used to signal which type of surface estimation function should be used and how surface estimation parameters should be interpreted. The parameters for the surface estimation function may be stored as describe above. Alternatively, a new SEI message may be defined to signal information related to surface estimation functions and parameters per patch.

Generating the metadata for the estimation can be computationally expensive, as it involves fitting various models onto the patch contents to find the best approximation. The objective is to use as few parameters as possible to provide a plausible estimation of the depth values inside the patch. The Tenderer can then use this estimation to efficiently generate depth values for any undefined vertices. Due to the continuity of the estimation outside the defined surface, the undefined vertices will have positions that retain the correct surface positioning for the defined parts of the triangle.

It is appreciated that the surface estimation function is only needed for patches that contain occupancy contours, so the presence of this metadata can also be used by the Tenderer to infer that there are occupancy contours in the patch.

Seventh embodiment: Per-patch depth contour flags:

Without predetermined information about what kind of contours exist in a patch, the Tenderer needs to detect them on-the-fly when sampling patch contents. This makes it less efficient to render each patch. The seventh embodiment provides metadata to signal the presence of depth contours more locally.

The encoder is configured to encode in or along a bitstream the following information as metadata. The information comprises a matrix of flags covering the area of a patch. The flags describe the contents of the patch, whether depth contours are present or not. The dimensions of the matrix can also be included in the metadata, or they can be inferred from the patch sizes (in pixels or degrees):

• 1 bit = 1 flag (covering entire patch, e.g., if the patch is small)

• 1 byte = 8 flags (e.g., 2x4 matrix; many patches are not square)

• 2 bytes = 16 flags (e.g., 4x4 matrix)

• 8 bytes = 64 flags (e.g., 8x8 matrix)

• 32 bytes = 256 flags (e.g., 16x16 matrix)

Similarly to storing per patch LOD matrices, per patch depth contour flags may be stored inside patch_data_unit(). The same mechanism for storing either flag or size information for the depth contour matrix to signal the presence of such information may be utilized.

When rendering, these flags are used for splitting the patch into multiple smaller meshes. Only the parts that have a depth contour need complex geometry and complex shading programs; the rest have a simple continuous surface layer. Each part of the subdivided mesh can also be individually culled versus the viewing frustum, making rendering more efficient. This subdivision and culling are especially useful for larger patches, for example ones that cover 180 degrees via equirectangular projection.

Eighth embodiment: Depth offset for ray casting:

Triangle meshes are also useful as part of a more advanced rendering technique. For instance, the surface can be rendered as a mesh, but during rasterization, the fragment shader uses ray casting to find the accurate intersection point for each output pixel. Figure 6 illustrates an example of a mesh offset for raycasting. The original mesh is translated toward the source of the view rays, so that ray casting can be used to find the accurate surface of the scene object. This enables all details of the surface geometry to be fully represented in the output, even though the mesh has a much lower resolution.

The encoder is configured to encode in or along a bitstream the following information as metadata. Per-patch, the information comprises one or more depth offsets that are applied to the mesh vertices before ray casting begins. When there is more than one offset provided, they can be arranged into a matrix and they may also be interpolated when applying to the mesh vertices.

Similarly to storing per patch LOD matrices, per patch depth offsets for raycasting may be stored inside patch_data_unit(). The same mechanism for storing either flag or size information for the depth offset matrix to signal the presence of such information may be utilized.

This ensures that the mesh triangles are in front of the actual surface, and the casted ray can assume that the first intersection represents the surface point that should be visible.

The advantage of having this metadata is that the rendering technique can be changed dynamically between normal mesh-based rendering and ray-casting-via- mesh. Some patches can be rendered as regular triangle meshes, while selected ones can use a higher-fidelity rendering using ray casting depending on the time available to render the frame.

It is appreciated that the presence of this metadata can be used as a hint by the view synthesizer when selecting the appropriate rendering method of a patch. The encoder is able to analyze the contents of each patch and determine if ray casting would be beneficial and provide the ray casting offsets where appropriate.

A method according to an embodiment is shown in Figure 7a. The method generally comprises at least receiving 710 a video presentation frame, where the video presentation represents a three-dimensional data; generating 720 one or more patches from the video presentation frame, wherein the patches represent contours of an object; generating 730 metadata to be associated with said one or more patches, wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; encoding 740 the generated metadata in or along a bitstream of a corresponding patch. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises at least means for receiving a video presentation frame, where the video presentation represents a three- dimensional data; means for generating one or more patches from the video presentation frame, wherein the patches represent contours of an object; means for generating metadata to be associated with said one or more patches, wherein the metadata comprises information for mesh-based view synthesis of the volumetric video; means for encoding the generated metadata in or along a bitstream of a corresponding patch. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 7a according to various embodiments.

A method according to another embodiment is shown in Figure 7b. The method generally comprises at least receiving 750 an encoded bitstream; decoding 760 from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video; decoding 770 from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; synthesizing 780 a novel view according to the one or more patches by using information obtained from the decoded metadata. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to another embodiment comprises at least means for receiving an encoded bitstream; means for decoding from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video; means for decoding from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; means for synthesizing a novel view according to the one or more patches by using information obtained from the decoded metadata. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 7b according to various embodiments.

An example of an apparatus is disclosed with reference to Figure 8. Figure 8 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head- mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments of the view synthesizer having metadata may provide advantages. For example, it enables VR rendering on mobile-class hardware, such as standalone virtual reality head-mounted displays. Also, the present embodiments may improve the visual appearance of surface details in the output frame, and also the visual appearance of contours in the output frame. The present embodiments also enable more efficient ray casting within the patches.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system are configured to implement a method according to various embodiments.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims

Claims:

1 . A method for encoding, comprising:

2. A method for rendering, comprising

- receiving an encoded bitstream;

3. An apparatus for encoding, comprising:

- means for receiving a video presentation frame, where the video presentation represents a three-dimensional data;

4. The apparatus according to claim 3, wherein the metadata comprises a matrix of level-of-detail values.

5. The apparatus according to claim 3, wherein the metadata comprises information on motion vector offsets.

6. The apparatus according to claim 3, wherein the metadata comprises residual motion vectors being determined based on motion vectors extracted from compressed two-dimensional video.

7. The apparatus according to claim 3, wherein the metadata comprises information on a maximum depth threshold value defining how far apart in the depth dimension vertices of a triangle are allowed to be.

8. The apparatus according to claim 3, wherein the metadata comprises information on at least one optional depth slicing value, being a value between the minimum and maximum depth values of the patch, falling between distinct depth layers of the patch.

9. The apparatus according to claim 3, wherein the metadata comprises information on per-patch factor values being applied to adjusted vertices on a near layer and a far layer.

10. The apparatus according to claim 3, wherein the metadata comprises values for a surface estimation function, describing rough geometry of a surface layer inside the patch.

11 . The apparatus according to claim 3, wherein the metadata comprises a matrix of flags, wherein the flags indicate a presence of depth contours, and wherein the matrix covers an area of a patch.

12. The apparatus according to claim 3, wherein the metadata comprises one or more depth offsets that are applied to mesh vertices before beginning ray casting.

13. An apparatus for decoding comprising

- means for receiving an encoded bitstream;

- means for decoding from the bitstream metadata being associated with one or more patches, the metadata comprising information for mesh-based view synthesis volumetric video; - means for decoding from the bitstream one or more patches for a video presentation frame, wherein the patches contain contours of an object; and

14. The apparatus according to claim 13, wherein the metadata comprises a matrix of level-of-detail values, and the method further comprising determining how many triangles to allocate to each region of a patch by utilizing level of details values, resulting in variable projected mesh quality.

15. The apparatus according to claim 13, wherein the metadata comprises information on a maximum depth threshold value defining how far apart in the depth dimensions vertices of a triangle are allowed to be, and the method further comprises detecting that one of the vertices is an outlier with regard to a pre determined threshold and rendering a triangle twice.

16. The apparatus according to claim 13, wherein the metadata comprises information on at least one optional depth slicing value, being a value between the minimum and maximum depth values of the patch, falling between distinct depth layers, and the method further comprises rendering affected triangles multiple times when a patch has one or more depth slicing values.

17. The apparatus according to claim 13, wherein the metadata comprises information on per patch surface estimation function, and the method to generate depth values for any undefined vertices.

18. The apparatus according to claim 13, wherein the metadata comprises a matrix of flags covering an area of a patch, and the method further comprises splitting a patch into multiple smaller meshes by using per-patch depth contour flags.

19. The apparatus according to claim 13, wherein the metadata comprises one or more depth offsets that are applied to mesh vertices before beginning ray casting, and the method further comprises selecting an appropriate rendering method for a patch.

20. An apparatus for encoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

21. An apparatus for decoding, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive an encoded bitstream;