WO2022258879A2 - A method, an apparatus and a computer program product for video encoding and video decoding - Google Patents

A method, an apparatus and a computer program product for video encoding and video decoding

Info

Publication number
WO2022258879A2
WO2022258879A2 PCT/FI2022/050324 FI2022050324W
Authority
WO
WIPO (PCT)
Prior art keywords
mesh
rendering
instructions
volumetric video
video
Prior art date
Application number
PCT/FI2022/050324
Other languages
French (fr)
Other versions
WO2022258879A3 (en)
Inventor
Lauri Aleksi ILOLA
Christoph BACHHUBER
Lukasz Kondrad
Jaakko Olli Taavetti KERÄNEN
Vinod Kumar Malamal Vadakital
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2022258879A2
Publication of WO2022258879A3

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 Embedding additional information in the video signal during the compression process

Abstract

The embodiments relate to a method for encoding and decoding volumetric video. The embodiments also relate to equipment for implementing the methods. The method for encoding comprises obtaining one or more video sequences representing volumetric video; generating patches by converting samples of the volumetric video to one or more two-dimensional patches; analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, deriving instructions for generating a mesh for rendering the volumetric video; and signaling said instructions along a volumetric video bitstream.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Technical Field
The present solution generally relates to encoding, signaling and rendering a volumetric video.
Background
Volumetric video data represents a three-dimensional scene or object. Such data describes geometry (shape, size, position in three-dimensional (3D) space) and respective attributes (e.g., colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or by other means, e.g., the position of an object as a function of time.
Summary
Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus comprising means for obtaining one or more video sequences representing volumetric video; means for generating patches by converting samples of the volumetric video to one or more two-dimensional patches; means for analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, means for deriving instructions for generating a mesh for rendering the volumetric video; and means for signaling said instructions along a volumetric video bitstream.
According to a second aspect, there is provided an apparatus comprising at least one processor, a memory and computer program code stored in said memory, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: obtain one or more video sequences representing volumetric video; generate patches by converting samples of the volumetric video to one or more two-dimensional patches; analyze a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, derive instructions for generating a mesh for rendering the volumetric video; and signal said instructions along a volumetric video bitstream.
According to a third aspect, there is provided a method comprising obtaining one or more video sequences representing volumetric video; generating patches by converting samples of the volumetric video to one or more two-dimensional patches; analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, deriving instructions for generating a mesh for rendering the volumetric video; and signaling said instructions along a volumetric video bitstream.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to obtain one or more video sequences representing volumetric video; generate patches by converting samples of the volumetric video to one or more two-dimensional patches; analyze a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, derive instructions for generating a mesh for rendering the volumetric video; and signal said instructions along a volumetric video bitstream.
According to a fifth aspect, there is provided an apparatus comprising means for receiving encoded volumetric video bitstream comprising also instructions for rendering mesh; means for determining resource requirements of the volumetric video bitstream; means for estimating a client specific vertex budget based on resources available and determined resource requirements; means for generating a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; means for decoding from the bitstream information on volumetric video; and means for rendering the volumetric video using a rendering mesh.
According to a sixth aspect, there is provided a method comprising receiving encoded volumetric video bitstream comprising also instructions for rendering mesh; determining resource requirements of the volumetric video bitstream; estimating a client specific vertex budget based on resources available and determined resource requirements; generating a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; decoding from the bitstream information on volumetric video; and rendering the volumetric video using a rendering mesh.
According to a seventh aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive encoded volumetric video bitstream comprising also instructions for rendering mesh; determine resource requirements of the volumetric video bitstream; estimate a client specific vertex budget based on resources available and determined resource requirements; generate a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; decode from the bitstream information on volumetric video; and render the volumetric video using a rendering mesh.
According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive encoded volumetric video bitstream comprising also instructions for rendering mesh; determine resource requirements of the volumetric video bitstream; estimate a client specific vertex budget based on resources available and determined resource requirements; generate a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; decode from the bitstream information on volumetric video; and render the volumetric video using a rendering mesh.
According to an embodiment, the rendering mesh instructions are communicated with patch generation iteratively and a test rendering is performed to refine the rendering mesh instructions.
According to an embodiment, the instructions for generating the rendering mesh comprise a number of mesh instances, mesh dimensions, and shapes of a generic mesh structure.
According to an embodiment, the precise rendering mesh is signaled externally to the bitstream.
According to an embodiment, the rendering mesh shapes are signaled as a customizable look-up table.
According to an embodiment, the following is signaled: level of detail for rendering mesh shapes, comprising level of detail indices describing relative quality between each mesh instance and flexible subdivision modifiers per mesh instance, as well as instructions for mesh surface generation type.
According to an embodiment, rendering mesh instructions are signaled for a sequence of volumetric video frames, further comprising means for indicating the preferred rendering mesh instance per patch.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1a shows an example of volumetric media conversion at an encoder;
Fig. 1b shows an example of a volumetric media reconstruction at a decoder;
Fig. 2 shows an example of block to patch mapping;
Figs. 3a - c show examples of an atlas coordinate system; a local 3D patch coordinate system; and a final target coordinate system;
Fig. 4 shows an example of renderer operation using a rendering mesh;
Fig. 5 shows an example of poor contour fit using a rendering mesh;
Fig. 6 shows an illustration of rendering errors caused by poor contour fit;
Fig. 7 shows an example of identification of typical mesh;
Fig. 8 shows an example of rendering mesh-based system optimization;
Fig. 9 shows an example of creating a rendering mesh based on typical contours in the content;
Fig. 10 shows an example of a decoder operation using a rendering mesh;
Fig. 11a is a flowchart illustrating a method according to an embodiment;
Fig. 11b is a flowchart illustrating a method according to another embodiment, and
Fig. 12 shows an apparatus according to an embodiment.
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.
Reference in this disclosure to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
The present embodiments relate to the encoding, signaling and rendering of volumetric video. Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
In the following, a short reference of ISO/IEC DIS 23090-5 Visual Volumetric Video- based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data. An example is shown in Figures 1a and 1b, where Figure 1a presents volumetric media conversion at an encoder, and where Figure 1b presents volumetric media reconstruction at the decoder side. The 3D media is converted to a series of 2D representations: occupancy, geometry, and attributes. Additional information may also be included in the bitstream to enable inverse reconstruction. Additional information that allows associating all these V3C video components and enables the inverse reconstruction, from a 2D representation back to a 3D representation, is also included in a special component, referred to in this document as the atlas. An atlas consists of multiple elements, named patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding box associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.
Atlases may be partitioned into patch packing blocks of equal size. The 2D bounding boxes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices. Figure 2 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
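As a simplified illustration of this mapping, a block-to-patch map can be built by visiting patches in coding order and marking every packing block covered by a patch's 2D bounding box. The C sketch below uses hypothetical type and function names, and it deliberately omits the precedence handling and the occupancy-based refinement mentioned above.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical, simplified patch description: a 2D bounding box in atlas samples. */
    typedef struct {
        uint32_t pos_x, pos_y;    /* top-left corner of the patch bounding box */
        uint32_t size_x, size_y;  /* width and height of the bounding box      */
    } Patch2D;

    /* Fill blockToPatch[by * blocksWide + bx] with the index of the patch whose
     * 2D bounding box covers that packing block; -1 marks unmapped blocks.
     * Patches are visited in coding order. */
    static void buildBlockToPatchMap(const Patch2D *patches, size_t patchCount,
                                     uint32_t blockSize, uint32_t blocksWide,
                                     uint32_t blocksHigh, int32_t *blockToPatch)
    {
        for (uint32_t i = 0; i < blocksWide * blocksHigh; i++)
            blockToPatch[i] = -1;

        for (size_t p = 0; p < patchCount; p++) {
            uint32_t bx0 = patches[p].pos_x / blockSize;
            uint32_t by0 = patches[p].pos_y / blockSize;
            uint32_t bx1 = (patches[p].pos_x + patches[p].size_x - 1) / blockSize;
            uint32_t by1 = (patches[p].pos_y + patches[p].size_y - 1) / blockSize;
            for (uint32_t by = by0; by <= by1 && by < blocksHigh; by++)
                for (uint32_t bx = bx0; bx <= bx1 && bx < blocksWide; bx++)
                    blockToPatch[by * blocksWide + bx] = (int32_t)p;
        }
    }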
Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
Figure 3a shows an example of a single patch packed onto an atlas image. This patch is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes. For an orthographic projection, the projection plane is equal to the sides of an axis-aligned 3D bounding box, as shown in Figure 3b. The location of the bounding box in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3dOffsetV, and TilePatch3dOffsetD, as illustrated in Figure 3c.
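A minimal C sketch of this final offset step is given below. It only adds the signalled offsets to the local patch coordinates; the per-patch projection plane selection, axis swizzling and rotation performed in an actual V3C reconstruction are intentionally left out, and the type names are hypothetical.

    #include <stdint.h>

    /* Hypothetical container for the per-patch offsets named in the text. */
    typedef struct {
        int32_t tilePatch3dOffsetU;
        int32_t tilePatch3dOffsetV;
        int32_t tilePatch3dOffsetD;
    } PatchOffsets;

    /* Convert a sample at local patch coordinates (u, v) with decoded depth d
     * into the 3D model coordinate system by adding the signalled offsets. */
    static void localToModel(const PatchOffsets *po, int32_t u, int32_t v,
                             int32_t d, int32_t xyz[3])
    {
        xyz[0] = po->tilePatch3dOffsetU + u;  /* tangent axis (U)    */
        xyz[1] = po->tilePatch3dOffsetV + v;  /* bi-tangent axis (V) */
        xyz[2] = po->tilePatch3dOffsetD + d;  /* normal axis (D)     */
    }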
The generic mechanism of V3C may be used by applications targeting volumetric content. One such application is MPEG immersive video (MIV) (ISO/IEC 23090-12). MIV enables volumetric video coding for applications in which a scene is recorded with multiple RGB(D) (red, green, blue, and optionally depth) cameras with overlapping fields of view (FoVs). One example setup is a linear array of cameras pointing towards a scene. This multiscopic view of the scene allows a 3D reconstruction and therefore 6DoF/3DoF+ consumption. MIV uses the patch data unit concept from V3C and extends it by using camera views for reprojection.
Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream. Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
V3C patch information is contained in the atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units. A NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical, except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
NAL units in the atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units. The former are dedicated to carrying patch data, while the latter carry data necessary to properly parse the ACL units or any additional auxiliary data.
In the nal_unit_header() syntax, nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0. rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:
The RBSP contains a string of data bits (SODB) as follows:
• If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
• Otherwise, the RBSP contains the SODB as follows:
  o The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
  o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:
    - The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
    - The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
    - When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e., instances of rbsp_alignment_zero_bit) are present to result in byte alignment.
• One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.
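This layout can be unwound by a parser to recover the SODB length. The following C sketch, with a hypothetical function name, skips trailing cabac_zero_word elements and alignment zero bits, drops the rbsp_stop_one_bit and returns the number of SODB bits; it illustrates the layout above rather than reproducing code from the specification.

    #include <stddef.h>
    #include <stdint.h>

    /* Return the number of SODB bits in an RBSP, assuming the trailing-bits
     * layout described above. Returns 0 for an empty RBSP. */
    static size_t sodbBitCount(const uint8_t *rbsp, size_t rbspBytes)
    {
        /* Skip trailing 16-bit cabac_zero_word elements equal to 0x0000. */
        while (rbspBytes >= 2 && rbsp[rbspBytes - 1] == 0 && rbsp[rbspBytes - 2] == 0)
            rbspBytes -= 2;
        /* Find the last non-zero byte; it carries the rbsp_stop_one_bit. */
        while (rbspBytes > 0 && rbsp[rbspBytes - 1] == 0)
            rbspBytes--;
        if (rbspBytes == 0)
            return 0;

        uint8_t lastByte = rbsp[rbspBytes - 1];
        int stopBit = 0;  /* bit position of the stop bit, counted from the LSB */
        while (((lastByte >> stopBit) & 1u) == 0)
            stopBit++;
        /* SODB bits preceding the stop bit in the last byte, plus full bytes. */
        return (rbspBytes - 1) * 8 + (size_t)(7 - stopBit);
    }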
Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example the following may be considered as typical content:
• atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a sequence level.
• atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a frame level and are valid for one or more atlas frames.
• sei_rbsp( ), used to carry SEI messages in NAL units.
• atlas_tile_group_layer_rbsp( ), used to carry patch layout information for tile groups.
When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP.
atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches, for which the metadata syntax is specified in ISO/IEC 23090-5.
Annex F of the V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signalled in the sei_rbsp() syntax structure.
Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.
Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in V3C V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted.
Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:
• Type-A essential SEI messages: These SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
• Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.
A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes. Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.
Polygon meshes are defined by the following elements:
• Vertex: A position in 3D space defined as (x, y, z) along with other information such as color (r, g, b), normal vector and texture coordinates.
• Edge: A connection between two vertices.
• Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges. A polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.
• Surfaces: or smoothing groups, are useful for grouping smooth regions, but are not required.
• Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
• Materials: defined to allow different portions of the mesh to use different shaders when rendered.
• UV coordinates: Most mesh formats also support some form of UV coordinates which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
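The elements listed above map naturally onto a simple in-memory representation. The C sketch below is illustrative only: it keeps per-vertex position, colour, normal and UV attributes together with triangle faces stored as vertex indices, and omits edges, groups, materials and other optional channels.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        float position[3];  /* (x, y, z)           */
        float colour[3];    /* (r, g, b)           */
        float normal[3];    /* normal vector       */
        float uv[2];        /* texture coordinates */
    } Vertex;

    typedef struct {
        uint32_t index[3];  /* a triangle face referencing three vertices */
    } Face;

    typedef struct {
        Vertex *vertices;
        size_t  vertexCount;
        Face   *faces;
        size_t  faceCount;
    } TriangleMesh;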
When rendering volumetric content, a choice can be made to either render the content as points or as meshes. In V-PCC and MIV, the position of the point and the colour of the point are defined by the geometry and texture maps (i.e., V3C video components) encoded as video. This type of content should ideally be rendered as points, considering that every encoded pixel represents a vertex and its colour. However, mobile GPU architecture is not ideal for rendering point primitives, due to the high number of points required to represent solid surfaces. Instead, better quality and performance are achieved through mesh-based representations, where a surface is described more conservatively using fewer vertices that are connected through faces. High visual detail is preserved through a high-resolution texture map.
Triangle mesh projection and rasterization is very efficient on mobile GPUs and provides a suitable trade-off for faster and more efficient rendering in exchange for a reduction in geometric detail. Current generation HW is highly optimized for such pipelines and mesh-based rendering remains dominant. The quality gains for mesh-based rendering are even more significant on mobile HW, where the limited capacity of batteries and the constrained cooling capabilities set additional limitations in terms of power consumption.
The following bullets describe how a rendering mesh may be used to accommodate performance and quality constraints on mobile hardware. This is also illustrated in Figure 4.
• One-time: A set of meshes is created by a client and copied to static vertex (GPU) buffers
• Per GOP: Patch data for a GOP is copied (or decoded directly) to a GPU buffer; patch ID maps generated based on patch data are copied to a GPU texture.
• Per video frame: Geometry and texture atlases/patches are copied (or decoded directly) to GPU textures.
• Per render frame: Render patches:
  o Select mesh instance (e.g., based on projection and patch dimensions) per patch.
  o Vertex shader: sample geometry, unproject sampled points of patch to 3D space -> position for each mesh vertex.
  o Fragment shader: sample texture, discard unoccupied pixels + discard based on depth contour threshold min/max.
  o Vertex and fragment shading may be repeated multiple times for patches that have more than one depth layer (= depth contours present).
• Per render frame: Post-processing step: weighted blending of overlapping textures
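The per-patch mesh instance selection in the loop above can be approximated from patch dimensions alone, which is essentially what a client has available without contour information. The C sketch below, with hypothetical type names, picks the smallest pre-built mesh that still covers the patch; distance from the camera could additionally be used to lower the chosen level of detail.

    #include <stddef.h>

    /* Hypothetical summaries of a pre-built rendering mesh and of a patch. */
    typedef struct { float extent_u, extent_v; } RenderMeshDesc;
    typedef struct { float extent_u, extent_v; } PatchDims;

    /* Return the index of the smallest available mesh that covers the patch;
     * falls back to index 0 when nothing covers it. */
    static size_t selectMeshInstance(const RenderMeshDesc *meshes, size_t meshCount,
                                     const PatchDims *patch)
    {
        size_t best = 0;
        float bestArea = -1.0f;
        for (size_t i = 0; i < meshCount; i++) {
            if (meshes[i].extent_u < patch->extent_u ||
                meshes[i].extent_v < patch->extent_v)
                continue;  /* mesh does not cover the patch */
            float area = meshes[i].extent_u * meshes[i].extent_v;
            if (bestArea < 0.0f || area < bestArea) {
                bestArea = area;
                best = i;
            }
        }
        return best;
    }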
While utilization of a rendering mesh provides important performance improvements, it also comes with significant challenges. Reduced geometric detail means that object surfaces and particularly contours are not represented accurately or correctly. This introduces particularly noticeable rendering artefacts which affect the perceived quality of the system.
V3C does not provide instructions for the client application on how to generate the rendering mesh, which would help minimize the problems caused by misaligned object contours. While the client application is able to analyze the patch information to create an approximation of the set of rendering meshes required, using patch resolution and distance from the camera, it would also be useful to know what type of contours exist in the patch. Edge detection is an expensive operation to perform while rendering and is thus not viable for a practical use case, especially on mobile hardware. Figure 5 illustrates a poor depth contour fit on the rendering mesh. Figure 6 illustrates visual artefacts caused by a poor contour fit with the rendering mesh.
Furthermore, even if static mesh contours were signaled, this would not account for temporal displacement of objects, which causes depth contours to move across polygon faces of the rendering mesh. For these reasons, it is better to consider encoder-side analysis and signaling of helper data.
As mentioned, a rendering mesh may be used to improve rendering quality on resource-constrained mobile hardware. The drawback of using a rendering mesh is that, on one hand, calculating a set of sufficient rendering meshes is too expensive for a runtime process, while on the other hand, a poorly fitted rendering mesh results in severe visual artefacts. V3C does not accommodate signaling for a precalculated rendering mesh, and as such this invention proposes novel signaling for a scalable rendering mesh per content.
By analyzing the scene geometry as an encoder side process, typical depth contours and patch dimensions may be identified over the sequence or over parts of the sequence. The analysis should provide instructions for generating a rendering mesh that optimizes quality based on both client device and application capabilities as well as content itself. The instructions for generating the rendering mesh may be signaled along the volumetric video bitstream to improve rendering quality around edges and help client device manage rendering budget. Temporal updates to the set of instructions may occur, or all instructions may be signaled once in the beginning.
Additionally, generation of the rendering mesh could communicate with patch generation iteratively, to improve patch segmentation to better match the rendering mesh. As an example, patches could be refined so that depth contours remain temporally stable by updating the offsets of the patch. Furthermore, patches which contain multiple depth contours could be divided into multiple patches which each contain just a single depth contour. Figure 7 illustrates how depth contours may be identified from the input depth map, and how to generate a sample of rendering meshes based on the common depth contours.
In the following, an encoder according to an embodiment is discussed in a more detailed manner. The encoder is configured to perform content analysis to derive a set of instructions for generating a rendering mesh, and an iterative process for improving patch generation based on depth contours. In addition, signaling is discussed in a more detailed manner. Signaling comprises signaling of instructions for the rendering mesh (type and quantity). These instructions comprise precise mesh dimensions, shapes of generic mesh structures, and rendering mesh shapes signaled as a customizable look-up table. In addition, the signaling comprises LOD details for rendering mesh shapes. This comprises LOD indices describing relative quality between each mesh instance, and flexible subdivision modifiers per mesh instance for more fine-grained quality control. In addition, the signaling comprises per patch data.
Encoding:
In order to signal a rendering mesh for given content, the content itself should be analyzed. How to arrive at the conclusion of what constitutes an optimal rendering mesh is an encoder optimization choice. Thus, it does not matter how the decision of the preferred rendering mesh was reached. The preferred rendering mesh may come from the content creator, who has precise 3D models of the rendered assets. The rendering mesh may be also estimated by the encoder, and there can be as many methods for choosing the rendering mesh as there are encoders. As an example, the following criteria may be considered by the encoder to compute a suitable rendering mesh:
• Number of patches
• Resolution of patches
• Distance of a patch from the viewing volume
• Presence of depth contours in the patch
• Presence of multiple depth levels in the patch
• Presence of moving depth contours in the patch
The analysis will result in a set of rendering meshes, which may be included in V3C bitstream and furthermore encapsulated in a file format, where the client application may easily access the relevant information. The encoder system overview is described in Figure 8.
The system overview contains an optional refinement loop, where a test rendering is performed using the proposed intermediate rendering mesh to validate that the content meets certain quality criteria. Refinement information may be extracted from the test rendering to improve patch generation. This control loop is especially useful for reducing the number of depth levels in a patch.
Signaling:
The result of the analysis is instructions for generating a rendering mesh. The instructions may be conveyed in several forms.
In one embodiment explicit dimensions of the rendering mesh are provided. This can for example mean describing the shape of meshes (rectangle, circle, or other polygons as needed), the dimensions of meshes and the number of instances required. E.g.
• 6 rectangles with 50m x 50m (e.g., background cube faces)
• 4 rectangles with 10m x 5m (medium distance large objects)
• 20 rectangles with 1m x 1m (medium distance small objects)
• 40 rectangles with 0.2m x 0.2m (close distance large patches)
• 100 rectangles with 0.1m x 0.1m (close distance small patches)
The syntax for the rendering mesh according to this embodiment is described by the following fields.
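A C-style sketch of one possible layout for these fields is given below. The field names follow the semantics described next, while the widths shown as comments and the ordering are assumptions rather than descriptors taken from a specification.

    #include <stdint.h>

    /* Illustrative only: one entry describing a rendering mesh type. */
    typedef struct {
        uint8_t  rendering_mesh_id;      /* assumed u(8): unique identifier        */
        uint16_t rendering_mesh_count;   /* assumed u(16): number of instances     */
        uint8_t  rendering_mesh_shape;   /* assumed u(8): index into shape mapping */
        uint8_t  rendering_mesh_type;    /* assumed u(2): 0 = strip, 1 = fan       */
        float    rendering_mesh_size_u;  /* vertical size (metres or other units)  */
        float    rendering_mesh_size_v;  /* horizontal size                        */
    } RenderingMeshEntry;

    /* Illustrative only: the list carried by a rendering mesh extension. */
    typedef struct {
        uint8_t  num_rendering_mesh_minus1;    /* assumed u(8)                               */
        uint8_t  rendering_mesh_updates_flag;  /* assumed u(1)                               */
        RenderingMeshEntry entries[256];       /* num_rendering_mesh_minus1 + 1 entries used */
    } RenderingMeshInfo;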
num_rendering_mesh_minus1 provides the number of different types of rendering meshes in the bitstream. rendering_mesh_updates_flag indicates if updates to the shapes are expected. Depending on where the signaling for the rendering mesh takes place, updates may or may not be allowed. rendering_mesh_id is a unique identifier for the defined rendering mesh. rendering_mesh_count provides the information on how many rendering meshes of the given type are needed. rendering_mesh_shape describes the shape of the rendering mesh based on a table that maps values to rendering mesh shapes.
rendering_mesh_type describes how the mesh is constructed using the mesh shape. Value equal to 0 indicates that triangle strip should be used. Value equal to 1 indicates that triangle fan should be used. Other values may exist depending on the set of shape types used. rendering_mesh_size_u provides the vertical size of the rendering mesh plane in meters or in other display units, such as pixels or voxels. rendering_mesh_size_v provides the horizontal size of the rendering mesh plane in meters or in other display units, such as pixels or voxels.
Signaling using explicit dimensions vs. “pixel-units” could depend on the type of source camera. For normal cameras, the perspective projection results in a different size of the patch based on FoV and distance from the camera. V3C mesh signaling could be done in pixels when orthographic projection is used.
In another embodiment generic shapes may be provided as target mesh types. In addition, the number of each mesh type for instancing is signaled. E.g.
• 60 rectangles
• 6 circles
• 20 triangles
In another embodiment the rendering mesh types may be signaled as a look-up table, which is generated during the analysis. The process of deriving a rendering mesh based on a rendering mesh look-up table is described in Figure 9.
In addition to the look-up table, the number of meshes for each shape described in the look-up table is signaled.
For each mesh shape that is signaled, relative quality indicators and quantities are also signaled. This allows the client application to adaptively generate optimized rendering mesh based on its computational budget.
In one embodiment, level of detail factors (LOD) are assigned to each shape for a given quantity. LOD may mean that LOD1 has twice as many subdivisions as LOD2. Alternatively, an arbitrary LOD descriptor may provide the exact relative meaning between LODs. Using LODs, the example instructions for generating a rendering mesh could look like:
• 6 rectangles with 50m x 50m (e.g., background cube faces)
• 4 x LOD2, 2 x LOD1 (top and bottom faces are lower quality than the sides)
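One possible concrete reading of such relative LOD factors, stated here only as an assumption for illustration, is that each increase of the LOD index halves the subdivision count relative to LOD1:

    #include <stdint.h>

    /* Illustrative interpretation: LOD1 keeps the base subdivision count and
     * every further LOD step halves it, never dropping below one division. */
    static uint32_t subdivisionsForLod(uint32_t baseSubdivisions, uint32_t lod)
    {
        uint32_t s = baseSubdivisions;
        for (uint32_t level = lod; level > 1; level--)
            s /= 2;
        return s > 0 ? s : 1;
    }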
In another embodiment explicit sub-division modifiers per mesh type for a given quantity may be provided. Sub-division modifiers may be provided flexibly to accommodate different shape types. This could look like the following:
• 6 rectangles with 50m x 50m (e.g., background cube faces)
• 4 x (10U, 15V), 10 divisions horizontally and 15 vertically in mesh space
• 2 x (5U, 10V), 5 divisions horizontally and 10 vertically in mesh space
Furthermore, the type of the rendering mesh needs to be signaled. In one embodiment this can be signaled as a triangle strip or triangle fan type of description. For example, when a circular mesh shape is chosen, it is better to choose a triangle fan type of mesh structure.
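To make the subdivision modifiers and the surface generation type concrete, the C sketch below builds a rectangular mesh with the requested horizontal and vertical divisions and emits triangle-strip indices, one strip per row of quads. This is one possible client-side construction, not a procedure mandated by the signaling; the primitive-restart convention is an assumption.

    #include <stddef.h>
    #include <stdint.h>

    #define RESTART_INDEX 0xFFFFFFFFu  /* assumed primitive-restart value */

    /* vertices must hold (divU + 1) * (divV + 1) * 3 floats,
     * indices must hold divV * (2 * (divU + 1) + 1) entries. */
    static void buildRectangleStrip(float sizeU, float sizeV,
                                    uint32_t divU, uint32_t divV,
                                    float *vertices, uint32_t *indices)
    {
        /* Grid vertices on the mesh plane; depth is applied later by the
         * vertex shader from the geometry video component. */
        for (uint32_t v = 0; v <= divV; v++) {
            for (uint32_t u = 0; u <= divU; u++) {
                float *p = &vertices[3 * (v * (divU + 1) + u)];
                p[0] = sizeU * (float)u / (float)divU;
                p[1] = sizeV * (float)v / (float)divV;
                p[2] = 0.0f;
            }
        }
        /* One triangle strip per row of quads, separated by restart indices. */
        size_t k = 0;
        for (uint32_t v = 0; v < divV; v++) {
            for (uint32_t u = 0; u <= divU; u++) {
                indices[k++] = (v + 1) * (divU + 1) + u;
                indices[k++] = v * (divU + 1) + u;
            }
            indices[k++] = RESTART_INDEX;
        }
    }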
In another embodiment, the rendering mesh may be signaled explicitly, by providing the precise mesh information to the client. The information may be encapsulated in the V3C bitstream or provided by external means, such as glTF. In this embodiment the exact format of the rendering mesh is described, containing a list of vertices as well as the connectivity information describing the faces.
The previous embodiments focused on the description of the rendering mesh, which is helpful for the client to determine how to allocate the resources for the rendering process. To project patches in high quality on the set of correct rendering mesh instances, the patches need to be linked to the rendering meshes that are compatible with the patches.
In one embodiment, the patch_data_unit syntax structure is amended to contain linking information to rendering mesh indices.
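A C-style sketch of the added linking fields is shown below; the exact placement inside patch_data_unit() and the bit widths are assumptions, while the field names follow the semantics described next.

    #include <stdint.h>

    /* Illustrative only: parsed values of the linking information. */
    typedef struct {
        /* signalled at sequence level (ASPS), assumed u(1): */
        uint8_t asps_target_rendering_mesh_enabled_flag;
        /* signalled per patch when the flag above equals 1, assumed u(8);
         * indexes the list in vps_rendering_mesh_extension(): */
        uint8_t pdu_target_rendering_mesh_id;
    } PatchRenderingMeshLink;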
asps_target_rendering_mesh_enabled_flag indicates that there is a preferred target rendering mesh shape identified with the patch. pdu_target_rendering_mesh_id provides the index to the preferred target rendering mesh in the list of rendering meshes in vps_rendering_mesh_extension().
The rendering mesh may be signaled using V3C bitstream level constructs, file format level tools, or using an external out-of-band mechanism. The presence of a rendering mesh may be signaled in the bitstream. This can be done in several syntax structures, such as v3c_parameter_set(), vps_miv_extension(), asps_miv_extension(), atlas_sequence_parameter_set_rbsp(), afps_miv_extension(), casps_miv_extension(), caf_miv_extension(), or atlas_frame_parameter_set_rbsp(), depending on how often the rendering mesh information is expected to be updated and which specification is used to encapsulate the volumetric video bitstream. As an example, encapsulation in v3c_parameter_set() uses a presence flag with the following semantics.
vps_rendering_mesh_extension_present_flag equal to 1 specifies that the vps_rendering_mesh_extension( ) syntax structure is present in the v3c_parameter_set( ) syntax structure. vps_rendering_mesh_extension_present_flag equal to 0 specifies that this syntax structure is not present. When not present, the value of vps_rendering_mesh_extension_present_flag is inferred to be equal to 0.
The rendering mesh syntax structure depends on the place in the bitstream where it is signalled. The embodiments related to the format of the rendering mesh provide further details on what the rendering_mesh() structure could look like.
According to another embodiment new SEI messages may be added in V3C bitstream to signal the rendering mesh. According to another embodiment, the vui_parameters() may be extended to carry information related to rendering mesh.
Alternatively, or additionally, the rendering mesh information may be signaled using file format level functionality.
According to an embodiment the rendering mesh may be stored in the sample entry of the track, which carries atlas bitstream.
According to another embodiment, the rendering mesh may be stored in the sample of a new track for which a dedicated sample format is defined.
According to another embodiment, the rendering mesh may be stored as item(s) in the file format. In this embodiment, it is possible to store external glTF objects as items that describe the mesh. Alternatively, other binary representations may be used, provided that a proper item property description is provided. Especially when using a rendering mesh look-up table, the look-up table could be signaled as an item.
According to another embodiment, the rendering mesh may be signaled out-of-band directly to the client application.
Decoder:
According to one embodiment, the decoder operates as follows. In order to efficiently utilize the signalled rendering mesh, the client device needs to estimate how many resources are needed to process the V3C bitstream, e.g. how many video decoders are needed and how many pixels of texture data need to be processed by the GPU per frame. Furthermore, an estimation of the vertex budget needs to be made to decide on the granularity of the rendering mesh. Decoder side operation is described in Figure 10. The rendering mesh may need to be refined or recalculated based on the signaling of the V3C bitstream, or due to other constraints.
The end result of the client using the signaled rendering mesh is improved quality of rendering, which is tailored based on the content and the client device capabilities. For example, the following client device capabilities could be considered when generating the rendering mesh:
• CPU performance
• GPU performance, GPU technologies supported, e.g., OES_EGL_image_external or Vulkan video API, tessellation, or mesh shading features, etc.
• Target device resolution
• Availability of hardware accelerated video decoders.
Based on the signaled rendering mesh, resource requirements and device capabilities, actual rendering mesh is generated and used for rendering V3C content. The rendering mesh may be updated based on information in V3C bitstream or if there is a change in the target device capabilities.
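As one illustration of how a client could act on this, the C sketch below spreads a total vertex budget over the signalled mesh types in proportion to a relative LOD weight and derives grid subdivisions from the per-instance share. All type and function names are hypothetical and the policy is only an example, not behaviour mandated by the signaling.

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical summary of one signalled mesh type after parsing. */
    typedef struct {
        uint32_t count;      /* rendering_mesh_count                     */
        float    lodWeight;  /* relative quality weight derived from LOD */
        uint32_t divU, divV; /* output: chosen subdivisions per instance */
    } MeshBudgetEntry;

    static void allocateVertexBudget(MeshBudgetEntry *meshes, size_t meshCount,
                                     uint32_t totalVertexBudget)
    {
        float weightSum = 0.0f;
        for (size_t i = 0; i < meshCount; i++)
            weightSum += meshes[i].lodWeight * (float)meshes[i].count;
        if (weightSum <= 0.0f)
            return;

        for (size_t i = 0; i < meshCount; i++) {
            /* vertices available for one instance of this mesh type */
            float perInstance = (float)totalVertexBudget * meshes[i].lodWeight / weightSum;
            /* a grid with d vertices per side has d * d vertices */
            uint32_t d = (uint32_t)floorf(sqrtf(perInstance));
            if (d < 2)
                d = 2;  /* never drop below a single quad */
            meshes[i].divU = d - 1;
            meshes[i].divV = d - 1;
        }
    }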
The method according to an embodiment is shown in Figure 11a. The method generally comprises obtaining 1105 one or more video sequences representing volumetric video; generating 1110 patches by converting samples of the volumetric video to one or more two-dimensional patches; analyzing 1115 a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, deriving 1120 instructions for generating a mesh for rendering the volumetric video; and signaling 1125 said instructions along a volumetric video bitstream. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for obtaining one or more video sequences representing volumetric video; means for generating patches by converting samples of the volumetric video to one or more two-dimensional patches; means for analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences; as a result of analysis, means for deriving instructions for generating a mesh for rendering the volumetric video; and means for signaling said instructions along a volumetric video bitstream. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11a according to various embodiments.
The method according to another embodiment is shown in Figure 11b. The method generally comprises receiving 1130 encoded volumetric video bitstream comprising also instructions for rendering mesh; determining 1135 resource requirements of the volumetric video bitstream; estimating 1140 a client specific vertex budget based on resources available and determined resource requirements; generating 1145 a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; decoding 1150 from the bitstream information on volumetric video; and rendering 1155 the volumetric video using a rendering mesh. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to another embodiment comprises means for receiving encoded volumetric video bitstream comprising also instructions for rendering mesh; means for determining resource requirements of the volumetric video bitstream; means for estimating a client specific vertex budget based on resources available and determined resource requirements; means for generating a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh; means for decoding from the bitstream information on volumetric video; and means for rendering the volumetric video using a rendering mesh. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11b according to various embodiments.
An example of an apparatus is disclosed with reference to Figure 12. Figure 12 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
The various embodiments may provide advantages. For example, the present embodiments improve rendering quality around edges. In addition, the present embodiments improve allocation of vertices around object edges. Further, the present embodiments improve rendering performance by avoiding overly dense meshes. The present embodiments provide scalable approach for mesh quality signaling, which may be adjusted by client device capabilities. The present embodiments can be used to remove most visible rendering artefacts that are typical for mesh-based rendering of V3C content.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of the various embodiments.
A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims

Claims:
1. An apparatus, comprising:
- means for obtaining one or more video sequences representing volumetric video;
- means for generating patches by converting samples of the volumetric video to one or more two-dimensional patches;
- means for analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences;
- as a result of analysis, means for deriving instructions for generating a mesh for rendering the volumetric video; and
- means for signaling said instructions along a volumetric video bitstream.
2. The apparatus according to claim 1, further comprising means for communicating the rendering mesh instructions with patch generation iteratively and performing a test rendering to refine rendering mesh instructions.
3. The apparatus according to claim 1 or 2, wherein the instructions for generating the rendering mesh comprise a number of mesh instances, mesh dimensions, and shapes of a generic mesh structure.
4. The apparatus according to claim 1 or 2, wherein the precise rendering mesh is signaled externally to the bitstream.
5. The apparatus according to claim 3, wherein the rendering mesh shapes are signaled as a customizable look-up table.
6. The apparatus according to any of the claims 1 to 5, further comprising means for signaling level of detail for rendering mesh shapes, comprising level of detail indices describing relative quality between each mesh instance and flexible subdivision modifiers per mesh instance, as well as instructions for mesh surface generation type.
7. The apparatus according to any of the claims 1 to 6, comprising signaling rendering mesh instructions for a sequence of volumetric video frames, further comprising means for indicating the preferred rendering mesh instance per patch.
8. An apparatus comprising
- means for receiving encoded volumetric video bitstream comprising also instructions for rendering mesh;
- means for determining resource requirements of the volumetric video bitstream;
- means for estimating a client specific vertex budget based on resources available and determined resource requirements;
- means for generating a rendering mesh using estimated client specific vertex budget and instructions for rendering mesh;
- means for decoding from the bitstream information on volumetric video; and
- means for rendering the volumetric video using a rendering mesh.
9. A method, comprising:
- obtaining one or more video sequences representing volumetric video;
- generating patches by converting samples of the volumetric video to one or more two-dimensional patches;
- analyzing a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences;
- as a result of analysis, deriving instructions for generating a mesh for rendering the volumetric video; and
- signaling said instructions along a volumetric video bitstream.
10. The method according to claim 9, further comprising communicating the rendering mesh instructions with patch generation iteratively and performing a test rendering to refine rendering mesh instructions.
11. The method according to claim 9 or 10, wherein the instructions for generating the rendering mesh comprise a number of mesh instances, mesh dimensions, and shapes of generic mesh structures.
12. The method according to claim 9 or 10, wherein a precise rendering mesh is signaled externally to the bitstream.
13. The method according to claim 11, wherein the rendering mesh shapes are signaled as a customizable look-up table.
14. The method according to any of the claims 9 to 13, further comprising signaling levels of detail for rendering mesh shapes, comprising level of detail indices describing relative quality between mesh instances and flexible subdivision modifiers per mesh instance, as well as instructions for a mesh surface generation type.
15. The method according to any of the claims 9 to 14, comprising signaling rendering mesh instructions for a sequence of volumetric video frames, further comprising indicating the preferred rendering mesh instance per patch.
16. A method, comprising:
- receiving an encoded volumetric video bitstream also comprising instructions for a rendering mesh;
- determining resource requirements of the volumetric video bitstream;
- estimating a client specific vertex budget based on available resources and the determined resource requirements;
- generating the rendering mesh using the estimated client specific vertex budget and the instructions for the rendering mesh;
- decoding from the bitstream information on the volumetric video; and
- rendering the volumetric video using the rendering mesh.
17. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- obtain one or more video sequences representing volumetric video;
- generate patches by converting samples of the volumetric video to one or more two-dimensional patches;
- analyze a scene geometry of the volumetric video to identify depth contours and patch dimensions over one or more video sequences or over parts of the video sequences;
- as a result of analysis, derive instructions for generating a mesh for rendering the volumetric video; and
- signal said instructions along a volumetric video bitstream.
18. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- receive an encoded volumetric video bitstream also comprising instructions for a rendering mesh;
- determine resource requirements of the volumetric video bitstream;
- estimate a client specific vertex budget based on available resources and the determined resource requirements;
- generate the rendering mesh using the estimated client specific vertex budget and the instructions for the rendering mesh;
- decode from the bitstream information on the volumetric video; and
- render the volumetric video using the rendering mesh.
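
The following is an illustrative, non-limiting sketch in Python of the encoder-side analysis recited in claims 1, 9 and 17: patch geometry is analyzed and per-instance rendering mesh instructions (number of mesh instances, dimensions, shape look-up table index, level of detail index and subdivision modifier) are derived so that they can be signaled along the volumetric video bitstream. The data structures, field names and the flat_threshold heuristic below are assumptions introduced only for illustration and are not defined by this disclosure.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Patch:
    width: int             # patch width in the atlas, in samples
    height: int            # patch height in the atlas, in samples
    depth_range: float     # spread of depth values inside the patch

@dataclass
class MeshInstruction:
    mesh_instance_id: int
    dimensions: Tuple[int, int]   # (width, height) of the generic mesh grid
    shape_index: int              # index into a customizable shape look-up table
    lod_index: int                # relative quality between mesh instances
    subdivision_modifier: int     # flexible per-instance subdivision modifier

def derive_mesh_instructions(patches: List[Patch],
                             flat_threshold: float = 0.05) -> List[MeshInstruction]:
    """Analyze patch geometry and derive per-instance rendering mesh instructions."""
    instructions = []
    for i, patch in enumerate(patches):
        # Patches with little depth variation can be covered by a coarse planar grid;
        # strongly contoured patches are assigned a denser, subdividable mesh instance.
        if patch.depth_range < flat_threshold:
            lod, subdivision, shape = 0, 0, 0
        else:
            lod, subdivision, shape = 2, 2, 1
        instructions.append(MeshInstruction(
            mesh_instance_id=i,
            dimensions=(patch.width, patch.height),
            shape_index=shape,
            lod_index=lod,
            subdivision_modifier=subdivision,
        ))
    return instructions

The derived instructions would then be serialized alongside the volumetric video bitstream; the exact carriage mechanism is outside the scope of this sketch.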
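
Correspondingly, the following non-limiting sketch outlines the decoder/renderer-side flow recited in claims 8, 16 and 18: a client specific vertex budget is estimated from the resource requirements of the bitstream and the resources available to the client, after which the rendering mesh is generated from the signaled instructions within that budget. The budgeting heuristic, the vertex-count model and the reuse of the MeshInstruction structure from the previous sketch are assumptions for illustration only.

from typing import Dict, List

def estimate_vertex_budget(stream_vertex_requirement: int,
                           gpu_vertices_per_frame: int,
                           memory_headroom_mb: float,
                           bytes_per_vertex: int = 32) -> int:
    """Derive a client specific vertex budget from the resource requirements of the
    bitstream and the resources available on this client (a simple heuristic)."""
    memory_cap = int(memory_headroom_mb * 1_000_000 / bytes_per_vertex)
    client_cap = min(gpu_vertices_per_frame, memory_cap)
    return min(stream_vertex_requirement, client_cap)

def generate_rendering_mesh(instructions: List["MeshInstruction"],
                            vertex_budget: int) -> Dict[int, int]:
    """Instantiate generic mesh grids per the signaled instructions, lowering the
    subdivision level until the total vertex count fits within the budget."""
    def vertex_count(instr, level):
        # Assumed model: a grid of (w * 2^level) x (h * 2^level) quads.
        w, h = instr.dimensions
        return (w * (2 ** level) + 1) * (h * (2 ** level) + 1)

    levels = {i.mesh_instance_id: i.subdivision_modifier for i in instructions}
    while sum(vertex_count(i, levels[i.mesh_instance_id]) for i in instructions) > vertex_budget:
        # Reduce the level of detail of the most expensive mesh instance first.
        worst = max(instructions, key=lambda i: vertex_count(i, levels[i.mesh_instance_id]))
        if levels[worst.mesh_instance_id] == 0:
            break  # cannot reduce further; render with the minimum mesh density
        levels[worst.mesh_instance_id] -= 1
    # Per-instance subdivision levels used to build the actual rendering meshes.
    return levels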
PCT/FI2022/050324 2021-06-09 2022-05-13 A method, an apparatus and a computer program product for video encoding and video decoding WO2022258879A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20215669 2021-06-09
FI20215669 2021-06-09

Publications (2)

Publication Number Publication Date
WO2022258879A2 true WO2022258879A2 (en) 2022-12-15
WO2022258879A3 WO2022258879A3 (en) 2023-01-26

Family

ID=84426410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2022/050324 WO2022258879A2 (en) 2021-06-09 2022-05-13 A method, an apparatus and a computer program product for video encoding and video decoding

Country Status (1)

Country Link
WO (1) WO2022258879A2 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7203942B2 (en) * 2001-09-25 2007-04-10 Interuniversitair Microelektronica Centrum Method for operating a real-time multimedia terminal in a QoS manner
WO2013029232A1 (en) * 2011-08-30 2013-03-07 Technicolor (China) Technology Co., Ltd. Multi-resolution 3d textured mesh coding
US11450030B2 (en) * 2019-09-24 2022-09-20 Apple Inc. Three-dimensional mesh compression using a video encoder

Also Published As

Publication number Publication date
WO2022258879A3 (en) 2023-01-26

Similar Documents

Publication Publication Date Title
US20220116659A1 (en) A method, an apparatus and a computer program product for volumetric video
US20210329052A1 (en) Point cloud data transmission apparatus, point cloud data transmission method, point cloud data reception apparatus and point cloud data reception method
US11315270B2 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
US20230050860A1 (en) An apparatus, a method and a computer program for volumetric video
US20220417557A1 (en) Device and method for processing point cloud data
CN115398890B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
US11711535B2 (en) Video-based point cloud compression model to world signaling information
WO2021260266A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021191495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20220383552A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
WO2023144445A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20230119830A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2022258879A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP4133719A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2020012071A1 (en) A method, an apparatus and a computer program product for volumetric video coding
US20230171427A1 (en) Method, An Apparatus and a Computer Program Product for Video Encoding and Video Decoding
US20220292763A1 (en) Dynamic Re-Lighting of Volumetric Video
US20230326138A1 (en) Compression of Mesh Geometry Based on 3D Patch Contours
US20230316647A1 (en) Curvature-Guided Inter-Patch 3D Inpainting for Dynamic Mesh Coding
US20230298218A1 (en) V3C or Other Video-Based Coding Patch Correction Vector Determination, Signaling, and Usage
EP4311239A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
WO2023144439A1 (en) A method, an apparatus and a computer program product for video coding
WO2023047021A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20230362409A1 (en) A method and apparatus for signaling depth of multi-plane images-based volumetric video
WO2023001623A1 (en) V3c patch connectivity signaling for mesh compression

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE