WO2023041838A1 - An apparatus, a method and a computer program for volumetric video


Info

Publication number
WO2023041838A1
Authority
WO
WIPO (PCT)
Prior art keywords
mesh
vertex
volumetric media
frames
volumetric
Application number
PCT/FI2022/050473
Other languages
French (fr)
Inventor
Lauri Aleksi ILOLA
Lukasz Kondrad
Jozsef Szabo
Christoph BACHHUBER
Aleksei MARTEMIANOV
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy
Priority to EP22869479.0A (published as EP4402637A1)
Publication of WO2023041838A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/001 Model-based coding, e.g. wire frame
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/004 Predictors, e.g. intraframe, interframe coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/271 Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/282 Image signal generators for generating image signals corresponding to three or more geometrical viewpoints, e.g. multi-view systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N2213/00 Details of stereoscopic systems
    • H04N2213/005 Aspects relating to the "3D+depth" image format

Definitions

  • the present invention relates to an apparatus, a method and a computer program for volumetric video coding.
  • Visual volumetric video-based coding (V3C; defined in ISO/IEC DIS 23090-5) provides a generic syntax and mechanism for volumetric video coding.
  • the generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames.
  • the purpose of the specification is to define how to decode and interpret the associated data (atlas data in ISO/IEC 23090-5), which tells a renderer how to interpret 2D frames for reconstructing volumetric frames.
  • The current definition of V3C (ISO/IEC 23090-5) comprises two applications, i.e. video-based point cloud compression (V-PCC; defined in ISO/IEC 23090-5) and MPEG immersive video (MIV; defined in ISO/IEC 23090-12).
  • V-PCC video-based point cloud compression
  • MIV MPEG immersive video
  • The MPEG 3DG (ISO SC29 WG7) group has started work on a third volumetric video coding application, i.e. V3C mesh compression.
  • In V3C mesh compression, polygon meshes represented by vertices, edges, faces, polygons and surfaces are encoded by compressing e.g. vertex positions in 3D, connectivity data (faces) and UV coordinates, as well as additional per-vertex attributes.
  • the mesh compression technologies intended to be used in further development of V3C mesh compression produce a geometry compression that consists of intra frames only, without temporal inter-prediction.
  • a method comprising: obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encoding the first volumetric media frame and the related set of vertex and mesh attributes into
  • An apparatus comprises: means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the related set of
  • the set of vertex and mesh attributes of a mesh comprises one or more of the following: vertex positions, indices, faces, UV coordinates, additional attributes.
  • the apparatus comprises: means for quantizing the set of vertex and mesh attributes for each volumetric media frame before determining the difference of the at least one attribute.
  • the apparatus comprises: means for encoding a unique index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute.
  • the apparatus comprises: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
  • V3C Visual Volumetric Video-based Coding
  • the apparatus comprises: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with an encoding different from Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
  • the apparatus comprises: means for encoding information of obsolete, new or updated mesh entities by adding attributes for describing mesh entity status.
  • the mesh information is configured to be carried in a V3C bitstream in one of the following containers:
  • a V3C unit configured to store mesh information,
  • a NAL unit configured to store mesh information inside the atlas sub-bitstream,
  • an SEI message configured to store the mesh information, or
  • one or more raw byte sequence payload (RBSP) patch modes configured to store mesh data inside V3C NAL units.
  • the apparatus comprises: means for including an indication about the mesh compression technique used to compress the intra frames and inter frames in or along said bitstream.
  • said indication about the mesh compression technique is configured to be carried by at least one syntax element included as an extension to the V3C parameter set syntax structure.
  • An apparatus comprises at least one processor and at least one memory, said at least one memory having computer program code stored thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtain a set of vertex and mesh attributes from the plurality of volumetric media frames; compare the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determine, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determine a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media
  • a method comprises: receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and rendering said at least one mesh into a reconstructed 3D volumetric media frame.
  • An apparatus comprises: means for receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; means for receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; means for decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; means for decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; means for decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and means for rendering said at least one mesh into a reconstructed 3D volumetric media frame.
  • An apparatus comprises: at least one processor and at least one memory, said at least one memory having computer program code stored thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receive, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decode, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decode vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decode said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based
  • Computer readable storage media comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods.
  • Figs. 1a and 1b show an encoder and a decoder for encoding and decoding 2D pictures;
  • Figs. 2a and 2b show a compression and a decompression process for 3D volumetric video;
  • Fig. 3 shows an example of block-to-patch mapping with 4 projected patches onto an atlas;
  • Figs. 4a-4c show an illustrative example of a patch projection into the 2D domain for atlas data;
  • Figs. 5a and 5b show extensions to the V-PCC encoder and decoder to support mesh encoding and mesh decoding;
  • Fig. 6 shows an example illustrating the temporal correlation between two temporally adjacent meshes;
  • Fig. 7 shows a flow chart of an encoding method according to an embodiment;
  • Fig. 8 shows an exemplified block chart of an apparatus according to an embodiment;
  • Fig. 9 shows a flow chart of a decoding method according to an embodiment;
  • Fig. 10 shows an exemplified block chart of an apparatus according to an embodiment.
  • a video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form.
  • An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).
  • Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints of the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
  • Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane.
  • Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane.
  • 2D two-dimensional
  • Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.
  • Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint.
  • Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications.
  • AR augmented reality
  • VR virtual reality
  • MR mixed reality
  • Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g. color, opacity, reflectance, etc.), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera rig, a laser scan, a combination of video and dedicated depth sensors, etc. A combination of CGI and real-world data is also possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
  • 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is coding this 3D data as a set of texture and depth map as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.
  • each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance.
  • A point cloud is a set of data points in a coordinate system, for example in a three-dimensional coordinate system defined by X, Y, and Z coordinates.
  • the points may represent an external surface of an object in the scene space, e.g. in a three-dimensional space.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of these representations becomes fundamental.
  • Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
  • A 3D scene, represented as meshes, points, and/or voxels, can be projected onto one or more geometries. These geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
  • Figs. 1a and 1b show an encoder and a decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures.
  • a video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).
  • An example of an encoding process is illustrated in Figure 1a.
  • Figure 1a illustrates an image to be encoded (I_n); a predicted representation of an image block (P'_n); a prediction error signal (D_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (P_inter); intra prediction (P_intra); mode selection (MS) and filtering (F).
  • Figure 1b illustrates a predicted representation of an image block (P'_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
  • Many hybrid video encoders encode the video information in two phases.
  • Firstly, pixel values in a certain picture area are predicted, for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner).
  • Secondly, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients.
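  • As a rough illustration of this two-phase coding loop, the following minimal sketch assumes an 8x8 block, a flat quantization step, and SciPy's floating-point DCT in place of a codec's normative integer transform:

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_residual(original, predicted, qstep=8.0):
    residual = original.astype(np.float64) - predicted  # prediction error
    coeffs = dctn(residual, norm="ortho")               # forward transform
    return np.round(coeffs / qstep)                     # quantized levels (entropy-coded in a real codec)

def decode_residual(levels, predicted, qstep=8.0):
    coeffs = levels * qstep                             # inverse quantization
    residual = idctn(coeffs, norm="ortho")              # inverse transform
    return predicted + residual                         # reconstructed block
```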
  • DCT Discrete Cosine Transform
  • Video codecs may also provide a transform skip mode, which the encoders may choose to use.
  • the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder.
  • a coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning.
  • a coding tree block may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning.
  • a coding tree unit may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a coding unit may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.
  • a CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
  • a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs.
  • the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum.
  • a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit.
  • a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning.
  • an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment
  • a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order.
  • a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment
  • a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment.
  • the CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
  • Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters.
  • Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding.
  • Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
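  • As an illustration, the order-0 unsigned Exp-Golomb code mentioned above can be sketched as follows (a minimal sketch operating on bit strings; real codecs read and write packed bit buffers):

```python
def exp_golomb_encode(value: int) -> str:
    # Code value+1 in binary, prefixed by (length - 1) zero bits.
    code = value + 1
    return "0" * (code.bit_length() - 1) + format(code, "b")

def exp_golomb_decode(bits: str) -> int:
    # Count the leading zeros, then read that many bits plus one.
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:2 * zeros + 1], 2) - 1
```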
  • the phrase along the bitstream may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • the phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream.
  • an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
  • a first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface.
  • the scene model may comprise a number of further source volumes.
  • data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels. Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec. This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed.
  • first geometry information encoded into or along the bitstream.
  • encoding a geometry (or depth) picture into or along the bitstream with the texture picture is only optional, for example in cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures.
  • a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface.
  • An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture.
  • An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture.
  • a geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture.
  • Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format.
  • Terms texture (component) image and texture (component) picture may be used interchangeably.
  • Terms geometry (component) image and geometry (component) picture may be used interchangeably.
  • a specific type of a geometry image is a depth image.
  • Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image.
  • Terms attribute image and attribute picture may be used interchangeably.
  • a geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding.
  • Figures 2a and 2b illustrate an overview of exemplified compression/decompression processes.
  • the processes may be applied, for example, in MPEG visual volumetric video-based coding (V3C), defined currently in ISO/IEC DIS 23090-5: “Visual Volumetric Video-based Coding and Video-based Point Cloud Compression”, 2nd Edition.
  • V3C MPEG visual volumetric video-based coding
  • ISO/IEC DIS 23090-5 “Visual Volumetric Video-based Coding and Video-based Point Cloud Compression”
  • Visual volumetric video, i.e. a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
  • The V3C specification enables the encoding and decoding of a variety of volumetric media by using video and image coding technologies. This is achieved by first converting such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C components, before coding such information.
  • V3C components may include occupancy, geometry, and attribute components.
  • the occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation.
  • the geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g. texture or material information, of such 3D data.
  • An example of volumetric media conversion at an encoder is shown in Figure 2a and an example of 3D reconstruction at a decoder is shown in Figure 2b.
  • Additional information that allows associating all these subcomponents, and that enables the inverse reconstruction from a 2D representation back to a 3D representation, is also included in a special component, referred to in this document as the atlas.
  • An atlas consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding box associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.
  • Atlases are partitioned into patch packing blocks of equal size.
  • the 2D bounding boxes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices.
  • Figure 3 shows an example of block-to-patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey.
  • Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
  • Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located at the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
  • Figure 4a shows an example of a single patch packed onto an atlas image.
  • This patch is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes.
  • the projection plane is equal to the sides of an axis-aligned 3D bounding box, as shown in Figure 4b.
  • the location of the bounding box in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding the offsets TilePatch3dOffsetU, TilePatch3dOffsetV, and TilePatch3dOffsetD, as illustrated in Figure 4c.
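  • A minimal sketch of this local-to-model conversion is shown below; the axis permutation depends on the patch's projection plane, and for illustration a patch projected along the Z axis is assumed, so that (U, V, D) map directly to (X, Y, Z):

```python
def patch_to_model(u, v, d, offset_u, offset_v, offset_d):
    # TilePatch3dOffsetU/V/D translate local patch coordinates (u, v, d)
    # into the 3D model coordinate system; a real reconstruction also
    # applies the axis permutation implied by the projection plane.
    return (offset_u + u, offset_v + v, offset_d + d)
```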
  • the generic mechanism of V3C may be used by applications targeting volumetric content.
  • One such application is MPEG immersive video (MIV; defined in ISO/IEC 23090-12).
  • MIV enables volumetric video coding for applications in which a scene is recorded with multiple RGB(D) (red, green, blue, and optionally depth) cameras with overlapping fields of view (FoVs).
  • RGB(D) red, green, blue, and optionally depth
  • FoVs fields of view
  • One example setup is a linear array of cameras pointing towards a scene. This multi-scopic view of the scene allows a 3D reconstruction and therefore 6DoF/3DoF+ consumption.
  • MIV uses the patch data unit concept from V3C and extends it by using camera views for reprojection.
  • Coded V3C video components are referred to in this document as video bitstreams, while an atlas component is referred to as the atlas bitstream.
  • Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
  • V3C patch information is contained in the atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units.
  • A NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes.
  • a NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical, except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
  • NAL units in the atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units.
  • ACL atlas coding layer
  • non-ACL non-atlas coding layer
  • nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5.
  • nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies.
  • the value of nal_layer_id shall be in the range of 0 to 62, inclusive.
  • the value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.
  • the visual volumetric video-based coding (V3C; ISO/IEC DIS 23090-5) as described above specifies a generic syntax and mechanism for volumetric video coding.
  • the generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames.
  • the purpose of the specification is to define how to decode and interpret the associated data (atlas data in ISO/IEC 23090-5), which tells a renderer how to interpret 2D frames for reconstructing volumetric frames.
  • In addition to the two applications of V3C (ISO/IEC 23090-5), i.e. V-PCC (ISO/IEC 23090-5) and MIV (ISO/IEC 23090-12), the MPEG 3DG (ISO SC29 WG7) group has started work on a third application, i.e. V3C mesh compression.
  • a polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling.
  • the faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes.
  • Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons and surfaces with the following definitions:
  • - Vertex: A position in 3D space defined as (x, y, z), along with other information such as color (r, g, b), normal vector and texture coordinates.
  • - Edge: A connection between two vertices.
  • - Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges.
  • - Polygon: A coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically, a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.
  • - Surfaces: or smoothing groups, are useful, but not required, to group smooth regions.
  • - Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
  • - UV coordinates: Most mesh formats also support some form of UV coordinates, which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map to apply to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
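  • The element types listed above map naturally onto a small data structure; the following is one illustrative triangle-mesh layout, not a normative format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Vertex:
    position: Tuple[float, float, float]                 # (x, y, z)
    color: Optional[Tuple[int, int, int]] = None         # optional (r, g, b)
    normal: Optional[Tuple[float, float, float]] = None  # optional normal vector

@dataclass
class TriangleMesh:
    vertices: List[Vertex] = field(default_factory=list)
    uvs: List[Tuple[float, float]] = field(default_factory=list)
    # Each face holds three (vertex index, uv index) corner pairs, so
    # several vertices can share one UV coordinate and vice versa.
    faces: List[Tuple[Tuple[int, int], Tuple[int, int], Tuple[int, int]]] = field(default_factory=list)
```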
  • Figures 5a and 5b show the extensions to the V-PCC encoder and decoder to support mesh encoding and mesh decoding, respectively, as proposed in MPEG M47608.
  • the input mesh data is demultiplexed into vertex coordinate+attributes data and vertex connectivity data.
  • the vertex coordinate+attributes data is coded using MPEG-I V-PCC, whereas the vertex connectivity data is coded as auxiliary data. Both of said data are multiplexed to create the final compressed output bitstream.
  • Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal vertex connectivity encoding.
  • the input bitstream is demultiplexed to generate the compressed bitstreams for vertex coordinates+attributes data and vertex connectivity data.
  • the vertex coordinates+attributes data is decompressed using MPEG-I V-PCC decoder.
  • Vertex ordering is carried out on the reconstructed vertex coordinates at the output of the MPEG-I V-PCC decoder to match the vertex order at the encoder.
  • the vertex connectivity data is also decompressed, and everything is multiplexed to generate the reconstructed mesh.
  • Alternatively, mesh data may be compressed directly, without projecting it into 2D planes as in V-PCC based mesh coding.
  • V-PCC mesh compression is intended to utilize an off-the-shelf mesh compression technology called Draco (https://google.github.io/draco/) for compressing mesh data excluding textures.
  • Draco is used to compress vertex positions in 3D, connectivity data (faces) as well as UV coordinates. Additional per-vertex attributes may be also compressed using Draco.
  • the actual UV texture may be compressed using traditional video compression technologies, such as H.265 or H.264.
  • Draco uses an edgebreaker algorithm at its core to compress 3D mesh information. It offers a good balance between simplicity and efficiency, and it is part of Khronos endorsed extensions for the glTF (Graphics Language Transmission Format) specification.
  • the main idea of the algorithm is to traverse mesh triangles in a deterministic way so that each new triangle is encoded next to an already encoded triangle. This enables prediction of vertex-specific information from the previously encoded data by simply adding a delta to the previous data.
  • the edgebreaker utilizes symbols to signal how each new triangle is connected to the previously encoded part of the mesh. Connecting triangles in such a way results in an average of 1 - 2 bits per triangle when combined with existing binary encoding techniques.
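  • The delta-prediction idea can be illustrated with the sketch below; note that Draco's actual edgebreaker traversal emits connectivity symbols and uses more elaborate predictors (e.g. parallelogram prediction), whereas this sketch only chains per-vertex residuals in traversal order:

```python
import numpy as np

def encode_deltas(positions: np.ndarray) -> np.ndarray:
    # Predict each vertex from its predecessor in traversal order;
    # the residuals are small and entropy-code well.
    predictions = np.vstack([np.zeros(3), positions[:-1]])
    return positions - predictions

def decode_deltas(deltas: np.ndarray) -> np.ndarray:
    # Inverting the prediction chain is a running sum.
    return np.cumsum(deltas, axis=0)
```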
  • Using Draco for geometric compression as intra-compression for every frame, as in the ISO/IEC mesh compression development, means that geometry compression consists of intra frames only, without temporal inter-prediction. However, there is typically a significant amount of visual overlap between volumetric frames, which could be exploited to further improve geometry compression.
  • the input mesh can be pre-processed so that vertices and vertex related attributes, such as UV-coordinates and topology, remain temporally predictable or stable over a period of frames.
  • Figure 6 shows an example of temporal correlation of two volumetric frames, where frame 1 is shown in brighter color and frame 2 in darker color.
  • The illustration of Figure 6 shows that vertices typically do not move much between subsequent frames. Also, the connectivity remains largely the same.
  • Draco does not benefit from temporal dependency and overlap of geometric information between frames, but rather the same data is re-sent with each frame.
  • the method comprises obtaining (700) a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining (702) a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing (704) the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining (706), among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining (708) a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent
  • the method enables identifying temporally stable vertices between volumetric media frames, and it provides temporal prediction capabilities of mesh information between subsequent volumetric media frames. Accordingly, for encoding a mesh whose vertices have been identified as being temporally stable across a group of volumetric media frames, it is sufficient to encode only the first volumetric media frame in said group of volumetric media frames with full data (as an I-frame), whereas for the subsequent volumetric media frames, only the one or more differences in the set of vertex and mesh attributes per each vertex of the mesh may be encoded.
  • the inter frames comprise substantially merely information about the differences in the set of vertex and mesh attributes per each vertex of the mesh, thereby ignoring the data redundant to the intra frame and significantly reducing the bitrate of the encoded bitstream.
  • the decoder may then, upon decoding the mesh across the group of volumetric media frames, use the one or more differences in the set of vertex and mesh attributes as temporal inter-prediction from said first volumetric media frame.
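  • A minimal, self-contained sketch of this grouping behaviour is given below; frames are modelled as dictionaries mapping a stable vertex index to a position, groups are detected by comparing vertex index sets only (the real analysis also compares faces and other attributes), and a real encoder would compress the payloads, e.g. with Draco:

```python
def encode_sequence(frames):
    bitstream, i = [], 0
    while i < len(frames):
        j = i + 1                                    # extend the temporally stable group
        while j < len(frames) and frames[j].keys() == frames[i].keys():
            j += 1
        bitstream.append(("I", dict(frames[i])))     # intra frame: full data
        for k in range(i + 1, j):                    # inter frames: per-vertex differences
            deltas = {v: tuple(c - p for c, p in zip(frames[k][v], frames[k - 1][v]))
                      for v in frames[k]}
            bitstream.append(("P", deltas))
        i = j
    return bitstream
```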
  • the set of vertex and mesh attributes of a mesh comprises one or more of the following: vertex positions, indices, faces, UV coordinates, additional attributes.
  • mesh information may include vertex positions, i.e.
  • the method comprises quantizing the set of vertex and mesh attributes for each volumetric media frame before determining the difference of the at least one attribute.
  • the attributes may be quantized.
  • The terms “delta” or “quantized delta” may be used interchangeably to describe mesh information, i.e. vertex and mesh attributes, that may be updated per frame.
  • For the vertex position, this may mean signaling the difference in spatial position from the previous frame.
  • For UV coordinates, this may mean signaling the difference from the UV coordinates of the previous frame.
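  • Such quantized-delta signaling can be sketched as follows (the step size is illustrative):

```python
import numpy as np

def quantized_delta(prev_frame: np.ndarray, cur_frame: np.ndarray, step: float):
    # Quantize both frames, then carry only the integer difference.
    q_prev = np.round(prev_frame / step).astype(np.int64)
    q_cur = np.round(cur_frame / step).astype(np.int64)
    return q_cur - q_prev                  # typically near zero for stable vertices

def apply_delta(q_prev: np.ndarray, delta: np.ndarray, step: float):
    # Decoder-side reconstruction back to the quantized value grid.
    return (q_prev + delta) * step
```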
  • the method comprises encoding a unique index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute.
  • the unique index per vertex provides a simple means for signaling and indicating the temporally stable vertices within said group of volumetric media frames to the decoder, thus enabling the same vertex to be identified on multiple volumetric frames.
  • the method comprises encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with Draco-encoding; and including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
  • V3C Visual Volumetric Video-based Coding
  • the method comprises encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with an encoding different from Draco-encoding; and including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
  • the inter frames (P-frames) may be encoded with the same Draco-encoder as the intra frames (I-frames), thereby simplifying the encoding implementation.
  • An apparatus suitable for implementing the method comprises means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the associated set of vertex
  • Figure 8 shows an exemplified block chart of an encoder implemented in such an apparatus.
  • the apparatus receives multiple volumetric video frames (800).
  • the input may consist of any type of 3D information temporally dividable into frames.
  • texture data (802) may be extracted from the 3D content and be encoded as frames of an attribute V3C video component (804) in V3C bitstream.
  • the encoder side process that is responsible for finding stable vertices between volumetric frames is performed by analysis unit (806).
  • the analysis unit carries out a spatio-temporal mesh analysis process, which analyses multiple frames of input 3D data and identifies how vertex attributes evolve over time.
  • Vertex attributes may include information, such as position, associated UV coordinates (or one or more indices to such), color data or similar information, that can be associated with a vertex. All vertex attributes may be considered as input for finding best candidates in vertex indexing. In addition, analyzing the vertex connectivity is preferable for finding the best candidate. Typically, best matching vertices share the mesh topology between frames.
  • I-frames (808) and P-frames (810) for a mesh can be identified.
  • I-frames contain full temporal updates for vertex attributes and connectivity information. These may be used in a similar manner as video encoded I-frames.
  • P-frames contain a quantized update to the previous frame describing how the vertex attribute evolves over the temporal period. Changes related to mesh connectivity may also be carried out in P-frames when new vertices appear, old ones become obsolete or if there is another reason for updating the mesh topology.
  • the apparatus comprises means for encoding information of obsolete, new or updated mesh entities by adding attributes for describing mesh entity status.
  • the information of obsolete or new mesh entities such as obsolete/new vertex, UV coordinates or face, may be used in the process of encoding a common index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute.
  • the following algorithm may be used for assigning temporally stable indices for vertices:
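  • One possible sketch of such an assignment matches each vertex to the nearest vertex of the previous frame and reuses its index when the distance is within a tolerance; the matching criterion, the tolerance and the omission of connectivity checks are simplifications for illustration, and a production matcher would also enforce a one-to-one assignment:

```python
import numpy as np

def assign_stable_indices(prev_pos, prev_idx, cur_pos, next_free, tol=1e-3):
    cur_idx = np.empty(len(cur_pos), dtype=np.int64)
    for i, p in enumerate(cur_pos):
        if len(prev_pos):
            d = np.linalg.norm(prev_pos - p, axis=1)
            j = int(np.argmin(d))
            if d[j] <= tol:
                cur_idx[i] = prev_idx[j]   # temporally stable: reuse the index
                continue
        cur_idx[i] = next_free             # new vertex: allocate a fresh index
        next_free += 1
    return cur_idx, next_free
```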
  • mesh I-frames (808) and P-frames (810) may be further compressed using a mesh encoder (812).
  • the above-described temporally stable vertex index may be added as an additional attribute of a vertex, wherein I-frames are encoded with this additional attribute using a mesh compression algorithm, e.g. Draco, along with a full instance of the mesh information, which generates a compressed mesh intra frame (814).
  • P-frames contain quantized differences for vertex attributes from previous frames along with updates to mesh topology.
  • the P-frames may be encoded using a mesh compression algorithm, such as Draco, thereby resulting in compressed mesh delta frames (816).
  • the compressed mesh intra frames (814) and the compressed mesh delta frames (816), along with the possible attribute V3C video components (804), are encoded with a V3C encoder (818) into a V3C bitstream (820).
  • a signaling (822) is included for informing the decoder about the encoded mesh information.
  • Table 1 shows an example of the syntax for mesh information carrying non-compressed I-frames and P-frames.
  • mi_num_vertices indicates the number of vertices in the mesh. The semantics of the value are dependent on the type of frame. For I-frames, the syntax specifies the total number of vertices in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete vertices. It is possible to signal mi_num_vertices as 0 to indicate that the information from the previous frame is still valid with no changes.
  • mi_vertex_status indicates the vertex type as specified in Table 2 (Table 2: an example of mi_vertex_status types).
  • mi_vertex_position_x, mi_vertex_position_y and mi_vertex_position_z indicate information related to the vertex 3D position, and their semantics are dependent on the type of frame. I-frames contain full 3D positions for vertices, whereas P-frames contain the difference of each 3D component to the previous frame.
  • mi_vertex_index indicates a unique index for the vertex so that a vertex with the same mi_vertex_index in all frames belonging to the same group of frames can be considered the same instance of the vertex at different times. For an I-frame, mi_vertex_index may be implicitly assigned as i to avoid signaling indices for I-frames.
  • mi_num_uv_coordinates indicates the number of UV coordinates in the mesh. The semantics of the value are dependent on the type of frame. For I-frames, the syntax specifies the total number of UV coordinates in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete UV coordinates. It is possible to signal mi_num_uv_coordinates as 0 to indicate that the information from the previous frame is still valid with no changes.
  • mi_uv_status indicates the UV coordinate type as specified in Table 4.
  • mi_coordinate_u and mi_coordinate_v indicate information related to UV coordinates, and their semantics are dependent on the type of frame. I-frames contain full U and V coordinates, whereas P-frames contain the difference of the U and V coordinates to the previous frame.
  • mi_uv_index indicates a unique index for the UV coordinate so that a UV coordinate with the same mi_uv_index in all frames belonging to the same group of frames can be considered the same instance of the UV coordinate at different times. For an I-frame, mi_uv_index may be implicitly assigned as i to avoid signaling indices for I-frames.
  • vertex dependent attributes may be added in the syntax either by adding fields in the vertex loop or tying the information based on vertex indices, like UV coordinates.
  • UV coordinates are added as their own array so that the UV coordinates may be reused by multiple vertices.
  • normals for the vertices could be added in the semantics like this.
  • mi_num_faces indicates the number of faces in the mesh. The semantics of the value are dependent on the type of frame. For I-frames, the syntax specifies the total number of faces in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete faces. It is possible to signal mi_num_faces as 0 to indicate that the information from the previous frame is still valid with no changes.
  • mi_face_status indicates the face type as specified in Table 2.
  • mi_face_vertex_index indicates the vertex index (mi_vertex_index) for the vertex that describes the corner of the polygon.
  • mi_face_uv_coordinate_index indicates the UV coordinate index (mi_uv_index) for the UV coordinate that describes the corner of the polygon.
  • additional face related information may be added in the syntax by adding fields in the face loop.
  • material index could be assigned like this per face.
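  • Putting the vertex, UV coordinate and face semantics above together, a decoder-side parse of the mesh information could be sketched as follows (non-normative; the read() helper, the field-order details and the triangle assumption are illustrative):

```python
def parse_mesh_information(read, is_p_frame: bool):
    mesh = {"vertices": [], "uvs": [], "faces": []}
    for _ in range(read("mi_num_vertices")):
        v = {}
        if is_p_frame:
            v["status"] = read("mi_vertex_status")    # new / updated / obsolete
            v["index"] = read("mi_vertex_index")      # temporally stable index
        v["pos"] = (read("mi_vertex_position_x"),     # I-frame: absolute, P-frame: delta
                    read("mi_vertex_position_y"),
                    read("mi_vertex_position_z"))
        mesh["vertices"].append(v)
    for _ in range(read("mi_num_uv_coordinates")):
        uv = {}
        if is_p_frame:
            uv["status"] = read("mi_uv_status")
            uv["index"] = read("mi_uv_index")
        uv["coord"] = (read("mi_coordinate_u"), read("mi_coordinate_v"))
        mesh["uvs"].append(uv)
    for _ in range(read("mi_num_faces")):
        f = {"corners": []}
        if is_p_frame:
            f["status"] = read("mi_face_status")
        for _ in range(3):                            # assuming triangle faces
            f["corners"].append((read("mi_face_vertex_index"),
                                 read("mi_face_uv_coordinate_index")))
        mesh["faces"].append(f)
    return mesh
```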
  • quantized updates for vertex attributes may be added as an additional attribute in Draco so that mesh information for the entire group of frames (GOP) could be compressed using Draco.
  • Mesh topology from the first frame of the group could be re-used for the rest of the frames in the group.
  • each group of compressed frames (GOP) may be labeled as an I-frame, considering that it contains full information on the mesh structure over multiple frames.
  • In this case, the vertex and UV coordinate indexing is explicit, and separate explicit indices (mi_vertex_index, mi_uv_index) are not required for either.
  • an additional attribute detailing the duration of validity for the vertex may be signaled.
  • the duration indicates how many frames said vertex is valid before becoming obsolete.
  • the duration attribute may be multiplexed with another attribute, such as the status. Value 0 could be used to signal that the vertex is valid until the end of the GOP. Accordingly, separate signaling in a P-frame of vertices that become obsolete is avoided.
  • an aggregate mesh for the whole duration of the GOP may be defined, which will contain all the vertices and the related connectivity that will appear during the GOP.
  • the aggregate mesh may signal the start of validity of each vertex, as well as connectivity information.
  • upon receiving said additional attribute, the decoder is capable of reconstructing the correct mesh for each frame (I or P) for rendering.
  • the approach enables obtaining an aggregated mesh containing all the information for the duration of the GOP, which can likewise be encoded with Draco. Accordingly, there is no need to signal per-P-frame changes.
  • attributes that may change during the GOP, such as a texture index or per-vertex 3D position information, may be addressed with an aggregate, variable-size attribute or by adding the necessary number of duplicated vertices with the same connectivity as the vertex that is duplicated. The sketch below illustrates per-frame extraction from such an aggregate mesh.
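  • combining the start-of-validity and duration attributes, a decoder can slice the per-frame mesh out of the aggregate mesh, as in the following sketch; the record layout is hypothetical, and only the validity rule comes from the description:

```python
def mesh_at_frame(aggregate, frame_idx, gop_size):
    """aggregate: iterable of (vertex, start_frame, duration) tuples, where
    vertex carries position, connectivity and other attributes. Reuses
    vertex_validity() from the earlier sketch."""
    return [vertex for vertex, start, duration in aggregate
            if frame_idx in vertex_validity(start, duration, gop_size)]
```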
  • the mesh information is configured to be carried in the V3C bitstream in one of the following containers:
  • V3C_CMD, a new V3C unit configured to store mesh information.
  • NAL_I_CMD and NAL_P_CMD, new NAL units configured to store mesh information inside the atlas sub-bitstream.
  • SEI_CMD, a new SEI message configured to store the mesh information.
  • one or more raw byte sequence payload (RBSP) patch modes configured to store mesh data inside V3C NAL units.
  • any of the above containers may be used for carrying the mesh information in the V3C bitstream.
  • the following embodiments focus on introducing new RBSP patch modes to define the specifics that are required for mesh signaling. Similar concepts for signaling could be implemented using other embodiments.
  • the following embodiments focus on describing which type of syntax and semantics additions should be made to ISO/IEC FDIS 23090-5 to support the new functionality.
  • the apparatus comprises means for including an indication about the mesh compression technique used to compress the intra frames and inter frames in or along said bitstream.
  • an extension to the V3C parameter set in ISO/IEC FDIS 23090-5 may be added. Table 3 shows an example of a V3C parameter set extension.
  • vps_compressed_mesh_extension_present_flag equal to 1 specifies that the vps_compressed_mesh_extension( ) syntax structure is present in the v3c_parameter_set( ) syntax structure.
  • vps_compressed_mesh_extension_present_flag equal to 0 specifies that this syntax structure is not present.
  • when not present, the value of vps_compressed_mesh_extension_present_flag is inferred to be equal to 0.
  • the V3C bitstream shall not contain geometry or occupancy related V3C units.
  • the bitstream shall only contain V3C units with vuh_unit_type V3C_AD and V3C_AVD.
  • in another embodiment, when vps_compressed_mesh_extension is present, the V3C bitstream shall not contain occupancy related V3C units.
  • the bitstream shall only contain V3C units with vuh_unit_type V3C_AD, V3C_AVD and V3C_GVD.
  • V3C units containing geometry data shall be interpreted as bump maps describing displacement values for the mesh surface.
  • the compressed mesh extension syntax structure is defined in Table 4. It is noted that the compressed mesh extension may also be stored in other parameter sets of the V3C bitstream. Especially, if it is necessary to adjust settings of the extension during the sequence, an extension to the Atlas Sequence Parameter Set (ASPS), Atlas Frame Parameter Set (AFPS) or Atlas Adaptation Parameter Set (AAPS) may be used.
  • ASPS Atlas Sequence Parameter Set
  • AFPS Atlas Frame Parameter Set
  • AAPS Atlas Adaptation Parameter Set
  • vps_i_frame_compression_scheme identifies the compression used to compress mesh information for I-frames.
  • the values of vps_i_frame_compression_scheme can be mapped to mesh compression techniques according to Table 5.
  • vps_p_frame_compression_scheme identifies the compression used to compress mesh information for P-frames.
  • the values of vps_p_frame_compression_scheme can be mapped to mesh compression techniques according to Table 5.
  • vps_num_corners_of_polygon indicates the type of polygons that form the mesh. Typically, the polygons are either triangles or quads, i.e. the value is 3 or 4.
  • vps_cmdu_byte_length_descriptor indicates the length in bytes of a field in the compressed mesh data unit, preceding the compressed mesh data, used to signal the number of bytes in the compressed mesh data unit.
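  • to make these field semantics concrete, the following sketch parses a vps_compressed_mesh_extension() containing the four fields above; the 8-bit field widths and the minimal bit reader are assumptions for illustration, as the actual widths are given by Table 4:

```python
class BitReader:
    """Minimal MSB-first bit reader over a bytes object (illustrative)."""
    def __init__(self, data):
        self.data, self.pos = data, 0

    def read_bits(self, n):
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            value = (value << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return value

def parse_vps_compressed_mesh_extension(reader):
    # 8-bit field widths are assumed here, not normative
    return {
        "vps_i_frame_compression_scheme": reader.read_bits(8),   # Table 5 mapping
        "vps_p_frame_compression_scheme": reader.read_bits(8),   # Table 5 mapping
        "vps_num_corners_of_polygon": reader.read_bits(8),       # 3 = triangles, 4 = quads
        "vps_cmdu_byte_length_descriptor": reader.read_bits(8),  # CMDU length field size
    }
```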
  • a new patch mode, which may be referred to as atdu_patch_mode, is introduced for both I_TILE and P_TILE atlas tiles.
  • the I_TILE patch mode for the compressed mesh data could be I_CMD and the P_TILE patch mode P_CMD.
  • Table 6 shows an example of how the new patch mode could be added to the I_TILE atlas tile types.
  • Table 7 illustrates an example of carrying patches of type I_CMD and P_CMD as included in the patch information data description.
  • Table 7 An example of patch information data structure.
  • compressed_mesh_data_unit(tileID, patchIdx) shall contain a compressed bitstream according to the information in vps_compressed_mesh_extension().
  • compressed_mesh_data_unit() with patchMode equal to I_CMD shall be compressed using vps_i_frame_compression_scheme and compressed_mesh_data_unit() with patchMode equal to P_CMD shall be compressed using vps_p_frame_compression_scheme.
  • the syntax for compressed_mesh_data_unit() is the same regardless of the patchMode and is defined in Table 8; only the semantics of the decompressed data change, as described in Table 1.
  • Table 8 An example of compressed mesh data unit structure.
  • the first value in the compressed mesh data unit provides the length of the data unit in bytes, excluding the size of num_bytes_in_data_unit.
  • the size of num_bytes_in_data_unit may be indicated in a parameter set; for example, in the VPS the information could be signaled as vps_cmdu_byte_length_descriptor. This information could also be signaled in another parameter set, such as the ASPS or AFPS.
  • the cmdu_byte[] array contains a compressed mesh information bitstream that needs to be decoded using the signaled mesh compression technology to retrieve the non-compressed mesh I-frames and P-frames. After decompression with the correct decoder, per-frame values for mesh information may be retrieved as defined in Table 1.
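  • under these semantics, a compressed_mesh_data_unit() can be walked as a length-prefixed blob, as sketched below; the big-endian interpretation of the length field is an assumption, while the field size follows vps_cmdu_byte_length_descriptor as described above:

```python
def parse_compressed_mesh_data_unit(buf, offset, length_field_size):
    """Return (cmdu_bytes, next_offset) for one compressed mesh data unit.
    The first field gives the payload length in bytes, excluding itself."""
    n = int.from_bytes(buf[offset:offset + length_field_size], "big")  # endianness assumed
    start = offset + length_field_size
    return buf[start:start + n], start + n
```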
  • FIG. 9 shows an example of a decoding method comprising receiving (900) a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receiving (902), either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decoding (904), from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decoding (906) vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decoding (908) said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and rendering said at least one mesh into a reconstructed 3D volumetric media frame.
  • the decoder receives the V3C bitstream comprising the V3C-encoded mesh intra frames and mesh delta frames, as well as the signaling related to the encoded mesh information.
  • the decoder decodes the V3C-encoded mesh intra frames and mesh delta frames based on the signaling and reconstructs the mesh information from the mesh intra frames and delta-coded inter frames.
  • the decoder reconstructs a 3D object using the derived mesh information and the attribute information in the V3C bitstream.
  • FIG. 10 shows an exemplified block chart of such an apparatus illustrating the decoder operations related to temporally predicted mesh decoding.
  • the receiver receives the V3C bitstream (1000), which comprises the V3C-encoded mesh intra frames and mesh delta frames, along with the possible attribute V3C video components.
  • the receiver further receives, from or along said V3C bitstream, signaling related to the encoded mesh information (1002).
  • a V3C bitstream parser (1004) extracts the possible attribute V3C video component information (1006) as defined in ISO/IEC 23090-5.
  • Compressed mesh I-frame(s) (1008) and compressed mesh P-frame(s) (1010) are identified based on signaling received from or along the bitstream, wherein the signaling indicates which type(s) of mesh decoder(s) (1012) should be used to decode I-frames and P-frames, respectively.
  • uncompressed I-frames (1014) and P-frames (1016) are retrieved and per-frame mesh information may be reconstructed, for example as defined in Table 1.
  • the mesh information is utilized by a volumetric renderer (1018) for rendering and displaying the decoded object or scene to the user.
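  • the overall flow of FIG. 10 can be summarized by the sketch below; the decoder lookup table is a placeholder selected per the signaled compression schemes, and apply_vertex_records() refers to the earlier illustrative sketch:

```python
def reconstruct_gop(i_frame_payload, p_frame_payloads, decoders):
    """decoders: {'I': callable, 'P': callable}, chosen according to
    vps_i_frame_compression_scheme / vps_p_frame_compression_scheme.
    Returns the reconstructed per-frame vertex maps for one GOP."""
    vertices = decoders['I'](i_frame_payload)      # full mesh for the intra frame
    frames = [vertices]
    for payload in p_frame_payloads:
        records = decoders['P'](payload)           # delta records for this P-frame
        vertices = apply_vertex_records(vertices, records)
        frames.append(vertices)
    return frames
```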
  • the embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and means for encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
  • the embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtain a set of vertex and mesh attributes from the plurality of volumetric media frames; compare the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determine, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determine a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encode the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encode, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
  • the embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; means for receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; means for decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; means for decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; means for decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and means for rendering said at least one mesh into a reconstructed 3D volumetric media frame.
  • the embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receive, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decode, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decode vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decode said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and render said at least one mesh into a reconstructed 3D volumetric media frame.
  • Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures 1a, 1b, 2a and 2b for implementing the embodiments.
  • said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP).
  • MPD Media Presentation Description
  • SDP IETF Session Description Protocol
  • said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
  • the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method comprising: obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.

Description

AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR
VOLUMETRIC VIDEO
TECHNICAL FIELD
[0001] The present invention relates to an apparatus, a method and a computer program for volumetric video coding.
BACKGROUND
[0002] Visual volumetric video-based coding (V3C; defined in ISO/IEC DIS 23090-5) provides a generic syntax and mechanism for volumetric video coding. The generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames. The purpose of the specification is to define how to decode and interpret the associated data (atlas data in ISO/IEC 23090-5) which tells a renderer how to interpret 2D frames for reconstructing volumetric frames.
[0003] The current definition of V3C (ISO/IEC 23090-5) comprises two applications, i.e. video-based point cloud compression (V-PCC; defined in ISO/IEC 23090-5) and MPEG immersive video (MIV; defined in ISO/IEC 23090-12). Moreover, MPEG 3DG (ISO SC29 WG7) group has started a work on a third volumetric video coding application, i.e. V3C mesh compression.
[0004] In V3C mesh compression polygon meshes represented by vertices, edges, faces, polygons and surfaces are encoded by compressing e.g. vertex positions in 3D, connectivity data (faces), UV coordinates, as well as additional per-vertex attributes. The mesh compression technologies intended to be used in further development of V3C mesh compression produce a geometry compression, which consists of only intra frames without temporal inter-prediction. However, there is typically a significant amount of visual overlap between volumetric frames, which could be exploited to further improve geometry compression. Therefore, there is a need to find an enhanced solution to enable improved temporal geometry compression.
SUMMARY
[0005] Now, an improved method and technical equipment implementing the method has been invented, by which the above problems are alleviated. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program, or a signal stored therein, which are characterized by what is stated in the independent claims. Various details of the embodiments are disclosed in the dependent claims and in the corresponding images and description.
[0006] The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
[0007] According to a first aspect, there is provided a method comprising: obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
[0008] An apparatus according to a second aspect comprises: means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and means for encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
[0009] According to an embodiment, the set of vertex and mesh attributes of a mesh comprises one or more of the following: vertex positions, indices, faces, UV coordinates, additional attributes.
[0010] According to an embodiment, the apparatus comprises: means for quantizing the set of vertex and mesh attributes for each volumetric media frame before determining the difference of the at least one attribute.
[0011] According to an embodiment, the apparatus comprises: means for encoding a unique index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute.
[0012] According to an embodiment, the apparatus comprises: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
[0013] According to an embodiment, the apparatus comprises: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with an encoding different from Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
[0014] According to an embodiment, the apparatus comprises: means for encoding information of obsolete, new or updated mesh entities by adding attributes for describing mesh entity status.
[0015] According to an embodiment, the mesh information is configured to be carried in a V3C bitstream in one of the following containers: a V3C unit configured to store mesh information; a NAL unit configured to store mesh information inside the atlas sub-bitstream; a SEI message configured to store the mesh information; or one or more raw byte sequence payload (RBSP) patch modes configured to store mesh data inside V3C NAL units.
[0016] According to an embodiment, the apparatus comprises: means for including an indication about the mesh compression technique used to compress the intra frames and inter frames in or along said bitstream.
[0017] According to an embodiment, said indication about the mesh compression technique is configured to be carried by at least one syntax element included as an extension to the V3C parameter set syntax structure.
[0018] An apparatus according to a third aspect comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtain a set of vertex and mesh attributes from the plurality of volumetric media frames; compare the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determine, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determine a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encode the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encode, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
[0019] A method according to a fourth aspect comprises: receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and rendering said at least one mesh into a reconstructed 3D volumetric media frame.
[0020] An apparatus according to a fifth aspect comprises: means for receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; means for receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; means for decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; means for decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; means for decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and means for rendering said at least one mesh into a reconstructed 3D volumetric media frame.
[0021] An apparatus according to a sixth aspect comprises: at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receive, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decode, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decode vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decode said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and render said at least one mesh into a reconstructed 3D volumetric media frame.
[0022] Computer readable storage media according to further aspects comprise code for use by an apparatus, which when executed by a processor, causes the apparatus to perform the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] For a more complete understanding of the example embodiments, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
[0024] Figs. 1a and 1b show an encoder and decoder for encoding and decoding 2D pictures; [0025] Figs. 2a and 2b show a compression and a decompression process for 3D volumetric video;
[0026] Fig. 3 shows an example of block-to-patch mapping with 4 projected patches onto an atlas;
[0027] Figs. 4a - 4c show an illustrative example of a patch projection into 2D domain for atlas data;
[0028] Figs. 5a and 5b show extensions to the V-PCC encoder and decoder to support mesh encoding and mesh decoding;
[0029] Fig. 6 shows an example illustrating the temporal correlation between two temporally adjacent meshes;
[0030] Fig. 7 shows a flow chart for an encoding method according to an embodiment;
[0031] Fig. 8 shows an exemplified block chart of an apparatus according to an embodiment;
[0032] Fig. 9 shows a flow chart for a decoding method according to an embodiment; and [0033] Fig. 10 shows an exemplified block chart of an apparatus according to an embodiment.
DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS
[0034] In the following, several embodiments of the invention will be described in the context of point cloud models for volumetric video coding. It is to be noted, however, that the invention is not limited to specific scene models or specific coding technologies. In fact, the different embodiments have applications in any environment where coding of volumetric scene data is required.
[0035] A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate).
[0036] Volumetric video may be captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.
[0037] Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example. [0038] Volumetric video data represents a three-dimensional scene or object, and thus such data can be viewed from any viewpoint. Volumetric video data can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, ...), together with any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, etc. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
[0039] Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications, such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth map as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.
[0040] In 3D point clouds, each point of each 3D surface is described as a 3D point with color and/or other attribute information such as surface normal or material reflectance. A point cloud is a set of data points in a coordinate system, for example in a three-dimensional coordinate system being defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the screen space, e.g. in a three-dimensional space.
[0041] In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression of the presentations becomes fundamental. Standard volumetric video representation formats, such as point clouds, meshes, voxel, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporal successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.
[0042] Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more, geometries. These geometries may be “unfolded” or packed onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information may be transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).
[0043] Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency can be increased greatly. Using geometry-projections instead of 2D-video based approaches based on multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/ decompression of the projected planes. The projection and the reverse projection steps are of low complexity.
[0044] Figs. 1a and 1b show an encoder and decoder for encoding and decoding the 2D texture pictures, geometry pictures and/or auxiliary pictures. A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards and/or loses some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). An example of an encoding process is illustrated in Figure 1a. Figure 1a illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
[0045] An example of a decoding process is illustrated in Figure 1b. Figure 1b illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F). [0046] Many hybrid video encoders encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate). Video codecs may also provide a transform skip mode, which the encoders may choose to use. In the transform skip mode, the prediction error is coded in a sample domain, for example by deriving a sample-wise difference value relative to certain adjacent samples and coding the sample-wise difference value with an entropy coder.
[0047] Many video encoders partition a picture into blocks along a block grid. For example, in the High Efficiency Video Coding (HEVC) standard, the following partitioning and definitions are used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs.
[0048] In HEVC, a picture can be partitioned in tiles, which are rectangular and contain an integer number of LCUs. In HEVC, the partitioning to tiles forms a regular grid, where heights and widths of tiles differ from each other by one LCU at the maximum. In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is a current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be a part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.
[0049] Entropy coding/decoding may be performed in many ways. For example, context-based coding/decoding may be applied, wherein both the encoder and the decoder modify the context state of a coding parameter based on previously coded/decoded coding parameters. Context-based coding may for example be context adaptive binary arithmetic coding (CABAC) or context-adaptive variable length coding (CAVLC) or any similar entropy coding. Entropy coding/decoding may alternatively or additionally be performed using a variable length coding scheme, such as Huffman coding/decoding or Exp-Golomb coding/decoding. Decoding of coding parameters from an entropy-coded bitstream or codewords may be referred to as parsing.
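As V3C syntax structures commonly rely on such variable length codes, a short illustration of unsigned Exp-Golomb (ue(v)) decoding may be helpful; the bit-list input format below is chosen purely for clarity and is not part of any specification:

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb codeword from a list of 0/1 bits.
    Returns (value, next_position)."""
    leading_zeros = 0
    while bits[pos + leading_zeros] == 0:
        leading_zeros += 1
    pos += leading_zeros + 1            # skip the zeros and the marker '1' bit
    suffix = 0
    for _ in range(leading_zeros):
        suffix = (suffix << 1) | bits[pos]
        pos += 1
    return (1 << leading_zeros) - 1 + suffix, pos
```

For example, decode_ue([0, 1, 1]) returns (2, 3), since the codeword '011' maps to codeNum 2.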
[0050] The phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signaling, or storage in a manner that the out- of-band data is associated with the bitstream. The phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signaling, or storage) that is associated with the bitstream. For example, an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
[0051 ] A first texture picture may be encoded into a bitstream, and the first texture picture may comprise a first projection of texture data of a first source volume of a scene model onto a first projection surface. The scene model may comprise a number of further source volumes.
[0052] In the projection, data on the position of the originating geometry primitive may also be determined, and based on this determination, a geometry picture may be formed. This may happen for example so that depth data is determined for each or some of the texture pixels of the texture picture. Depth data is formed such that the distance from the originating geometry primitive such as a point to the projection surface is determined for the pixels. Such depth data may be represented as a depth picture, and similarly to the texture picture, such geometry picture (such as a depth picture) may be encoded and decoded with a video codec. This first geometry picture may be seen to represent a mapping of the first projection surface to the first source volume, and the decoder may use this information to determine the location of geometry primitives in the model to be reconstructed. In order to determine the position of the first source volume and/or the first projection surface and/or the first projection in the scene model, there may be first geometry information encoded into or along the bitstream. It is noted that encoding a geometry (or depth) picture into or along the bitstream with the texture picture is only optional and arbitrary for example in the cases where the distance of all texture pixels to the projection surface is the same or there is no change in said distance between a plurality of texture pictures. Thus, a geometry (or depth) picture may be encoded into or along the bitstream with the texture picture, for example, only when there is a change in the distance of texture pixels to the projection surface.
[0053] An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture. An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture. A geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture. [0054] Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format.
[0055] Terms texture (component) image and texture (component) picture may be used interchangeably. Terms geometry (component) image and geometry (component) picture may be used interchangeably. A specific type of a geometry image is a depth image. Embodiments described in relation to a geometry (component) image equally apply to a depth (component) image, and embodiments described in relation to a depth (component) image equally apply to a geometry (component) image. Terms attribute image and attribute picture may be used interchangeably. A geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding.
[0056] Figures 2a and 2b illustrate an overview of exemplified compression/decompression processes. The processes may be applied, for example, in MPEG visual volumetric video-based coding (V3C), defined currently in ISO/IEC DIS 23090-5: “Visual Volumetric Video-based Coding and Video-based Point Cloud Compression”, 2nd Edition. [0057] Visual volumetric video, a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
[0058] V3C specification enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g. texture or material information, of such 3D data. An example of volumetric media conversion at an encoder is shown in Figure 2a and an example of a 3D reconstruction at a decoder is shown in Figure 2b. [0059] Additional information that allows associating all these subcomponents and enables the inverse reconstruction, from a 2D representation back to a 3D representation is also included in a special component, referred to in this document as the atlas. An atlas consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding box associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.
[0060] Atlases are partitioned into patch packing blocks of equal size. The 2D bounding boxes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices. Figure 3 shows an example of block-to-patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
[0061] Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
[0062] Figure 4a shows an example of a single patch packed onto an atlas image. This patch is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes. For an orthographic projection, the projection plane is equal to the sides of an axis-aligned 3D bounding box, as shown in Figure 4b. The location of the bounding box in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3dOffsetV, and TilePatch3dOffsetD, as illustrated in Figure 4c. [0063] The generic mechanism of V3C may be used by applications targeting volumetric content. One of such applications is MPEG immersive video (MIV; defined in ISO/IEC 23090-12).
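Returning to the patch reconstruction of Figures 4a-4c, the fragment below maps a local patch sample (u, v, d) to model coordinates using the three offsets. This is a deliberately simplified sketch assuming an identity axis mapping; real V3C reconstruction additionally applies per-patch axis swaps and orientations, which are omitted here:

```python
def patch_to_model(u, v, d, offset_u, offset_v, offset_d):
    """Map local 3D patch coordinates (U, V, D) to model coordinates,
    assuming an identity axis mapping (illustrative simplification)."""
    x = offset_u + u   # TilePatch3dOffsetU
    y = offset_v + v   # TilePatch3dOffsetV
    z = offset_d + d   # TilePatch3dOffsetD
    return x, y, z
```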
[0064] MIV enables volumetric video coding for applications in which a scene is recorded with multiple RGB(D) (red, green, blue, and optionally depth) cameras with overlapping fields of view (FoVs). One example setup is a linear array of cameras pointing towards a scene. This multi-scopic view of the scene allows a 3D reconstruction and therefore 6DoF/3DoF+ consumption.
[0065] MIV uses the patch data unit concept from V3C and extends it by using camera views for reprojection.
[0066] Coded V3C video components are referred to in this document as video bitstreams, while an atlas component is referred to as the atlas bitstream. Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
[0067] V3C patch information is contained in atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units. A NAL unit is specified to format data and provides header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
[0068] NAL units in atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units. The former are dedicated to carrying patch data, while the latter carry data necessary to properly parse the ACL units or any additional auxiliary data.
[0069] In the nal_unit_header() syntax, nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.
[0070] Thus, the visual volumetric video-based coding (V3C; ISO/IEC DIS 23090-5) as described above specifies a generic syntax and mechanism for volumetric video coding. The generic syntax can be used by applications targeting volumetric content, such as point clouds, immersive video with depth, and mesh representations of volumetric frames. The purpose of the specification is to define how to decode and interpret the associated data (atlas data in ISO/IEC 23090-5) which tells a renderer how to interpret 2D frames for reconstructing volumetric frames.
[0071] In addition to the two applications of V3C (ISO/IEC 23090-5), i.e. V-PCC (ISO/IEC 23090-5) and MIV (ISO/IEC 23090-12), the MPEG 3DG (ISO SC29 WG7) group has started work on a third application, i.e. V3C mesh compression.
[0072] A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes.
[0073] Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons and surfaces with the following definitions:
- Vertex: A position in 3D space defined as (x,y,z) along with other information such as color (r,g,b), normal vector and texture coordinates.
- Edge: A connection between two vertices.
- Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges. A polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically, a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.
- Surfaces: also called smoothing groups; useful, but not required, for grouping smooth regions.
- Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
- Materials: defined to allow different portions of the mesh to use different shaders when rendered.
- UV coordinates: Most mesh formats also support some form of UV coordinates which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map to apply to different polygons of the mesh. It is also possible for meshes to contain other such vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
[0074] Figures 5a and 5b show the extensions to the V-PCC encoder and decoder to support mesh encoding and mesh decoding, respectively, as proposed in MPEG M47608. [0075] In the encoder extension, the input mesh data is demultiplexed into vertex coordinate+attributes data and vertex connectivity data. The vertex coordinate+attributes data is coded using MPEG-I V-PCC, whereas the vertex connectivity data is coded as auxiliary data. Both of said data are multiplexed to create the final compressed output bitstream. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal vertex connectivity encoding. [0076] In the decoder, the input bitstream is demultiplexed to generate the compressed bitstreams for vertex coordinates+attributes data and vertex connectivity data. The vertex coordinates+attributes data is decompressed using the MPEG-I V-PCC decoder. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of the MPEG-I V-PCC decoder to match the vertex order at the encoder. The vertex connectivity data is also decompressed, and everything is multiplexed to generate the reconstructed mesh.
[0077] Thus, mesh data may be compressed directly, without projecting it into 2D planes as in V-PCC based mesh coding. In fact, the further developments of V-PCC mesh compression are intended to utilize an off-the-shelf mesh compression technology called Draco (https://google.github.io/draco/) for compressing mesh data excluding textures. Draco is used to compress vertex positions in 3D, connectivity data (faces) as well as UV coordinates. Additional per-vertex attributes may also be compressed using Draco. The actual UV texture may be compressed using traditional video compression technologies, such as H.265 or H.264.
[0078] Draco uses an edgebreaker algorithm at its core to compress 3D mesh information. It offers a good balance between simplicity and efficiency, and it is part of Khronos endorsed extensions for the glTF (Graphics Language Transmission Format) specification. The main idea of the algorithm is to traverse mesh triangles in a deterministic way so that each new triangle is encoded next to an already encoded triangle. This enables prediction of vertex specific information from the previously encoded data by simply adding delta to the previous data. The edgebreaker utilizes symbols to signal how each new triangle is connected to the previously encoded part of the mesh. Connecting triangles in such a way results in an average of 1 - 2 bits per triangle when combined with existing binary encoding techniques.
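As a simplified illustration of this delta-based prediction, the sketch below stores only the residual of each vertex position relative to the previously encoded one along the traversal order. The traversal itself and the symbol coding of edgebreaker are omitted, so this is a highly simplified stand-in, not Draco's actual scheme.

```python
def delta_encode_positions(positions):
    """Delta-code vertex positions along a given traversal order.

    positions: list of (x, y, z) tuples in encoding order. The first
    vertex is stored verbatim; every later vertex is stored as the
    residual to its predecessor, mimicking (in simplified form) the
    neighbour-based prediction used by edgebreaker-style coders.
    """
    coded = [positions[0]]
    for prev, cur in zip(positions, positions[1:]):
        coded.append(tuple(c - p for c, p in zip(cur, prev)))
    return coded


def delta_decode_positions(coded):
    """Inverse of delta_encode_positions."""
    positions = [coded[0]]
    for delta in coded[1:]:
        positions.append(tuple(p + d for p, d in zip(positions[-1], delta)))
    return positions
```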
[0079] Using Draco for geometric compression as intra-compression for every frame in ISO/IEC mesh compression development means that geometry compression consists of only intra frames without temporal inter-prediction. However, there is typically a significant amount of visual overlap between volumetric frames, which could be exploited to further improve geometry compression. The input mesh can be pre-processed so that vertices and vertex related attributes, such as UV-coordinates and topology, remain temporally predictable or stable over a period of frames.
[0080] Figure 6 shows an example of temporal correlation of two volumetric frames, where frame 1 is shown in brighter color and frame 2 in darker color. The illustration of Figure 6 clarifies how vertices do not typically move much between subsequent frames. Also, the connectivity remains largely the same.
[0081] Accordingly, the approach of Draco does not benefit from the temporal dependency and overlap of geometric information between frames; rather, the same data is re-sent with each frame.
[0082] In the following, an enhanced method for enabling improved temporal geometry compression will be described in more detail, in accordance with various embodiments. [0083] The method, which is disclosed in Figure 7, comprises obtaining (700) a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining (702) a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing (704) the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining (706), among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining (708) a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encoding (710) the first volumetric media frame and the associated set of vertex and mesh attributes into a bitstream; and encoding (712), for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames into said bitstream.
[0084] Thus, the method makes it possible to identify temporally stable vertices between volumetric media frames, and it provides temporal prediction capabilities for mesh information between subsequent volumetric media frames. Accordingly, for encoding a mesh whose vertices have been identified as being temporally stable across a group of volumetric media frames, it is sufficient to encode only the first volumetric media frame in said group of volumetric media frames with full data (as an I-frame), whereas for the subsequent volumetric media frames, only the one or more differences in the set of vertex and mesh attributes per each vertex of the mesh may be encoded. Accordingly, the inter frames (P-frames) comprise substantially merely information about the differences in the set of vertex and mesh attributes per each vertex of the mesh, thereby ignoring the data redundant to the intra frame and significantly reducing the bitrate of the encoded bitstream. The decoder may then, upon decoding the mesh across the group of volumetric media frames, use the one or more differences in the set of vertex and mesh attributes as temporal inter-prediction from said first volumetric media frame. [0085] According to an embodiment, the set of vertex and mesh attributes of a mesh comprises one or more of the following: vertex positions, indices, faces, UV coordinates, additional attributes. Thus, mesh information may include vertex positions, i.e. 3D values describing the position of a vertex in 3D space, UV coordinates describing how to map texture information onto the mesh surface, and connectivity data describing how vertices form faces onto which texture information may be mapped. Additional attributes, such as normals, may be added using indexing or by assigning information directly per vertex or face. [0086] According to an embodiment, the method comprises quantizing the set of vertex and mesh attributes for each volumetric media frame before determining the difference of the at least one attribute. Hence, for facilitating the determining of the difference of the at least one attribute and encoding the difference, the attributes may be quantized.
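A minimal sketch of this I-frame/P-frame encoding loop follows. The frame fields (attrs, stable_vertex_ids) and the two codec callables are illustrative assumptions, not part of any specification.

```python
def subtract(a, b):
    # component-wise attribute difference (e.g. 3D position or UV coordinate)
    return tuple(x - y for x, y in zip(a, b))


def encode_group(frames, encode_mesh, encode_delta):
    """Sketch of the encoding flow of Figure 7 for one stable group.

    frames       -- volumetric media frames of one group (hypothetical
                    objects with .attrs: vertex id -> attribute tuple and
                    .stable_vertex_ids: ids stable across the group)
    encode_mesh  -- callable producing a full intra-coded mesh payload
    encode_delta -- callable producing a delta-coded inter payload
    """
    payloads = [encode_mesh(frames[0])]          # first frame is the I-frame
    prev = frames[0]
    for frame in frames[1:]:                     # remaining frames are P-frames
        deltas = {v: subtract(frame.attrs[v], prev.attrs[v])
                  for v in frame.stable_vertex_ids}
        payloads.append(encode_delta(deltas))
        prev = frame
    return payloads
```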
[0087] It is noted that the terms "difference of the at least one attribute", "delta" and "quantized delta" may be used interchangeably to describe mesh information, i.e. vertex and mesh attributes, that may be updated per frame. For example, for a vertex position, this may mean signaling the difference in spatial position from the previous frame. For UV coordinates, this may mean signaling the difference between the UV coordinates of the current and the previous frame.
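For instance, a uniform quantizer over such deltas could look as follows; the step size is a stand-in for whatever quantization parameters an encoder would actually signal.

```python
def quantize_delta(delta, step=0.001):
    """Uniformly quantize a per-vertex attribute difference.

    step is an illustrative quantization step; a real encoder would derive
    it from signalled quantization parameters and the attribute's range.
    """
    return tuple(round(d / step) for d in delta)


def dequantize_delta(qdelta, step=0.001):
    """Inverse of quantize_delta (up to quantization error)."""
    return tuple(q * step for q in qdelta)


# Example: a vertex that moved 3 mm along x between two frames:
# quantize_delta((0.003, 0.0, 0.0)) -> (3, 0, 0)
```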
[0088] According to an embodiment, the method comprises encoding a unique index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute. Thus, the unique index per vertex provides a simple means for signaling and indicating the temporally stable vertices within said group of volumetric media frames to the decoder, thus enabling the same vertex to be identified on multiple volumetric frames.
[0089] According to an embodiment, the method comprises encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with Draco-encoding; and including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream. [0090] According to an embodiment, the method comprises encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with an encoding different from Draco-encoding; and including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
[0091] Thus, the inter frames (P-frames), more precisely the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh therein, may be encoded with the same Draco encoder as the intra frames (I-frames), thereby simplifying the encoding implementation.
[0092] On the other hand, due to the different nature of the data included in the inter frames, another compression technology may provide better compression efficiency. Therefore, for example, symbol-based compression technologies, such as Huffman compression or zlib compression, may be used for inter frames (P-frames). Huffman and zlib are especially well suited for the compression of quantized deltas as described herein, where the vertex attributes tend to change as groups. Accordingly, the generation of symbols for similar transformations offers significant gains. How these transformations typically take place is visualized in the previously referred Figure 6. For example, the hand of the model is moving along a motion vector that is shared by all vertices of the hand, thus the quantized delta of the per-vertex attribute transformations is also shared. This gives a good indication of the compression gains for symbol-based compression technologies.
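The point can be made concrete with Python's standard zlib module: identical per-vertex deltas produce long repeated byte runs, which a dictionary coder compresses very well. The 32-bit little-endian packing is an arbitrary illustrative choice.

```python
import struct
import zlib

def compress_deltas(quantized_deltas):
    """Pack quantized (dx, dy, dz) deltas and compress them with zlib.

    Vertices moving as a group (e.g. all vertices of a hand sharing one
    motion vector) yield long runs of identical 12-byte records, which
    dictionary coders such as zlib exploit.
    """
    payload = b"".join(struct.pack("<3i", *d) for d in quantized_deltas)
    return zlib.compress(payload)

def decompress_deltas(blob, count):
    """Recover the list of (dx, dy, dz) tuples written by compress_deltas."""
    payload = zlib.decompress(blob)
    return [struct.unpack_from("<3i", payload, 12 * i) for i in range(count)]

# 1000 vertices sharing one motion vector compress to a few dozen bytes:
# len(compress_deltas([(3, 0, -1)] * 1000)) is far below the 12000-byte input.
```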
[0093] An apparatus suitable for implementing the method comprises means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes into a bitstream; and means for encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames into said bitstream.
[0094] Figure 8 shows an exemplified block chart of an encoder implemented in such an apparatus. In the example of Figure 8, the apparatus receives multiple volumetric video frames (800). The input may consist of any type of 3D information temporally dividable into frames. As part of the encoder, texture data (802) may be extracted from the 3D content and be encoded as frames of an attribute V3C video component (804) in V3C bitstream. The encoder side process that is responsible for finding stable vertices between volumetric frames is performed by analysis unit (806). The analysis unit carries out a spatio-temporal mesh analysis process, which analyses multiple frames of input 3D data and identifies how vertex attributes evolve over time. Vertex attributes may include information, such as position, associated UV coordinates (or one or more indices to such), color data or similar information, that can be associated with a vertex. All vertex attributes may be considered as input for finding best candidates in vertex indexing. In addition, analyzing the vertex connectivity is preferable for finding the best candidate. Typically, best matching vertices share the mesh topology between frames.
[0095] As an output of the spatio-temporal mesh analysis (806), I-frames (808) and P-frames (810) for a mesh can be identified. I-frames contain full temporal updates for vertex attributes and connectivity information. These may be used in a similar manner as video encoded I-frames. For vertex attributes, P-frames contain a quantized update to the previous frame describing how the vertex attribute evolves over the temporal period. Changes related to mesh connectivity may also be carried out in P-frames when new vertices appear, old ones become obsolete or if there is another reason for updating the mesh topology.
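Illustrative containers for this analysis output might look like the following; every field name here is an assumption made for the sketches in this description, not a normative structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MeshIFrame:
    """Full temporal update: complete vertex data and connectivity."""
    positions: Dict[int, Tuple[float, float, float]]  # stable id -> 3D position
    uv: Dict[int, Tuple[float, float]]                # uv index -> UV coordinate
    faces: List[Tuple[int, int, int]]                 # triangles of vertex ids

@dataclass
class MeshPFrame:
    """Quantized update to the previous frame, plus topology changes."""
    position_deltas: Dict[int, Tuple[int, int, int]]  # stable id -> quantized delta
    uv_deltas: Dict[int, Tuple[int, int]] = field(default_factory=dict)
    new_vertices: Dict[int, Tuple[float, float, float]] = field(default_factory=dict)
    obsolete_vertices: List[int] = field(default_factory=list)
    face_updates: List[Tuple[int, int, int]] = field(default_factory=list)
```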
[0096] According to an embodiment, the apparatus comprises means for encoding information of obsolete, new or updated mesh entities by adding attributes for describing the mesh entity status. The information of obsolete or new mesh entities, such as an obsolete/new vertex, UV coordinates or face, may be used in the process of encoding a common index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute. As an example, the following algorithm may be used for assigning temporally stable indices to vertices (a code sketch is given after the outline):
• For each frame x in the group of frames:
  o For every vertex i in frame x:
    ■ If vertex i was not assigned an index in the previous iteration of the loop:
      • Assign vertex i as “new”.
    ■ Find the n closest neighboring vertices in 3D space from frame x+1.
    ■ Compare the connectivity topology of the n candidates and vertex i.
    ■ Make a weighted decision balancing similar topology and 3D distance between vertices.
    ■ Order the n candidates according to the best match for vertex i.
  o For every vertex i in frame x:
    ■ Select the best unique candidate from the n candidates in frame x+1 so that no other vertex in frame x connects to the same candidate from frame x+1.
    ■ If a best candidate is found:
      • Assign the candidate vertex from frame x+1 the same index as vertex i.
      • Calculate the quantized delta for the vertex i attributes by comparing the attributes between it and the best candidate from frame x+1.
    ■ Else (a best candidate is not found):
      • Assign the vertex as “obsolete” for frame x+1.
  o Update the topology for frame x:
    ■ Update the mesh topology to reflect new vertices and ones that became obsolete.
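A minimal sketch of this matching loop follows, assuming positions and adjacency are given as plain dictionaries; the weighting of topology versus distance and the degree-based topology proxy are illustrative simplifications, not the normative procedure.

```python
import math

def match_vertices(pos_x, pos_x1, adj_x, adj_x1, n=4, w_dist=0.7, w_topo=0.3):
    """Greedy index propagation from frame x to frame x+1.

    pos_x, pos_x1 -- vertex id -> (x, y, z) for frames x and x+1
    adj_x, adj_x1 -- vertex id -> set of neighbouring vertex ids
    Returns (assigned, obsolete): matched id pairs and unmatched ids.
    """
    def cost(a, b):
        # weighted balance of 3D distance and a crude topology proxy
        # (difference in vertex degree stands in for topology comparison)
        return (w_dist * math.dist(pos_x[a], pos_x1[b])
                + w_topo * abs(len(adj_x[a]) - len(adj_x1[b])))

    candidates = {v: sorted(pos_x1, key=lambda c: cost(v, c))[:n] for v in pos_x}

    assigned, obsolete, taken = {}, [], set()
    for v, cands in candidates.items():
        best = next((c for c in cands if c not in taken), None)
        if best is None:
            obsolete.append(v)        # marked "obsolete" for frame x+1
        else:
            taken.add(best)
            assigned[v] = best        # candidate inherits v's stable index
    return assigned, obsolete
```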
[0097] By connecting vertices from frame x to only a single vertex in frame x+1, the scenario in which the mesh begins to collapse as temporal updates merge faces can be avoided. Information on how vertex attributes have evolved over several previous frames (x-1, x-2, etc.) may be used as additional information to find the most suitable candidate from frame x+1, instead of simply looking at vertex attributes from frame x.
[0098] Referring back to Figure 8, mesh I-frames (808) and P-frames (810) may be further compressed using a mesh encoder (812). The above-described temporally stable vertex index may be added as an additional attribute of a vertex, wherein I-frames are encoded with this additional attribute using a mesh compression algorithm, e.g. Draco, along with a full instance of the mesh information, which generates a compressed mesh intra frame (814). P-frames contain quantized differences for vertex attributes from previous frames, along with updates to the mesh topology. The P-frames may be encoded using a mesh compression algorithm, such as Draco, thereby resulting in compressed mesh delta frames (816).
[0099] The compressed mesh intra frames (814) and the compressed mesh delta frames (816), along with the possible attribute V3C video components (804), are encoded with a V3C encoder (818) into a V3C bitstream (820). In or along said V3C bitstream (820), a signaling (822) is included for informing the decoder about the encoded mesh information. [0100] Table 1 shows an example of the syntax for mesh information carrying non-compressed I-frames and P-frames.
Table 1. (An ISO/IEC 23090-5 example)
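The syntax table itself is reproduced only as an image in this extraction, so the following Python parsing sketch is a hypothetical reconstruction assembled from the field semantics in paragraphs [0101]-[0113]; the bitstream reader interface (r.ue, r.se, r.u), the status code value, and the exact field coding are all assumptions.

```python
V3C_OBSOLETE = 2  # placeholder code; the actual status values are listed in Table 2

def parse_mesh_information(r, is_p_frame, num_corners=3):
    """Hypothetical reconstruction of the mesh information payload of Table 1.

    r           -- assumed bitstream reader: r.ue() unsigned Exp-Golomb,
                   r.se() signed Exp-Golomb, r.u(n) n-bit unsigned
    num_corners -- vps_num_corners_of_polygon (3 for triangles)
    """
    mesh = {"vertices": [], "uvs": [], "faces": []}
    for i in range(r.ue()):                    # mi_num_vertices
        status = r.u(2)                        # mi_vertex_status (Table 2)
        idx = r.ue() if is_p_frame else i      # mi_vertex_index, implicit for I-frames
        if status != V3C_OBSOLETE:
            # full position for I-frames, per-component delta for P-frames
            pos = (r.se(), r.se(), r.se())     # mi_vertex_position_x/_y/_z
            mesh["vertices"].append((idx, status, pos))
        else:
            mesh["vertices"].append((idx, status, None))
    for i in range(r.ue()):                    # mi_num_uv_coordinates
        status = r.u(2)                        # mi_uv_status
        idx = r.ue() if is_p_frame else i      # mi_uv_index, implicit for I-frames
        uv = (r.se(), r.se()) if status != V3C_OBSOLETE else None
        mesh["uvs"].append((idx, status, uv))  # mi_coordinate_u / mi_coordinate_v
    for _ in range(r.ue()):                    # mi_num_faces
        status = r.u(2)                        # mi_face_status
        corners = [(r.ue(), r.ue())            # (mi_face_vertex_index,
                   for _ in range(num_corners)]  # mi_face_uv_coordinate_index)
        mesh["faces"].append((status, corners))
    return mesh
```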
[0101] In the above syntax, mi_num_vertices indicates the number of vertices in the mesh. The semantics of the value is dependent on the type of frame. For I-frames, the syntax specifies the total number of vertices in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete vertices. It is possible to signal mi_num_vertices as 0 to indicate that the information from the previous frame is still valid with no changes.
[0102] mi_vertex_status indicates the vertex type as specified in Table 2.
Table 2. An example of mi_vertex_status types.
[0103] mi_vertex_position_x, mi_vertex_position_y, mi_vertex_position_z indicate information related to the vertex 3D position and are dependent on the type of frame. I-frames contain full 3D positions for vertices, whereas P-frames contain the difference of each 3D component to the previous frame. [0104] mi_vertex_index indicates a unique index for the vertex so that a vertex with the same mi_vertex_index in all frames belonging to the same group of frames can be considered as the same instance of the vertex at different times. For an I-frame, mi_vertex_index may be implicitly assigned as i to avoid signaling indices for I-frames.
[0105] mi_num_uv_coordinates indicates the number of UV coordinates in the mesh. The semantics of the value is dependent on the type of frame. For I-frames, the syntax specifies the total number of UV coordinates in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete UV coordinates. It is possible to signal mi_num_uv_coordinates as 0 to indicate that the information from the previous frame is still valid with no changes.
[0106] mi_uv_status indicates the UV coordinate type as specified in Table 2.
[0107] mi_coordinate_u, mi_coordinate_v indicate information related to UV coordinates and are dependent on the type of frame. I-frames contain full U and V coordinates, whereas P-frames contain the difference of the U and V coordinates to the previous frame.
[0108] mi_uv_index indicates a unique index for the UV coordinate so that a UV coordinate with the same mi_uv_index in all frames belonging to the same group of frames can be considered as the same instance of the UV coordinate at different times. mi_uv_index for an I-frame may be implicitly assigned as i to avoid signaling indices for I-frames.
[0109] It is noted that vertex dependent attributes may be added in the syntax either by adding fields in the vertex loop or by tying the information to vertex indices, like UV coordinates. In this example, UV coordinates are added as their own array so that the UV coordinates may be reused by multiple vertices. Typically, normals for the vertices could be added in the semantics like this.
[0110] mi_num_faces indicates the number of faces in the mesh. The semantics of the value is dependent on the type of frame. For I-frames, the syntax specifies the total number of faces in the mesh for the given frame. For P-frames, the syntax specifies the number of new, updated, or obsolete faces. It is possible to signal mi_num_faces as 0 to indicate that the information from the previous frame is still valid with no changes.
[0111] mi_face_status indicates the face type as specified in Table 2. [0112] mi_face_vertex_index indicates the vertex index (mi_vertex_index) for the vertex that describes the corner of the polygon.
[0113] mi_face_uv_coordinate_index indicates the UV coordinate index (mi_uv_index) for the UV coordinate that describes the corner of the polygon.
[0114] It is noted that additional face related information may be added in the syntax by adding fields in the face loop. For example, a material index could be assigned like this per face. Moreover, additional per vertex, UV coordinate, face or payload level flags and syntax optimizations may be considered, for example, in a similar manner as the mi_vertex_status != V3C_OBSOLETE condition in the structure of Table 1.
[0115] According to an embodiment, quantized updates for vertex attributes, such as position, may be added as an additional attribute in Draco so that mesh information for the entire group of frames (GOP) could be compressed using Draco. The mesh topology from the first frame of the group could be re-used for the rest of the frames in the group. In this example, each group of compressed frames (GOP) may be labeled as an I-frame, considering that it contains full information on the mesh structure over multiple frames. In this case the vertex and UV coordinate indexing is explicit, and separate explicit indices (mi_vertex_index, mi_uv_index) are not required for either.
[0116] According to an embodiment, when a new vertex appears, an additional attribute detailing the duration of validity for the vertex may be signaled. The duration indicates for how many frames said vertex is valid before becoming obsolete. The duration attribute may be multiplexed with another attribute, such as the status. The value 0 could be used to signal that the vertex is valid until the end of the GOP. Accordingly, separate signaling in a P-frame of vertices that become obsolete is avoided.
[0117] According to an embodiment related to the forward signaling of the duration of the validity of a vertex, an aggregate mesh for the whole duration of the GOP may be defined, which will contain all the vertices and the related connectivity that will appear during the GOP. In addition to the duration of validity, the aggregate mesh may signal the start of validity of the vertex, as well as connectivity information.
[0118] Upon receiving said additional attribute, the decoder is capable of reconstructing the correct mesh for each frame (I or P) for the rendering. The approach makes it possible to obtain an aggregated mesh containing all the information for the duration of the GOP, which can be encoded with Draco as well. Accordingly, there is no need for signaling per P-frame changes. Possibly changing attributes, such as a texture index or 3D position information per vertex, during the GOP may be addressed with an aggregate, variable size attribute or via adding a necessary number of duplicated vertices with the same connectivity as the vertex that is duplicated.
[0119] According to an embodiment, the mesh information is configured to be carried in a V3C bitstream in one of the following containers:
- a new V3C unit (referred to as V3C_CMD) configured to store mesh information;
- new NAL units (referred to as NAL_I_CMD and NAL_P_CMD) configured to store mesh information inside the atlas sub-bitstream;
- a new SEI message (referred to as SEI_CMD) configured to store the mesh information; or
- new raw byte sequence payload (RBSP) patch modes (referred to as P_CMD and I_CMD) configured to store mesh data inside V3C NAL units.
[0120] While any of the above containers may be used for carrying the mesh information in a V3C bitstream, the following embodiments focus on introducing new RBSP patch modes to define the specifics that are required for mesh signaling. Similar concepts for signaling could be implemented using the other embodiments. The following embodiments focus on describing which type of syntax and semantics additions should be made to ISO/IEC FDIS 23090-5 to support the new functionality.
[0121] According to an embodiment, the apparatus comprises means for including an indication about the mesh compression technique used to compress the intra frames and inter frames in or along said bitstream. Thus, to indicate the mesh compression technique used to compress I-frames and P-frames to the decoder for performing the appropriate reverse operations, an extension to the V3C parameter set in ISO/IEC FDIS 23090-5 may be added. Table 3 shows an example of a V3C parameter set extension.
Table 3. An example of a V3C parameter set extension
[0122] vps_compressed_mesh_extension_present_flag equal to 1 specifies that the vps_compressed_mesh_extension( ) syntax structure is present in the v3c_parameter_set( ) syntax structure. vps_compressed_mesh_extension_present_flag equal to 0 specifies that this syntax structure is not present. When not present, the value of vps_compressed_mesh_extension_present_flag is inferred to be equal to 0.
[0123] According to an embodiment, when vps_compressed_mesh_extension is present, the V3C bitstream shall not contain geometry or occupancy related V3C units. The bitstream shall only contain V3C units with vuh_unit_type V3C_AD and V3C_AVD.
[0124] According to an embodiment, when vps_compressed_mesh_extension is present, the V3C bitstream shall not contain occupancy related V3C units. The bitstream shall only contain V3C units with vuh_unit_type V3C_AD, V3C_AVD and V3C_GVD. In this case, V3C units containing geometry data shall be interpreted as bump maps that describe displacement values for the mesh surface.
[0125] An example of the compressed mesh extension syntax structure is defined in Table 4. It is noted that the compressed mesh extension may also be stored in other parameter sets of the V3C bitstream. Especially if it is necessary to adjust settings of the extension during the sequence, an extension to the Atlas Sequence Parameter Set (ASPS), Atlas Frame Parameter Set (AFPS) or Atlas Adaptation Parameter Set (AAPS) may be used.
Table 4. An example of VPS compressed mesh extension
[0126] vps_i_frame_compression_scheme identifies the compression used to compress mesh information for I-frames. The values of vps_i_frame_compression_scheme can be mapped to mesh compression techniques according to Table 5.
Table 5. An example of mesh compression schemes
[0127] vps_p_frame_compression_scheme identifies the compression used to compress mesh information for P-frames. The values of vps_p_frame_compression_scheme can be mapped to mesh compression techniques according to Table 5. [0128] vps_num_corners_of_polygon indicates the type of polygons that form the mesh. Typically, the polygons are either triangles or quads, i.e. the value is 3 or 4.
[0129] vps_cmdu_byte_length_descriptor indicates the length in bytes of a field in the compressed mesh data unit, preceding the compressed mesh data, used to signal the number of bytes in the compressed mesh data unit.
[0130] According to an embodiment, to carry compressed mesh information in V3C atlas data, a new patch mode, which may be referred to as atdu_patch_mode, is introduced for both I_TILE and P_TILE atlas tiles. The I_TILE patch mode for the compressed mesh data could be I_CMD and the P_TILE patch mode P_CMD. Table 6 shows an example of how the new patch mode could be added to the I_TILE atlas tile types.
Table 6. An example of adding the new patch mode to I_TILE atlas tile types. (The table content is reproduced as an image in the original document.)
[0131] Table 7, in turn, illustrates an example of carrying patches of type I_CMD and P_CMD as included in the patch information data description.
Table 7. An example of patch information data structure.
[0132] The actual compressed_mesh_data_unit(tileID, patchIdx) shall contain a compressed bitstream according to the information in vps_compressed_mesh_extension(). compressed_mesh_data_unit() with patchMode equal to I_CMD shall be compressed using vps_i_frame_compression_scheme, and compressed_mesh_data_unit() with patchMode equal to P_CMD shall be compressed using vps_p_frame_compression_scheme. The syntax for compressed_mesh_data_unit() is the same regardless of the patchMode and is defined in Table 8; only the semantics of the decompressed data change, as described in Table 1.
Table 8. An example of compressed mesh data unit structure.
[0133] The first value in the compressed mesh data unit provides the length of the data unit in bytes, excluding the size of num_bytes_in_data_unit itself. The size of num_bytes_in_data_unit may be indicated in a parameter set; for example, in the VPS the information could be signaled as vps_cmdu_byte_length_descriptor. This information could also be signaled in another parameter set such as the ASPS or AFPS.
[0134] The cmdu_byte[] array contains a compressed mesh information bitstream that needs to be decoded using the signaled mesh compression technology to retrieve the non-compressed mesh I-frames and P-frames. After decompression with the correct decoder, the per frame values for the mesh information may be retrieved as defined in Table 1.
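As a sketch, extracting one such unit could look as follows; the helper name and the big-endian byte order are illustrative assumptions, with the field size taken from vps_cmdu_byte_length_descriptor.

```python
def parse_compressed_mesh_data_unit(data: bytes, length_field_size: int):
    """Split one compressed_mesh_data_unit() out of a byte buffer.

    length_field_size -- byte length of num_bytes_in_data_unit, as
                         signalled by vps_cmdu_byte_length_descriptor
    Returns (cmdu_bytes, remainder); cmdu_bytes still has to be decoded
    with the compression scheme signalled for I_CMD or P_CMD patches.
    """
    n = length_field_size
    num_bytes = int.from_bytes(data[:n], "big")  # excludes the length field itself
    return data[n:n + num_bytes], data[n + num_bytes:]
```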
[0135] Another aspect relates to the operation of a decoder. Figure 9 shows an example of a decoding method comprising receiving (900) a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receiving (902), either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decoding (904), from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decoding (906) vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decoding (908) said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and rendering (910) said at least one mesh into a reconstructed 3D volumetric media frame.
[0136] Thus, the decoder receives V3C bitstream comprising the V3C-encoded mesh intra frames and mesh delta frames, as well as the signaling related to the encoded mesh information. The decoder decodes the V3C-encoded mesh intra frames and mesh delta frames based on signaling and reconstructs the mesh information from the mesh intra frames and delta coded inter frames. Finally, the decoder reconstructs a 3D object using the derived mesh information and attribute information in V3C bitstream.
[0137] Figure 10 shows an exemplified block chart of such an apparatus illustrating the decoder operations related to temporally predicted mesh decoding. The receiver receives the V3C bitstream (1000), which comprises the V3C-encoded mesh intra frames and mesh delta frames, along with the possible attribute V3C video components. The receiver further receives, from or along said V3C bitstream, signaling related to the encoded mesh information (1002). A V3C bitstream parser (1004) extracts the possible attribute V3C video component information (1006) as defined in ISO/IEC 23090-5. Compressed mesh I-frame(s) (1008) and compressed mesh P-frame(s) (1010) are identified based on the signaling received from or along the bitstream, wherein the signaling indicates which type(s) of mesh decoder(s) (1012) should be used to decode the I-frames and P-frames, respectively. After decoding, uncompressed I-frames (1014) and P-frames (1016) are retrieved, and the per frame mesh information may be reconstructed, for example as defined in Table 1. Finally, the mesh information is utilized by a volumetric renderer (1018) for rendering and displaying the decoded object or scene to the user.
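Applying the decoded P-frame updates on top of the previously reconstructed frame could then look as follows; the containers match the illustrative MeshPFrame sketch above, and the quantization step must mirror the encoder-side assumption.

```python
def reconstruct_frame(prev_positions, prev_faces, p_frame, step=0.001):
    """Temporal inter-prediction on the decoder side (illustrative sketch).

    prev_positions -- stable vertex id -> 3D position of the previous frame
    prev_faces     -- connectivity of the previous frame
    p_frame        -- a decoded MeshPFrame as sketched earlier
    """
    positions = dict(prev_positions)
    for v_id, (qx, qy, qz) in p_frame.position_deltas.items():
        px, py, pz = positions[v_id]
        # dequantize the delta and add it to the prediction
        positions[v_id] = (px + qx * step, py + qy * step, pz + qz * step)
    positions.update(p_frame.new_vertices)      # vertices marked "new"
    for v_id in p_frame.obsolete_vertices:      # vertices marked "obsolete"
        positions.pop(v_id, None)
    faces = p_frame.face_updates or prev_faces  # topology update, if any
    return positions, faces
```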
[0138] The embodiments relating to the encoding aspects may be implemented in an apparatus comprising: means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and means for encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
[0139] The embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtain a set of vertex and mesh attributes from the plurality of volumetric media frames; compare the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determine, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determine a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encode the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encode, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
[0140] The embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; means for receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; means for decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; means for decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; means for decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and means for rendering said at least one mesh into a reconstructed 3D volumetric media frame. [0141] The embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receive, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decode, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decode vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decode said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and render said at least one mesh into a reconstructed 3D volumetric media frame.
[0142] Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures la, lb, 2a and 2b for implementing the embodiments.
[0143] In the above, some embodiments have been described with reference to encoding. It needs to be understood that said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP). Similarly, some embodiments have been described with reference to decoding. It needs to be understood that said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream. [0144] In the above, where the example embodiments have been described with reference to an encoder or an encoding method, it needs to be understood that the resulting bitstream and the decoder or the decoding method may have corresponding elements in them. Likewise, where the example embodiments have been described with reference to a decoder, it needs to be understood that the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder.
[0145] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0146] Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0147] Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
[0148] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended examples. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims

1. An apparatus comprising: means for obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; means for obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; means for comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; means for determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; means for determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; means for encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and means for encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
2. The apparatus according to claim 1, wherein the set of vertex and mesh attributes of a mesh comprises one or more of the following: vertex positions, indices, faces, UV coordinates, additional attributes.
3. The apparatus according to claim 1 or 2, comprising: means for quantizing the set of vertex and mesh attributes for each volumetric media frame before determining the difference of the at least one attribute.
4. The apparatus according to any preceding claim, comprising: means for encoding a unique index for each temporally stable vertex within said group of volumetric media frames as an additional vertex related attribute.
5. The apparatus according to any preceding claim, comprising: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
6. The apparatus according to any of claims 1 - 4, comprising: means for encoding the first volumetric media frame and the associated set of vertex and mesh attributes with Draco-encoding; means for encoding the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said inter frames with an encoding different from Draco-encoding; and means for including the encoded data into a Visual Volumetric Video-based Coding (V3C) bitstream.
7. The apparatus according to any preceding claim, comprising: means for encoding information of obsolete, new or updated mesh entities by adding attributes for describing mesh entity status.
8. The apparatus according to any preceding claim, wherein the mesh information is configured to be carried in a V3C bitstream in one of the following containers: - a V3C unit configured to store mesh information,
- a NAL unit configured to store mesh information inside atlas sub-bitstream,
- a SEI message configured to store the mesh information, or
- one or more raw byte sequence payload (RBSP) patch modes configured to store mesh data inside V3C NAL units.
9. The apparatus according to any preceding claim, comprising: means for including an indication about the mesh compression technique used to compress the intra frames and inter frames in or along said bitstream.
10. The apparatus according to claim 9, wherein said indication about the mesh compression technique is configured to be carried by at least one syntax element included as an extension to the V3C parameter set syntax structure.
11. A method comprising: obtaining a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtaining a set of vertex and mesh attributes from the plurality of volumetric media frames; comparing the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determining, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determining a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encoding the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encoding, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
12. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a plurality of volumetric media frames comprising three-dimensional (3D) data content encoded into 3D polygon meshes defined by a plurality of vertices; obtain a set of vertex and mesh attributes from the plurality of volumetric media frames; compare the set of vertex and mesh attributes across the plurality of volumetric media frames so as to identify temporally stable vertex indices and faces; determine, among a group of volumetric media frames identified to comprise temporally stable vertex indices and faces, a difference of at least one attribute in the set of vertex and mesh attributes per each vertex between the volumetric media frames in said group of volumetric media frames; determine a first volumetric media frame in said group of volumetric media frames as an intra frame for at least one mesh and one or more subsequent volumetric media frames in said group of volumetric media frames as inter frames for said at least one mesh; encode the first volumetric media frame and the related set of vertex and mesh attributes into a bitstream; and encode, for said at least one mesh, the difference of at least one attribute in the set of vertex and mesh attributes per each vertex of the mesh for said one or more subsequent volumetric media frames in said group of volumetric media frames into said bitstream.
13. A method comprising: receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and rendering said at least one mesh into a reconstructed 3D volumetric media frame.
14. An apparatus comprising: means for receiving a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; means for receiving, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; means for decoding, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; means for decoding vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; means for decoding said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and means for rendering said at least one mesh into a reconstructed 3D volumetric media frame.
15. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising encoded 3D volumetric media frames; receive, either in said bitstream or in a further bitstream, one or more signaling elements associated with encoded mesh information included in the bitstream comprising encoded 3D volumetric media frames; decode, from the mesh information, a set of vertex and mesh attributes associated with each volumetric media frame; decode vertices of at least one mesh in a first volumetric media frame in a group of volumetric media frames, based on the set of vertex and mesh attributes associated with the first volumetric media frame; decode said vertices of said at least one mesh in one or more subsequent volumetric media frames in the group of volumetric media frames by temporal prediction from the first volumetric media frame, based on the set of vertex and mesh attributes associated with said one or more subsequent volumetric media frames; and render said at least one mesh into a reconstructed 3D volumetric media frame.
PCT/FI2022/050473 2021-09-14 2022-06-28 An apparatus, a method and a computer program for volumetric video WO2023041838A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22869479.0A EP4402637A1 (en) 2021-09-14 2022-06-28 An apparatus, a method and a computer program for volumetric video

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20215967 2021-09-14
FI20215967 2021-09-14

Publications (1)

Publication Number Publication Date
WO2023041838A1

Family

ID=85602479

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2022/050473 WO2023041838A1 (en) 2021-09-14 2022-06-28 An apparatus, a method and a computer program for volumetric video

Country Status (2)

Country Link
EP (1) EP4402637A1 (en)
WO (1) WO2023041838A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927746B (en) * 2014-04-03 2017-02-15 北京工业大学 Registering and compression method of three-dimensional grid sequence
US20180096495A1 (en) * 2016-10-05 2018-04-05 HypeVR Geometry sequence encoder and decoder
WO2018208698A1 (en) * 2017-05-06 2018-11-15 Owlii Inc. Processing 3d video content
US20190251734A1 (en) * 2018-02-15 2019-08-15 JJK Holdings, LLC Dynamic local temporal-consistent textured mesh compression
WO2022069299A1 (en) * 2020-09-30 2022-04-07 Interdigital Ce Patent Holdings, Sas A method and an apparatus for encoding/decoding at least one attribute of an animated 3d object

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A. Shamir, V. Pascucci, C. Bajaj: "Multi-resolution dynamic meshes with arbitrary deformations", Proceedings Visualization 2000, Salt Lake City, UT, 8-13 October 2000, IEEE Computer Society, pp. 423-430, ISBN: 978-0-7803-6478-3, DOI: 10.1109/VISUAL.2000.885724 *
R. Amjoun, W. Straszer: "Single-rate near lossless compression of animated geometry", Computer-Aided Design, Elsevier, vol. 41, no. 10, October 2009, pp. 711-718, ISSN: 0010-4485, DOI: 10.1016/j.cad.2009.02.013 *
Jean-Eudes Marvie, Jean-Claude Chevet, Yannick Olivier, Julien Ricard, Pierre Andrivon (InterDigital): "[V-PCC][EE2.6-related] Proposition of an anchor and a test model for coding animated meshes", 132nd MPEG meeting, online, 12-16 October 2020 (Motion Picture Expert Group or ISO/IEC JTC1/SC29/WG11), 8 October 2020, XP030292838 *
Mohammadali Hajizadeh, Hossein Ebrahimnezhad: "Predictive compression of animated 3D models by optimized weighted blending of key-frames", Computer Animation and Virtual Worlds, John Wiley & Sons, vol. 27, no. 6, 26 November 2015, pp. 556-576, ISSN: 1546-4261, DOI: 10.1002/cav.1685 *
K. Müller, A. Smolic, M. Kautzner, P. Eisert, T. Wiegand: "Rate-distortion-optimized predictive compression of dynamic 3D mesh sequences", Signal Processing: Image Communication, Elsevier, vol. 21, no. 9, October 2006, pp. 812-828, ISSN: 0923-5965, DOI: 10.1016/j.image.2006.07.002 *

Also Published As

Publication number Publication date
EP4402637A1 (en) 2024-07-24

Similar Documents

Publication Publication Date Title
US12101457B2 (en) Apparatus, a method and a computer program for volumetric video
CN115443652B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
CN114946179B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
CN114930863A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
CN114930813A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
CN115398890B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
WO2019229293A1 (en) An apparatus, a method and a computer program for volumetric video
US20240137578A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
CN115918093A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
WO2023037040A1 (en) An apparatus, a method and a computer program for volumetric video
CN115380528A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
CN115769583A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
CN115804096A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
CN115668919A (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device, and point cloud data receiving method
EP4240014A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
US20230419557A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
CN115428442B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
US12047604B2 (en) Apparatus, a method and a computer program for volumetric video
WO2021170906A1 (en) An apparatus, a method and a computer program for volumetric video
WO2021191500A1 (en) An apparatus, a method and a computer program for volumetric video
WO2023144445A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20220351421A1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
EP3699867A1 (en) An apparatus, a method and a computer program for volumetric video
WO2021165566A1 (en) An apparatus, a method and a computer program for volumetric video
EP3987774A1 (en) An apparatus, a method and a computer program for volumetric video

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22869479; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase
    Ref document number: 2022869479; Country of ref document: EP
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2022869479; Country of ref document: EP; Effective date: 20240415