WO2023144445A1 - A method, an apparatus and a computer program product for video encoding and video decoding - Google Patents


Info

Publication number
WO2023144445A1
Authority
WO
WIPO (PCT)
Prior art keywords
patch
bitstream
mesh
texture
vertex
Prior art date
Application number
PCT/FI2023/050045
Other languages
French (fr)
Inventor
Sebastian Schwarz
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2023144445A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/001Model-based coding, e.g. wire frame
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Definitions

  • the present solution generally relates to encoding and decoding of volumetric video.
  • Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications.
  • AR Augmented Reality
  • VR Virtual Reality
  • MR Mixed Reality
  • Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video).
  • Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
  • Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR application, especially for providing 6DOF viewing capabilities.
  • 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight, and structured light are examples of devices that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used.
  • Dense Voxel arrays have been used to represent volumetric medical data.
  • polygonal meshes are extensively used.
  • Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
  • an apparatus for encoding comprising means for receiving a dynamic three-dimensional (3D) mesh representing a 3D object, and being formed of vertices, edges and faces; means for subdividing the dynamic 3D mesh into subareas; means for projecting the subareas onto two-dimensional (2D) planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; means for encoding 3D vertex positions, texture video bitstream and patch metadata to a Visual Volumetric Video-based Coding (V3C) bitstream; for one or more patches in a texture video bitstream, means for encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and means for transmitting the bitstreams to a decoder.
  • V3C Visual Volumetric Video-based Coding
  • an apparatus for decoding comprising means for receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; means for receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; means for decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; means for decoding vertex connectivity information from the bitstream; and means for reconstructing a mesh according to decoded vertex positions, vertex connectivity information and texture mapping information.
  • a method for encoding comprising: receiving a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdividing the dynamic 3D mesh into subareas; projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmitting the bitstreams to a decoder.
  • a method for decoding comprising: receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decoding vertex connectivity information from the bitstream; and reconstructing a mesh according to decoded vertex positions, vertex connectivity information and texture mapping information.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdivide the dynamic 3D mesh into subareas; project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmit the bitstreams to a decoder.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decode vertex connectivity information from the bitstream; and reconstruct a mesh according to decoded vertex positions, vertex connectivity information and texture mapping information.
  • according to a seventh aspect there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdivide the dynamic 3D mesh into subareas; project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmit the bitstreams to a decoder.
  • a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decode vertex connectivity information from the bitstream; and reconstruct a mesh according to decoded vertex positions, vertex connectivity information and texture mapping information.
  • an apparatus for encoding further comprises means for encoding partial information on vertex 3D positions into a bitstream, where the partial information comprises one or more of the following: two out of three absolute coordinates in 3D space; residual 3D coordinates relative to the 3D coordinates specified in the patch metadata; 2D texture coordinates corresponding to an associated texture map.
  • apparatus for encoding further comprises means for encoding information on surface normals of a patch into a bitstream.
  • apparatus for encoding further comprises means for creating a geometry patch and a mesh-patch, wherein a mesh-patch comprises texture coordinates and a geometry patch comprises a 3D coordinate value being represented in said texture coordinates.
  • apparatus for encoding further comprises means for signaling the connectivity of 3D vertices represented in the patch as additional metadata in or along the V3C atlas data.
  • apparatus for decoding further comprises means for decoding information on surface normals of a patch from a bitstream.
  • apparatus for decoding further comprises means for decoding texture coordinates from a mesh-patch and means for decoding a value being represented in said texture coordinates from a geometry patch.
  • apparatus for decoding further comprises means for decoding the connectivity of 3D vertices represented in the patch from metadata in or along the V3C atlas data.
  • the computer program product is embodied on a non-transitory computer readable medium.
  • Fig. 1 shows an example of a compression process of a volumetric video
  • Fig. 2 shows an example of a de-compression of a volumetric video
  • Fig. 3a shows an example of a volumetric media conversion at an encoder
  • Fig. 3b shows an example of a volumetric media reconstruction at a decoder
  • Fig. 4 shows an example of block to patch mapping
  • Fig. 5a shows an example of an atlas coordinate system
  • Fig. 5b shows an example of a local 3D patch coordinate system
  • Fig. 5c shows an example of a final target 3D coordinate system
  • Fig. 6 shows a V-PCC extension for mesh encoding
  • Fig. 7 shows a V-PCC extension for mesh decoding
  • Fig. 8 is a flowchart illustrating a method for encoding according to an embodiment
  • Fig. 9 is a flowchart illustrating a method for decoding according to another embodiment.
  • Fig. 10 shows an example of an apparatus.
  • the present embodiments relate to encoding, signalling, and rendering a volumetric video based on mesh coding.
  • the aim of the present solution is to improve the industry standard for reconstructing mesh surfaces for volumetric video.
  • This specification discloses implementation methods to ensure temporal stabilization of mesh UV textures which in consequence increase compression efficiency of the encoding pipeline.
  • Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
  • Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC).
  • PCC MPEG Point Cloud Coding
  • the process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • the patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the normal at every point can be estimated.
  • An initial clustering of the point cloud can then be obtained by associating each point with one of the six oriented planes defined by the normals (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), and (0.0, 0.0, -1.0):
  • each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the final step may comprise extracting patches by applying a connected component extraction procedure.
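As an illustrative sketch (not the patent's own implementation), the initial clustering step could look like the following; the function and variable names are hypothetical, and per-point normals are assumed to have been estimated already:

```python
# Hypothetical sketch of the initial clustering step: each point is
# assigned to the axis-aligned projection plane whose normal has the
# largest dot product with the estimated point normal.

PLANE_NORMALS = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def initial_clustering(point_normals):
    """For each point normal, return the index (0..5) of the oriented
    plane with the closest normal (maximum dot product)."""
    return [max(range(6), key=lambda i: dot(n, PLANE_NORMALS[i]))
            for n in point_normals]
```

The refinement step described above would then iteratively re-assign each point based on its own normal and the cluster indices of its nearest neighbors.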
  • Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105.
  • the packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch.
  • T may be a user-defined parameter.
  • Parameter T may be encoded in the bitstream and sent to the decoder.
  • W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
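The exhaustive raster-scan search described above can be sketched as follows (a simplified illustration with hypothetical names; the temporary doubling of H and the final clipping are left to the caller):

```python
def place_patch(used, patch_h, patch_w):
    """Find, in raster-scan order, the first position where a
    patch_h x patch_w patch fits without overlapping any used cell;
    mark those cells as used and return (row, col), or return None
    if no position fits. `used` is a 2D list of booleans covering
    the block grid."""
    grid_h, grid_w = len(used), len(used[0])
    for r in range(grid_h - patch_h + 1):
        for c in range(grid_w - patch_w + 1):
            if all(not used[r + i][c + j]
                   for i in range(patch_h) for j in range(patch_w)):
                for i in range(patch_h):
                    for j in range(patch_w):
                        used[r + i][c + j] = True
                return (r, c)
    return None  # caller may double H and search again, as described above
```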
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively.
  • the image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch may be projected onto two images, referred to as layers.
  • Let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • the first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • the second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • each block of TxT e.g., 16x16 pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
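For an edge block, the iterative fill described above can be sketched as follows (an illustrative sketch with hypothetical names, using 4-neighbors; `vals` holds the pixel values of one TxT block and `occ` marks its occupied pixels):

```python
def pad_edge_block(vals, occ):
    """Iteratively fill each empty pixel of a T x T block with the
    average of its occupied 4-neighbors, until no more pixels can be
    filled. `vals` is a T x T list of values, `occ` a T x T list of
    booleans marking occupied pixels."""
    T = len(vals)
    while True:
        newly = []
        for y in range(T):
            for x in range(T):
                if occ[y][x]:
                    continue
                nb = [vals[y + dy][x + dx]
                      for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                      if 0 <= y + dy < T and 0 <= x + dx < T
                      and occ[y + dy][x + dx]]
                if nb:
                    newly.append((y, x, sum(nb) / len(nb)))
        if not newly:
            break
        for y, x, v in newly:
            vals[y][x] = v
            occ[y][x] = True
    return vals
```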
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • for example, the following metadata may be encoded/decoded for every patch: the index of the projection plane, a 2D bounding volume (for example a bounding box), and the 3D location of the patch.
  • mapping information providing for each TxT block its associated patch index may be encoded as follows:
  • Let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
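A sketch of building the candidate patch list L for one block might look like this (hypothetical names; patch bounding boxes are given in block units as (u0, v0, u1, v1) with exclusive upper bounds, listed in the same order used to encode them):

```python
def candidate_patches(block, patches):
    """Build the ordered candidate list L for one T x T block.
    A patch is a candidate if its 2D bounding box contains the block;
    the special index 0 (empty space) is always a candidate."""
    bx, by = block
    L = [0]  # special index 0: empty space between patches
    for idx, (u0, v0, u1, v1) in enumerate(patches, start=1):
        if u0 <= bx < u1 and v0 <= by < v1:
            L.append(idx)
    return L
```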
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • One cell of the 2D grid produces a pixel during the image generation process.
  • the occupancy map compression 110 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0).
  • the remaining blocks may be encoded as follows:
  • the occupancy map can be encoded with a precision of B0xB0 blocks.
  • the compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • a value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
  • a binary information may be encoded for each TxT block to indicate whether it is full or not.
  • extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
    o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top-right or top-left corner.
    o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
    o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
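The run-length strategy for the sub-block flags can be sketched as follows (a simplified illustration; a real codec entropy-codes these values, and the traversal order that produced `bits` is signalled separately):

```python
def encode_subblocks(bits):
    """Run-length encode the binary full/empty flags of the B0 x B0
    sub-blocks of one TxT block, taken in the chosen traversal order.
    Returns (first_flag, run_lengths)."""
    runs = []
    prev, run = bits[0], 1
    for b in bits[1:]:
        if b == prev:
            run += 1
        else:
            runs.append(run)
            prev, run = b, 1
    runs.append(run)
    return bits[0], runs
```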
  • FIG. 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202.
  • the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204.
  • Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
  • the reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202.
  • the texture reconstruction 207 outputs a reconstructed point cloud.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows: δ(u, v) = δ0 + g(u, v); s(u, v) = s0 - u0 + u; r(u, v) = r0 - v0 + v, where g(u, v) is the value of the geometry image at pixel (u, v).
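The per-pixel reconstruction above can be sketched as follows (an illustrative sketch; the `patch` dictionary carrying the auxiliary patch information is a hypothetical structure, and `g_uv` stands for the decoded geometry-image value at (u, v)):

```python
def reconstruct_point(u, v, g_uv, patch):
    """Recover depth, tangential shift and bi-tangential shift for the
    point projected to pixel (u, v) of a patch, from the decoded
    geometry value and the auxiliary patch information."""
    depth = patch["delta0"] + g_uv               # delta(u, v)
    tangent = patch["s0"] - patch["u0"] + u      # s(u, v)
    bitangent = patch["r0"] - patch["v0"] + v    # r(u, v)
    return depth, tangent, bitangent
```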
  • V3C (Visual Volumetric Video-based Coding) is specified in ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)).
  • V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text).
  • ISO/IEC 23090-12 will refer to this common part.
  • ISO/IEC 23090-5 will be renamed to V3C PCC, ISO/IEC 23090-12 renamed to V3C MIV.
  • V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information.
  • Such representations may include occupancy, geometry, and attribute components.
  • the occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation.
  • the geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data.
  • An example is shown in Figures 3a and 3b, where Figure 3a presents volumetric media conversion at an encoder, and where Figure 3b presents volumetric media reconstruction at a decoder side.
  • the 3D media is converted to a series of 2D representations: occupancy 301, geometry 302, and attributes 303. Additional information may also be included in the bitstream to enable inverse reconstruction.
  • An atlas 304 consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding volume associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information. Atlases may be partitioned into patch packing blocks of equal size.
  • FIG. 4 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
  • Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
  • Figure 5a shows an example of a single patch 520 packed onto an atlas image 510.
  • This patch 520 is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes.
  • the projection plane is equal to the sides of an axis-aligned 3D bounding volume 530, as shown in Figure 5b.
  • the location of the bounding volume 530 in the 3D model coordinate system can be obtained by adding the offsets TilePatch3dOffsetU, TilePatch3dOffsetV, and TilePatch3dOffsetD, as illustrated in Figure 5c.
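The final translation step can be sketched as follows (a trivial sketch; the mapping of the (U, V, D) axes onto the model (X, Y, Z) axes depends on the projection plane and is omitted here):

```python
def patch_to_model(u, v, d, offset_u, offset_v, offset_d):
    """Translate a local 3D patch coordinate (U, V, D) into the final
    target 3D coordinate system by adding the patch offsets
    (TilePatch3dOffsetU/V/D)."""
    return (u + offset_u, v + offset_v, d + offset_d)
```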
  • Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream.
  • Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
  • V3C patch information is contained in the atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL (Network Abstraction Layer) units.
  • A NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes.
  • a NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
  • NAL units in atlas bitstream can be divided to atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units.
  • ACL atlas coding layer
  • non-ACL non-atlas coding layer
  • nal_unit_type specifies the type of the RBSP (Raw Byte Sequence Payload) data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5.
  • nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies.
  • the value of nal_layer_id shall be in the range of 0 to 62, inclusive.
  • the value of 63 may be specified in the future by ISO/IEC.
  • Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.
  • rbsp_byte[ i ] is the i-th byte of an RBSP.
  • An RBSP is specified as an ordered sequence of bytes as follows:
  • the RBSP contains a string of data bits (SODB);
  • if the SODB is empty (i.e., zero bits in length), the RBSP is also empty;
  • otherwise, the RBSP contains the SODB as follows:
  • the first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain;
  • the rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:
  • the first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
  • the next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
  • One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.
  • Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example, the following may be considered as typical content:
  • atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to an atlas on a frame level; the parameters are valid for one or more atlas frames.
  • sei_rbsp( ) used to carry SEI (Supplemental Enhancement Information) messages in NAL units.
  • the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP, discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits, which are equal to 0.
  • the data necessary for the decoding process is contained in the SODB part of the RBSP.
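The extraction rule above can be sketched in a few lines; this is an illustrative helper, not part of the specification:

```python
def extract_sodb(rbsp: bytes) -> str:
    """Recover the string of data bits (SODB) from an RBSP (sketch).

    Concatenate the bits of the RBSP bytes, discard the trailing zero
    bits, then discard the final bit equal to 1 (the rbsp_stop_one_bit).
    Returns the SODB as a string of '0'/'1' characters.
    """
    bits = "".join(f"{byte:08b}" for byte in rbsp)
    bits = bits.rstrip("0")      # drop trailing zero padding
    if not bits:
        return ""                # empty RBSP -> empty SODB
    assert bits[-1] == "1"       # the rbsp_stop_one_bit
    return bits[:-1]
```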
  • atlas_tile_group_layer_rbsp( ) contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches for which the metadata syntax is described below.
  • Annex F of the V3C/V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C/MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signaled in sei_rbsp( ), which is documented below.
  • Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.
  • When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are present in the bitstream are counted.
  • Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream.
  • the essential SEI messages are categorized into two types:
  • Type-A essential SEI messages: these SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
  • Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.
  • a polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes.
  • Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons, and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.
  • Polygon meshes are defined by the following elements:
  • Vertex: a position in 3D space defined as (x, y, z), along with other information such as color (r, g, b), normal vector, and texture coordinates.
  • Edge: a connection between two vertices.
  • Face: a closed set of edges, in which a triangle face has three edges, and a quad face has four edges.
  • Polygon: a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent.
  • Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.
  • Groups: some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
  • UV coordinates: most mesh formats also support some form of UV coordinates, which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
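The elements listed above can be captured in a minimal data structure; this sketch (not taken from the specification) stores vertex positions, per-vertex UV coordinates, and triangle faces, and derives the edge set from the faces:

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    positions: list  # per-vertex 3D positions (x, y, z)
    uvs: list        # per-vertex UV texture coordinates (u, v)
    faces: list      # triangles as (i, j, k) vertex-index tuples

    def edges(self) -> set:
        """Derive the undirected edge set from the triangle faces."""
        result = set()
        for i, j, k in self.faces:
            for a, b in ((i, j), (j, k), (k, i)):
                result.add((min(a, b), max(a, b)))
        return result
```

Note that, as the text says, only vertices and faces are stored explicitly; edges are recoverable from the faces.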
  • Figure 6 and Figure 7 show the extensions to the V3C encoder and decoder to support mesh encoding and mesh decoding.
  • the input mesh data 610 is demultiplexed 620 into vertex coordinate and attributes data 625 and mesh connectivity 627, where the mesh connectivity comprises vertex connectivity information.
  • the vertex coordinate and attributes data 625 is coded using MPEG-I V-PCC 630 (such as shown in Figure 1 ), whereas the mesh connectivity data 627 is coded in mesh connectivity encoder 635 as auxiliary data. Both of these are multiplexed 640 to create the final compressed output bitstream 650. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal mesh connectivity encoding.
  • the input bitstream 750 is demultiplexed 740 to generate the compressed bitstreams for vertex coordinates and attributes data and mesh connectivity.
  • the vertex coordinates and attributes data are decompressed using MPEG-I V-PCC decoder 730.
  • Vertex reordering 725 is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC decoder 730 to match the vertex order at the encoder.
  • Mesh connectivity data is decompressed using mesh connectivity decoder 735.
  • the decompressed data is multiplexed 720 to generate the reconstructed mesh 710.
  • Edgebreaker is an algorithm for efficient compression of 3D meshes.
  • Edgebreaker encodes the connectivity of triangle meshes. Because of its performance and simplicity, it has been adopted in popular compression libraries.
  • Edgebreaker is at the core of the Google Draco compression library.
  • Google Draco is an open-source library for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics.
  • the algorithm traverses all triangles of the mesh in a deterministic, spiral-like way, where:
  • each new triangle is adjacent to an already encoded one. This allows efficient compression of vertex coordinates and other attributes, such as normals: instead of storing absolute values, they can be predicted from an adjacent triangle (using parallelogram prediction), and only the difference between predicted and actual values, which is generally small, is stored.
  • edgebreaker uses just five possible symbols (“C”, “L”, “E”, “R”, “S”) per triangle, forming the so-called CLERS string.
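The parallelogram prediction mentioned above can be sketched as follows: given an already-decoded triangle (a, b, c), the vertex of the new triangle across the shared edge (a, b) is predicted as a + b - c, and only the residual is coded:

```python
def parallelogram_predict(a, b, c):
    """Predict the new vertex across the shared edge (a, b) of an
    already-decoded triangle (a, b, c): p = a + b - c, per coordinate."""
    return tuple(ai + bi - ci for ai, bi, ci in zip(a, b, c))


def residual(actual, predicted):
    """The (typically small) difference that is actually stored."""
    return tuple(x - p for x, p in zip(actual, predicted))
```

For smooth surfaces the residual is close to zero, which is what makes the entropy coding of coordinates effective.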
  • video coding tools may introduce an error that is small in the 2D video domain, but may have a significant impact on 3D reconstruction.
  • the present embodiments are targeted to the concept of a mesh-patch, mesh-patch encoding using edgebreaker, and related signalling in V3C.
  • mesh-patches can be signaled together with geometry patches, utilizing the best of both worlds.
  • the present embodiments provide a method to encode geometry, occupancy, and connectivity information in a mesh-patch.
  • the concept of V3C patch creation, as described above, remains untouched, i.e., mesh-patches and geometry patches describe the same data.
  • mesh-patches provide improved coding efficiency and additional support of important features required for dynamic mesh compression.
  • a geometry patch is a grayscale image together with some patch metadata (atlas data).
  • the pixel u,v coordinates, the pixel value, and the metadata, allow for 3D reconstruction. Additional information is required to specify if a pixel of the geometry patch is valid. For the use case of mesh compression, there is no approach for carrying vertex connectivity information.
  • a mesh-patch comprises the raw data for the same 3D reconstruction, as well as the same patch metadata.
  • the raw data can be significantly simplified:
  • the patch metadata can be used to reduce the entropy in X, Y, and Z vertex values, thus improving coding efficiency;
  • an encoder receives a dynamic 3D mesh, and subdivides the mesh into smaller meshes or similar subareas.
  • a mesh is a collection of vertices, edges and faces that define a shape of a 3D object. Therefore, the smaller meshes being divided from the received 3D mesh, are also collections of vertices, edges and faces.
  • the encoder projects these subareas onto 2D planes to form a texture image and an initial geometry image, and also the related patch metadata.
  • the initial geometry image has the same resolution as the texture image, but has a sparse distribution. Only a pixel position with a projected 3D vertex carries a value. The value z at this pixel position (u, v), together with the position X, Y, Z of the patch in 3D space (carried in the patch metadata), enables reconstructing each vertex back into 3D space, at position Xv, Yv, Zv as follows:
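The reconstruction can be sketched as below, assuming for illustration a projection along the z axis; the actual mapping depends on the patch projection parameters carried in the metadata, which are omitted here:

```python
def reconstruct_vertex(u, v, z, patch_pos):
    """Sketch: map a sparse geometry-image sample back into 3D space.

    (u, v) is the pixel position within the patch, z the pixel value, and
    patch_pos the patch position (X, Y, Z) carried in the patch metadata.
    Projection-axis handling is omitted for brevity.
    """
    X, Y, Z = patch_pos
    return (X + u, Y + v, Z + z)
```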
  • the reconstructed vertices are coded using an edgebreaker and then multiplexed in a V3C bitstream.
  • the texture coordinates u, v required for the texture mapping need to be included in the edgebreaker data.
  • Patch location X and Y is carried in the patch metadata.
  • u, v, and z values are encoded for each vertex.
  • X, Y, Z are carried in the patch metadata.
  • the value ranges to be encoded in the edgebreaker can be significantly reduced, thus improving coding efficiency. It is appreciated that the subdivision of a 3D mesh into smaller meshes or similar data already improves coding efficiency as entropy of the encoded data is reduced.
  • additional information such as surface normals can be included in the edgebreaker coded information.
  • the encoder creates both geometry patches and mesh-patches.
  • mesh-patches can be simplified, only carrying u and v coordinates, as the value for z is given from the corresponding pixel value zg in the geometry patch. This approach improves coding efficiency of the edgebreaker by about 20 percent at the cost of additional geometry video signalling.
  • the V3C mesh encoder stores edgebreaker data as follows:
  • the encoder uses edgebreaker to generate the order of vertices (with associated uv attributes) and a CLERS string providing connectivity information;
  • vertices may or may not be encoded with parallelogram prediction.
  • each vertex is stored as V3C patch data of size 1x1;
  • V3C patch data units (patch_data_unit syntax structure in ISO/IEC 23090-5), where the semantics of the patch data are as follows: o pdu_2d_pos_x stores the u value of the texture map for the vertex; o pdu_2d_pos_y stores the v value of the texture map for the vertex; o pdu_2d_size_x_minus1 equal to 0, ue(v) equal to 1 bit; o pdu_2d_size_y_minus1 equal to 0, ue(v) equal to 1 bit; o pdu_3d_offset_u stores the x value of a vertex; o pdu_3d_offset_v stores the y value of a vertex; o pdu_3d_offset_d stores the z value of a vertex; o pdu_projection
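As an illustrative sketch of the per-vertex semantics above, the following packs one vertex into the fields of a 1x1 patch data unit (represented here as a plain dictionary; the real patch_data_unit syntax structure is defined in ISO/IEC 23090-5):

```python
def vertex_to_pdu(u, v, x, y, z) -> dict:
    """Pack one mesh vertex into a 1x1 patch data unit (sketch)."""
    return {
        "pdu_2d_pos_x": u,          # u value of the texture map
        "pdu_2d_pos_y": v,          # v value of the texture map
        "pdu_2d_size_x_minus1": 0,  # patch width is 1
        "pdu_2d_size_y_minus1": 0,  # patch height is 1
        "pdu_3d_offset_u": x,       # x value of the vertex
        "pdu_3d_offset_v": y,       # y value of the vertex
        "pdu_3d_offset_d": z,       # z value of the vertex
    }
```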
  • Patches in an inter frame are stored as any type of V3C patch data units (i.e., patch_data_unit, skip_patch_data_unit, merge_patch_data_unit, or inter_patch_data_unit) with semantics aligned to the semantics described for patch_data_unit in the previous bullet point.
  • V3C patch data units i.e., patch_data_unit, skip_patch_data_unit, merge_patch_data_unit, or inter_patch_data_unit
  • CLERS strings (i.e., connectivity information) may be stored in one of the following ways:
  • o a new patch type that would store a string of bits
  • the string representing the history; e.g., the following bit mapping to the history string could be used:
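The exact bit mapping is not reproduced in this text; one possibility, shown purely for illustration, is the classic variable-length code from Rossignac's Edgebreaker work, which gives the most frequent symbol "C" a single bit:

```python
# Illustrative CLERS bit mapping (the classic Edgebreaker code);
# the mapping actually used by an encoder may differ.
CLERS_BITS = {"C": "0", "L": "110", "E": "111", "R": "101", "S": "100"}


def encode_clers(clers: str) -> str:
    """Map a CLERS history string to its bit representation."""
    return "".join(CLERS_BITS[symbol] for symbol in clers)
```

Since roughly half of the triangles in a typical mesh are of type "C", such a code averages about two bits per triangle.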
  • o a new NAL unit type that would always precede NAL unit containing the corresponding atlas_tile_layer_rbsp syntax element.
  • a decoder receives and decodes a bitstream comprising atlas metadata, a texture video bitstream and a sequence of edgebreaker bitstreams. Each edgebreaker bitstream corresponds to one patch in the texture video bitstream.
  • the decoder recreates the 3D vertex positions and texture mapping coordinates in each edgebreaker bitstream, potentially with additional information from the atlas/patch metadata.
  • the vertex connectivity information is directly decoded from the edgebreaker information.
  • the decoder reconstructs the mesh according to the decoded vertices position, connectivity, and texture mapping information.
  • pdu_3d_eb_enc_flag indicates if the edgebreaker information is present for geometry reconstruction of the patch.
  • pdu_eb_bs_idx[ tileID ][ patchIdx ] indicates the bitstream index of the edgebreaker bitstream corresponding to patch patchIdx of tile tileID.
  • the edgebreaker data is carried directly in the V3C atlas data as follows:
  • pdu_eb_clers_data_length indicates the length, in bits, of the data containing the CLERS string. It is appreciated that the new patch type would have a syntax structure very similar to the edgebreaker_rbsp( ) syntax structure, apart from the trailing bits, and for brevity it is not provided.
  • the method for encoding generally comprises receiving 805 a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdividing 810 the dynamic 3D mesh into subareas; projecting 815 the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encoding 820 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encoding 825 connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmitting 830 the bitstreams to a decoder.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; means for subdividing the dynamic 3D mesh into subareas; means for projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; means for encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, means for encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and means for transmitting the bitstreams to a decoder.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
  • the method for decoding generally comprises receiving 940 an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receiving 945 one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decoding 950 from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decoding 955 vertex connectivity information from the bitstream; and reconstructing 960 a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; means for receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; means for decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; means for decoding vertex connectivity information from the bitstream; and means for reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 9 according to various embodiments.
  • Figure 10 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec.
  • the electronic device may comprise an encoder or a decoder.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer.
  • the device may be also comprised as part of a head-mounted display device.
  • the apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 may be a multi-lens camera system having at least two camera sensors.
  • the camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and a UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
  • the various embodiments may provide advantages. For example, the present embodiments may improve coding efficiency. Also, the present embodiments may reduce decoding complexity (no additional occupancy data). Also, the present embodiments may support essential mesh features such as connectivity signalling, surface normals, etc.
  • a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.


Abstract

The embodiments relate to a method for encoding, comprising: receiving a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdividing the dynamic 3D mesh into subareas; projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmitting the bitstreams to a decoder. In addition, the embodiments relate to a method for decoding, and technical equipment for implementing the methods.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Technical Field
The present solution generally relates to encoding and decoding of volumetric video.
Background
Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight, and structured light are examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus for encoding comprising means for receiving a dynamic three-dimensional (3D) mesh representing a 3D object, and being formed of vertices, edges and faces; means for subdividing the dynamic 3D mesh into subareas; means for projecting the subareas onto two-dimensional (2D) planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; means for encoding 3D vertex positions, texture video bitstream and patch metadata to a Visual Volumetric Video-based Coding (V3C) bitstream; for one or more patches in a texture video bitstream, means for encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and means for transmitting the bitstreams to a decoder. According to a second aspect, there is provided an apparatus for decoding comprising means for receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; means for receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; means for decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; means for decoding vertex connectivity information from the bitstream; and means for reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
According to a third aspect, there is provided a method for encoding, comprising: receiving a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdividing the dynamic 3D mesh into subareas; projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encoding connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmitting the bitstreams to a decoder.
According to a fourth aspect, there is provided a method for decoding, comprising: receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decoding vertex connectivity information from the bitstream; and reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
According to a fifth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdivide the dynamic 3D mesh into subareas; project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmit the bitstreams to a decoder.
According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decode vertex connectivity information from the bitstream; and reconstruct a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces; subdivide the dynamic 3D mesh into subareas; project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch using an algorithm for compressing a mesh into respective bitstreams; and transmit the bitstreams to a decoder. According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decode vertex connectivity information from the bitstream; and reconstruct a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
According to an embodiment, an apparatus for encoding further comprises means for encoding partial information on vertex 3D positions into a bitstream, where the partial information comprises one or more of the following: two absolute coordinates out of three in 3D space; relative residual 3D coordinates with respect to the 3D coordinates specified in the patch metadata; 2D texture coordinates corresponding to an associated texture map.
According to an embodiment, an apparatus for encoding further comprises means for encoding information on surface normals of a patch into a bitstream.
According to an embodiment, an apparatus for encoding further comprises means for creating a geometry patch and a mesh-patch, wherein the mesh-patch comprises texture coordinates and the geometry patch comprises a 3D coordinate value represented at said texture coordinates.
According to an embodiment, an apparatus for encoding further comprises means for signaling the connectivity of 3D vertices represented in the patch as additional metadata in or along the V3C atlas data.
According to an embodiment, an apparatus for decoding further comprises means for decoding information on surface normals of a patch from a bitstream.
According to an embodiment, an apparatus for decoding further comprises means for decoding texture coordinates from a mesh-patch and means for decoding a value represented at said texture coordinates from a geometry patch.
According to an embodiment, an apparatus for decoding further comprises means for decoding the connectivity of 3D vertices represented in the patch from metadata in or along the V3C atlas data.
According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a compression process of a volumetric video;
Fig. 2 shows an example of a de-compression of a volumetric video;
Fig. 3a shows an example of a volumetric media conversion at an encoder;
Fig. 3b shows an example of a volumetric media reconstruction at a decoder;
Fig. 4 shows an example of block to patch mapping;
Fig. 5a shows an example of an atlas coordinate system;
Fig. 5b shows an example of a local 3D patch coordinate system;
Fig. 5c shows an example of a final target 3D coordinate system;
Fig. 6 shows a V-PCC extension for mesh encoding;
Fig. 7 shows a V-PCC extension for mesh decoding;
Fig. 8 is a flowchart illustrating a method for encoding according to an embodiment;
Fig. 9 is a flowchart illustrating a method for decoding according to another embodiment; and
Fig. 10 shows an example of an apparatus.
Description of Example Embodiments
The present embodiments relate to encoding, signalling, and rendering a volumetric video based on mesh coding. The aim of the present solution is to improve the industry standard for reconstructing mesh surfaces for volumetric video. This specification discloses implementation methods to ensure temporal stabilization of mesh UV textures, which in consequence increases the compression efficiency of the encoding pipeline.
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
Volumetric video data represents a three-dimensional scene or object and can be used as input for AR, VR and MR applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), plus any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e., CGI, or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight, and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
In the following, a short reference of ISO/IEC DIS 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
- (1.0, 0.0, 0.0),
- (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0), and
- (0.0, 0.0, -1.0).
More precisely, each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
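The initial clustering rule described above can be sketched as follows. Function and variable names are illustrative only and do not come from any reference implementation:

```python
# Sketch of the initial clustering step: each point is assigned to the
# oriented projection plane whose normal maximizes the dot product with
# the point normal (i.e., the plane with the closest normal).

PLANE_NORMALS = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0),
]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def initial_clustering(point_normals):
    """Return, for each point normal, the index of the closest plane."""
    clusters = []
    for n in point_normals:
        best = max(range(len(PLANE_NORMALS)),
                   key=lambda i: dot(n, PLANE_NORMALS[i]))
        clusters.append(best)
    return clusters
```

The iterative refinement step would then revisit each point and re-weigh this choice by the cluster indices of its nearest neighbors.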
Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.
The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected, and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
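The packing strategy above may be sketched as follows, assuming patch sizes are already given in TxT block units, each patch fits within the grid width, and the final clipping of H is omitted. All names are illustrative:

```python
def pack_patches(patches, W, H, T=16):
    """Place each (w, h) patch (sizes in T-blocks) at the first raster-scan
    position with no overlap; temporarily double the grid height when
    nothing fits. Returns the top-left pixel position of each patch."""
    grid_w, grid_h = W // T, H // T
    used = set()                      # occupied (bx, by) grid cells
    placements = []
    for (pw, ph) in patches:          # assumes pw <= grid_w
        placed = False
        while not placed:
            for by in range(grid_h - ph + 1):       # raster scan order
                for bx in range(grid_w - pw + 1):
                    cells = {(x, y) for x in range(bx, bx + pw)
                                    for y in range(by, by + ph)}
                    if not (cells & used):          # overlap-free insertion
                        used |= cells
                        placements.append((bx * T, by * T))
                        placed = True
                        break
                if placed:
                    break
            if not placed:
                grid_h *= 2           # double H and search again
    return placements
```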
The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:
• Geometry: WxH YUV420-8bit,
• Texture: WxH YUV420-8bit.
It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
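The three cases of this simple padding strategy may be sketched with a hypothetical helper operating on one TxT block, where None marks an empty pixel and, for the fully empty case, the last row of the previous block is copied (the column variant is analogous):

```python
def pad_block(block, prev_block):
    """Pad one T x T block (list of rows; None marks an empty pixel)."""
    T = len(block)
    empty = sum(row.count(None) for row in block)
    if empty == T * T:                  # empty block: copy last row of the
        fill = prev_block[T - 1][:]     # previous block in raster order
        return [fill[:] for _ in range(T)]
    if empty == 0:                      # full block: nothing is done
        return [row[:] for row in block]
    out = [row[:] for row in block]     # edge block: iterative averaging
    while any(None in row for row in out):
        nxt = [row[:] for row in out]
        for y in range(T):
            for x in range(T):
                if out[y][x] is None:
                    nb = [out[j][i]
                          for i, j in ((x-1, y), (x+1, y), (x, y-1), (x, y+1))
                          if 0 <= i < T and 0 <= j < T and out[j][i] is not None]
                    if nb:              # average of non-empty neighbors
                        nxt[y][x] = sum(nb) // len(nb)
        out = nxt
    return out
```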
The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, a 2D bounding volume (for example a bounding box), and the 3D location of the patch. For example, the following metadata may be encoded/decoded for every patch:
- index of the projection plane
o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)
- 2D bounding box (u0, v0, u1, v1)
- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
o Index 0: δ0 = x0, s0 = z0 and r0 = y0
o Index 1: δ0 = y0, s0 = z0 and r0 = x0
o Index 2: δ0 = z0, s0 = x0 and r0 = y0
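The per-index mapping between (x0, y0, z0) and the depth/tangential/bitangential triple can be expressed directly; the helper name is illustrative:

```python
def patch_3d_location(x0, y0, z0, projection_index):
    """Map the patch 3D location (x0, y0, z0) to (delta0, s0, r0)
    according to the projection plane index listed above."""
    if projection_index == 0:      # planes (+-1, 0, 0)
        return x0, z0, y0
    if projection_index == 1:      # planes (0, +-1, 0)
        return y0, z0, x0
    if projection_index == 2:      # planes (0, 0, +-1)
        return z0, x0, y0
    raise ValueError("projection_index must be 0, 1 or 2")
```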
Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:
- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.
The occupancy map compression 110 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.
The compression process may comprise one or more of the following example operations:
• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.
• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
• A binary information may be encoded for each TxT block to indicate whether it is full or not.
• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
■ The binary value of the initial sub-block is encoded.
■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
■ The number of detected runs is encoded.
■ The length of each run, except for the last one, is also encoded.
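The run-length coding of sub-block values above may be sketched as follows. The returned tuple is an illustrative symbol stream, not the normative bitstream syntax, and the run count is left implicit as the length of the returned list:

```python
def encode_occupancy_block(sub_blocks):
    """Run-length encode the binary values of the B0 x B0 sub-blocks of one
    T x T block, traversed in the order of the input list.
    Returns (is_full, initial_value, run_lengths)."""
    if all(sub_blocks):
        return (True, None, None)       # full block: one flag suffices
    runs = []
    current, length = sub_blocks[0], 1
    for v in sub_blocks[1:]:
        if v == current:
            length += 1
        else:
            runs.append(length)
            current, length = v, 1
    runs.append(length)
    # the initial value and all run lengths except the last are encoded;
    # the last length is implied by the block size
    return (False, sub_blocks[0], runs[:-1])
```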
Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits a compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.
The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs, and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.
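These three equations translate directly into code; argument names are illustrative:

```python
def reconstruct_point(u, v, g_uv, delta0, s0, r0, u0, v0):
    """Compute depth, tangential shift and bi-tangential shift of the
    point reconstructed from pixel (u, v) of the geometry image."""
    depth = delta0 + g_uv            # delta(u, v) = delta0 + g(u, v)
    s = s0 - u0 + u                  # s(u, v) = s0 - u0 + u
    r = r0 - v0 + v                  # r(u, v) = r0 - v0 + v
    return depth, s, r
```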
For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.

Visual volumetric video-based Coding (V3C) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to V3C PCC, and ISO/IEC 23090-12 to V3C MIV.
V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data. An example is shown in Figures 3a and 3b, where Figure 3a presents volumetric media conversion at an encoder, and where Figure 3b presents volumetric media reconstruction at a decoder side. The 3D media is converted to a series of 2D representations: occupancy 301 , geometry 302, and attributes 303. Additional information may also be included in the bitstream to enable inverse reconstruction.
Additional information that allows associating all these V3C video components, and enables the inverse reconstruction from a 2D representation back to a 3D representation is also included in a special component, referred to in this document as the atlas 304. An atlas 304 consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding volume associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information. Atlases may be partitioned into patch packing blocks of equal size. The 2D bounding volumes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices. Figure 4 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
Figure 5a shows an example of a single patch 520 packed onto an atlas image 510. This patch 520 is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes. For an orthographic projection, the projection plane is equal to the sides of an axis-aligned 3D bounding volume 530, as shown in Figure 5b. The location of the bounding volume 530 in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3DOffsetV, and TilePatch3DOffsetD, as illustrated in Figure 5c.
Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream. Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
V3C patch information is contained in atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units. NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
NAL units in atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units. The former are dedicated to carrying patch data, while the latter carry data necessary to properly parse the ACL units or any additional auxiliary data.
In the nal_unit_header() syntax nal_unit_type specifies the type of the RBSP (Raw Byte Sequence Payload) data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0. rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:
The RBSP contains a string of data bits (SODB) as follows:
• If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
• Otherwise, the RBSP contains the SODB as follows:
o The first byte of the RBSP contains the first (most significant, leftmost) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:
■ The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
■ The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
■ When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e., instances of rbsp_alignment_zero_bit) are present to result in byte alignment.
One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.
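The rbsp_trailing_bits( ) construction and its inverse can be sketched on plain bit lists. This is a simplification for illustration: emulation-prevention bytes and cabac_zero_word handling are omitted, and names are not from any reference implementation:

```python
def add_rbsp_trailing_bits(sodb_bits):
    """Append rbsp_stop_one_bit and rbsp_alignment_zero_bit padding so the
    result is byte-aligned. Bits are ints (0/1)."""
    if not sodb_bits:
        return []                       # empty SODB -> empty RBSP
    bits = list(sodb_bits) + [1]        # rbsp_stop_one_bit
    while len(bits) % 8 != 0:
        bits.append(0)                  # rbsp_alignment_zero_bit
    return bits

def extract_sodb(rbsp_bits):
    """Inverse operation: discard trailing zero bits and the last bit
    equal to 1 (the stop bit), keeping only the SODB."""
    if not rbsp_bits:
        return []
    i = len(rbsp_bits) - 1
    while rbsp_bits[i] == 0:
        i -= 1                          # skip alignment zeros
    return rbsp_bits[:i]                # drop the stop bit as well
```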
Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example, the following may be considered as typical content:
• atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a sequence level.
• atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a frame level and are valid for one or more atlas frames.
• sei_rbsp( ), used to carry SEI (Supplemental Enhancement Information) messages in NAL units.
• atlas_tile_group_layer_rbsp( ), used to carry patch layout information for tile groups.
When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, rightmost) bit equal to 1, and discarding any following (less significant, farther to the right) bits, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP.

atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches for which the metadata syntax is described below.
[Table: patch metadata syntax]
Annex F of the V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signaled in sei_rbsp() which is documented below.
[Table: sei_rbsp( ) syntax]
Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.
Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in the V3C V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are present in the bitstream are counted.
Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:
• Type-A essential SEI messages: These SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
• Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.

A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes. Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons, and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.
Polygon meshes are defined by the following elements:
• Vertex: A position in 3D space defined as (x, y, z) along with other information such as color (r, g, b), normal vector and texture coordinates.
• Edge: A connection between two vertices.
• Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges. A polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.
• Surfaces, or smoothing groups: useful, but not required, for grouping smooth regions.
• Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate subobjects for skeletal animation or separate actors for non-skeletal animation.
• Materials: defined to allow different portions of the mesh to use different shaders when rendered.
• UV coordinates: Most mesh formats also support some form of UV coordinates, which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
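The elements listed above may be illustrated with a minimal data structure. The class layout is an assumption for illustration and does not correspond to any standard mesh format:

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    position: tuple               # (x, y, z) in 3D space
    color: tuple = (0, 0, 0)      # (r, g, b)
    normal: tuple = (0.0, 0.0, 1.0)
    uv: tuple = (0.0, 0.0)        # texture coordinates

@dataclass
class Mesh:
    vertices: list = field(default_factory=list)
    faces: list = field(default_factory=list)    # triangles: (i, j, k)

    def edges(self):
        """Derive the edge set from the faces; in many applications edges
        are implicit and only vertices and faces are stored."""
        e = set()
        for (i, j, k) in self.faces:
            for a, b in ((i, j), (j, k), (k, i)):
                e.add((min(a, b), max(a, b)))
        return e
```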
Figure 6 and Figure 7 show the extensions to the V3C encoder and decoder to support mesh encoding and mesh decoding.
In the encoder extension, shown in Figure 6, the input mesh data 610 is demultiplexed 620 into vertex coordinate and attributes data 625 and mesh connectivity 627, where the mesh connectivity comprises vertex connectivity information. The vertex coordinate and attributes data 625 is coded using MPEG-I V-PCC 630 (such as shown in Figure 1 ), whereas the mesh connectivity data 627 is coded in mesh connectivity encoder 635 as auxiliary data. Both of these are multiplexed 640 to create the final compressed output bitstream 650. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal mesh connectivity encoding.
At the decoder, shown in Figure 7, the input bitstream 750 is demultiplexed 740 to generate the compressed bitstreams for vertex coordinates and attributes data and mesh connectivity. The vertex coordinates and attributes data are decompressed using MPEG-I V-PCC decoder 730. Vertex reordering 725 is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC decoder 730 to match the vertex order at the encoder. Mesh connectivity data is decompressed using mesh connectivity decoder 735. The decompressed data is multiplexed 720 to generate the reconstructed mesh 710.
Edgebreaker is an algorithm for efficient compression of 3D meshes; it encodes the connectivity of triangle meshes. Because of its performance and simplicity, edgebreaker has been adopted in popular compression libraries.
As an example, edgebreaker is at the core of the Google Draco compression library. Google Draco is an open-source library for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics. The algorithm traverses all triangles of the mesh in a deterministic, spiral-like way, where:
- Each new triangle is next to an already encoded one. This allows efficient compression of vertex coordinates and other attributes, such as normals. Instead of storing the absolute values, they can be predicted from an adjacent triangle (using parallelogram prediction), and only the difference between the predicted and actual values, which is generally small, is stored.
- Each triangle encodes the minimum information necessary to reconstruct mesh connectivity from the sequence. For simpler meshes, edgebreaker uses just five possible symbols (“C”, “L”, “E”, “R”, “S”) per triangle (forming the so-called CLERS string).
- Because half of the descriptors are Cs, a trivial code (C=0, L=110, E=111, R=101, S=100) guarantees an upper bound of 2 bits per triangle.
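The trivial code above can be illustrated with a short sketch. The symbol-to-bits mapping comes from the description; the function name and bit-string representation are assumptions for illustration only:

```python
# Trivial prefix code from the description: C=0, L=110, E=111, R=101, S=100.
CLERS_CODE = {"C": "0", "L": "110", "E": "111", "R": "101", "S": "100"}

def encode_clers(history: str) -> str:
    """Encode a CLERS history string into a bit string."""
    return "".join(CLERS_CODE[symbol] for symbol in history)

# Two 1-bit Cs and two 3-bit symbols: 8 bits for 4 triangles,
# i.e. the stated 2 bits per triangle when half the symbols are C.
assert encode_clers("CCRS") == "00101100"
```

Since C costs 1 bit and the other four symbols cost 3 bits, a history with at least half Cs averages at most 2 bits per triangle, matching the stated bound.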
Encoding vertex positions, or vertex positions encoded using parallelogram prediction, as raw data of a video frame has a couple of drawbacks:
- video coding tools are not suited for compressing such data, which contains high spatial frequencies;
- video coding tools may introduce an error that is small in the 2D video domain, but may have a significant impact on the 3D reconstruction.
It has also been identified that the data contributing the majority of the bitrate in the compressed stream (>90%) is texture information. It is, therefore, feasible to use non-video coding tools to encode the vertex positions with only a small impact on the overall compression.
The present embodiments are targeted to the concept of mesh-patches, mesh-patch encoding using edgebreaker, and the related signalling in V3C.
Mesh-patches are a new concept compared to the geometry patches known from V3C, as they:
- carry vertex connectivity information;
- do not require additional occupancy signalling, thus reducing the need for an occupancy video stream;
- can carry additional information, such as surface normals, without the need for additional video streams;
- in a combined approach, mesh-patches can be signaled together with geometry patches, utilizing the best of both worlds.
The present embodiments provide a method to encode geometry, occupancy, and connectivity information in a mesh-patch. The concept of V3C patch creation, as described above, remains untouched, i.e., mesh-patches and geometry patches describe the same data. However, mesh-patches provide improved coding efficiency and additional support of important features required for dynamic mesh compression.
A geometry patch is a grayscale image together with some patch metadata (atlas data). The pixel u,v coordinates, the pixel value, and the metadata allow for 3D reconstruction. Additional information is required to specify whether a pixel of the geometry patch is valid. For the use case of mesh compression, geometry patches offer no approach for carrying vertex connectivity information.
A mesh-patch is the raw data of the same 3D reconstruction, together with the same patch metadata. However, as patches result from a 3D-to-2D projection, the raw data can be significantly simplified:
- The patch metadata can be used to reduce the entropy in X, Y, and Z vertex values, thus improving coding efficiency;
- There is no need to signal additional occupancy information, as the occupancy is explicitly coded in edgebreaker, thus improving coding efficiency and reducing complexity;
- Compared to Full Mesh edgebreaker compression, there is no need to signal vertex texture coordinates, as these are identical with the X and Y coordinates coded for the vertex position. This improves coding efficiency and reduces complexity.
ENCODER

According to an embodiment, an encoder receives a dynamic 3D mesh and subdivides the mesh into smaller meshes or similar subareas. As described, a mesh is a collection of vertices, edges and faces that define the shape of a 3D object. Therefore, the smaller meshes divided from the received 3D mesh are also collections of vertices, edges and faces. The encoder then projects these subareas onto 2D planes to form a texture image and an initial geometry image, as well as the related patch metadata.
The initial geometry image has the same resolution as the texture image, but has a sparse distribution: only a pixel position with a projected 3D vertex carries a value. The value z at this pixel position (u, v), together with the position X, Y, Z of the patch in 3D space (carried in the patch metadata), enables reconstructing each vertex back into 3D space, at position Xv, Yv, Zv, as follows:
Xv=X + u
Yv=Y + v
Zv=Z + z
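The reconstruction above can be sketched as follows. Representing the sparse geometry image as a mapping from occupied pixel positions to z values is a simplifying assumption for illustration, as is the function name:

```python
def reconstruct_vertices(geometry, patch_pos):
    """Reconstruct 3D vertex positions from a sparse geometry image.

    geometry:  mapping of occupied pixel (u, v) -> depth value z
    patch_pos: (X, Y, Z) position of the patch in 3D space,
               carried in the patch metadata
    """
    X, Y, Z = patch_pos
    # Apply Xv = X + u, Yv = Y + v, Zv = Z + z per occupied pixel.
    return [(X + u, Y + v, Z + z) for (u, v), z in sorted(geometry.items())]

# One occupied pixel at (u, v) = (2, 3) with z = 5, patch at (10, 20, 30):
assert reconstruct_vertices({(2, 3): 5}, (10, 20, 30)) == [(12, 23, 35)]
```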
According to an embodiment, the reconstructed vertices are coded using edgebreaker and then multiplexed into a V3C bitstream. In this embodiment, it is not necessary to carry X, Y, Z in the patch metadata. However, the texture coordinates u, v required for the texture mapping need to be included in the edgebreaker data.
According to another embodiment, the reconstructed vertices are coded using edgebreaker and then multiplexed in a V3C bitstream. Patch location X and Y is carried in the patch metadata. This way, texture coordinates u, v can be derived from the 3D vertices as follows: u = Xv - X; v = Yv - Y.
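Given the formulas u = Xv - X and v = Yv - Y, deriving the texture coordinates is a direct inversion of the projection. A minimal sketch (the tuple representation of vertices and the function name are assumptions for illustration):

```python
def texture_coords(vertices, patch_xy):
    """Derive texture coordinates (u, v) from reconstructed 3D vertices,
    given the patch location X, Y carried in the patch metadata."""
    X, Y = patch_xy
    # u = Xv - X, v = Yv - Y; the depth Zv is not needed here.
    return [(xv - X, yv - Y) for (xv, yv, _zv) in vertices]

# Vertex (12, 23, 35) in a patch located at X, Y = (10, 20):
assert texture_coords([(12, 23, 35)], (10, 20)) == [(2, 3)]
```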
According to an embodiment, instead of the actual vertex positions, the u, v, and z values are encoded for each vertex. X, Y, Z are carried in the patch metadata. Thus, the value ranges to be encoded by edgebreaker can be significantly reduced, which improves coding efficiency. It is appreciated that the subdivision of a 3D mesh into smaller meshes or similar data already improves coding efficiency, as the entropy of the encoded data is reduced.
According to another embodiment, additional information, such as surface normals can be included in the edgebreaker coded information.
According to yet another embodiment, the encoder creates both geometry patches and mesh-patches. In such a case, mesh-patches can be simplified to only carry the u and v coordinates, as the value for z is given by the corresponding pixel value zg in the geometry patch. This approach improves the coding efficiency of the edgebreaker by about 20 percent at the cost of additional geometry video signalling.
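This combined approach can be sketched as follows, under the same illustrative assumptions as before (the geometry patch is modelled as a mapping from pixel positions to z values; function names are hypothetical):

```python
def reconstruct_combined(uv_list, geometry_image, patch_pos):
    """Combined sketch: the mesh-patch carries only (u, v) per vertex;
    the depth z is read from the corresponding geometry-patch pixel."""
    X, Y, Z = patch_pos
    return [(X + u, Y + v, Z + geometry_image[(u, v)]) for u, v in uv_list]

# Vertex at texture coordinate (2, 3); geometry patch stores z = 5 there.
assert reconstruct_combined([(2, 3)], {(2, 3): 5}, (10, 20, 30)) == [(12, 23, 35)]
```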
According to another embodiment, the V3C mesh encoder stores edgebreaker data as follows:
- for each mesh-patch at a given time instance, the encoder uses edgebreaker to generate an order of vertices (with associated uv attributes) and a CLERS string providing connectivity information;
It should be noted that vertices may or may not be encoded with parallelogram prediction.
- each vertex is stored as V3C patch data of size 1x1;
- patches in an intra frame (i.e., the first encoded frame, or a frame that all data from later frames should reference) are stored as V3C patch data units (patch_data_unit syntax structure in ISO/IEC 23090-5), where the semantics of the patch data are as follows:
o pdu_2d_pos_x stores the u value of the texture map for the vertex;
o pdu_2d_pos_y stores the v value of the texture map for the vertex;
o pdu_2d_size_x_minus1 equal to 0, ue(v) equal to 1 bit;
o pdu_2d_size_y_minus1 equal to 0, ue(v) equal to 1 bit;
o pdu_3d_offset_u stores the x value of a vertex;
o pdu_3d_offset_v stores the y value of a vertex;
o pdu_3d_offset_d stores the z value of a vertex;
o pdu_projection_id - set to 0 bits as the descriptor is u(v);
o pdu_orientation_index - set to 0 bits as the descriptor is u(v).
- patches in an inter frame (i.e., a frame that has a reference to another frame) are stored as any type of V3C patch data unit (i.e., patch_data_unit, skip_patch_data_unit, merge_patch_data_unit, or inter_patch_data_unit) with semantics aligned to those described for patch_data_unit in the previous bullet point.
- CLERS strings, i.e., connectivity information, are stored as:
o a new patch type that would store a string of bits, the string representing the history; e.g., the bit mapping C=0, L=110, E=111, R=101, S=100 could be used; or
o a new NAL unit type that would always precede the NAL unit containing the corresponding atlas_tile_layer_rbsp syntax element.
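The per-vertex field assignments above can be illustrated with a minimal sketch. The dictionary is a hypothetical stand-in for the patch_data_unit syntax structure of ISO/IEC 23090-5; only the field names and their assignments come from the description:

```python
def vertex_to_pdu(u, v, x, y, z):
    """Pack one vertex as a 1x1 intra-frame patch data unit (sketch)."""
    return {
        "pdu_2d_pos_x": u,          # u value of the texture map
        "pdu_2d_pos_y": v,          # v value of the texture map
        "pdu_2d_size_x_minus1": 0,  # patch size is fixed at 1x1
        "pdu_2d_size_y_minus1": 0,
        "pdu_3d_offset_u": x,       # x value of the vertex
        "pdu_3d_offset_v": y,       # y value of the vertex
        "pdu_3d_offset_d": z,       # z value of the vertex
    }

pdu = vertex_to_pdu(2, 3, 12, 23, 35)
assert pdu["pdu_3d_offset_d"] == 35 and pdu["pdu_2d_size_x_minus1"] == 0
```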
Initial performance values:
[Table: initial performance values (image not reproduced)]
As can be seen from the initial experimental performance values, almost the same bit rates can be achieved as are required for geometry patches, with the added benefit that no occupancy signalling is required and the connectivity information is fully preserved.
DECODER

According to an embodiment, a decoder receives and decodes a bitstream comprising atlas metadata, a texture video bitstream and a sequence of edgebreaker bitstreams. Each edgebreaker bitstream corresponds to one patch in the texture video bitstream.
The decoder recreates the 3D vertex positions and texture mapping coordinates from each edgebreaker bitstream, potentially with additional information from the atlas/patch metadata. The vertex connectivity information is directly decoded from the edgebreaker information.
The decoder reconstructs the mesh according to the decoded vertices position, connectivity, and texture mapping information.
SIGNALLING
Examples for syntax elements and semantics are presented in the following, and they can be included in the V3C standard.
Patch data unit syntax:
[Table: patch data unit syntax (image not reproduced)]
pdu_3d_eb_enc_flag indicates whether edgebreaker information is present for the geometry reconstruction of the patch. pdu_eb_bs_idx[ tileID ][ patchIdx ] indicates the bitstream index of the edgebreaker bitstream corresponding to patch patchIdx of tile tileID.
According to another embodiment, the edgebreaker data is carried directly in the V3C atlas data as follows:
Patch data unit syntax:
[Table: patch data unit syntax (image not reproduced)]
pdu_eb_enc_type indicates the method used to encode the history CLERS string. For example, pdu_eb_enc_type equal to 0 indicates that the bit mapping C=0, L=110, E=111, R=101, S=100 is used for the history string. pdu_eb_clers_data_length indicates the length, in bits, of the data containing the CLERS string. It is appreciated that the new patch type would have a syntax structure very similar to the edgebreaker_rbsp syntax structure, apart from the trailing bits; for brevity, it is not provided.
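For pdu_eb_enc_type equal to 0, the CLERS string can be recovered with a simple prefix-code walk over the signalled bits. A hedged sketch; the function and its bit-string input representation are illustrative assumptions, while the mapping itself comes from the description:

```python
# Inverse of the mapping C=0, L=110, E=111, R=101, S=100: a leading '0'
# decodes as C; otherwise the next three bits select one of L, E, R, S.
CLERS_DECODE = {"110": "L", "111": "E", "101": "R", "100": "S"}

def decode_clers(bits: str) -> str:
    """Decode a bit string back into the CLERS history string."""
    symbols, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            symbols.append("C")
            i += 1
        else:
            symbols.append(CLERS_DECODE[bits[i:i + 3]])
            i += 3
    return "".join(symbols)

assert decode_clers("00101100") == "CCRS"
```

Because the code is prefix-free, no delimiter between symbols is needed; pdu_eb_clers_data_length bounds the walk.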
The method for encoding according to an embodiment is shown in Figure 8. The method generally comprises receiving 805 a dynamic 3D mesh representing a 3D object and being formed of vertices, edges and faces; subdividing 810 the dynamic 3D mesh into subareas; projecting 815 the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; encoding 820 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, encoding 825 connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and transmitting 830 the bitstreams to a decoder. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving a dynamic 3D mesh representing a 3D object and being formed of vertices, edges and faces; means for subdividing the dynamic 3D mesh into subareas; means for projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions; means for encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream; for one or more patches in a texture video bitstream, means for encoding connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and means for transmitting the bitstreams to a decoder. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 8 according to various embodiments.
The method for decoding according to an embodiment is shown in Figure 9. The method generally comprises receiving 940 an encoded V3C bitstream, comprising patch metadata and texture video bitstream; receiving 945 one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; decoding 950 from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; decoding 955 vertex connectivity information from the bitstream; and reconstructing 960 a mesh according to decoded vertices position, vertex connectivity information and texture mapping information. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream; means for receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh; means for decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata; means for decoding vertex connectivity information from the bitstream; and means for reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 9 according to various embodiments.
An example of an apparatus is disclosed with reference to Figure 10. Figure 10 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. 
The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and a UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments may provide advantages. For example, the present embodiments may improve coding efficiency. Also, the present embodiments may reduce decoding complexity (no additional occupancy data). Also, the present embodiments may provide support for essential mesh features such as connectivity signalling, surface normals, etc.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims
1. An apparatus for encoding comprising
- means for receiving a dynamic three-dimensional (3D) mesh representing a 3D object, and being formed of vertices, edges and faces;
- means for subdividing the dynamic 3D mesh into subareas;
- means for projecting the subareas onto two-dimensional (2D) planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions;
- means for encoding 3D vertex positions, texture video bitstream and patch metadata to a Visual Volumetric Video-based Coding (V3C) bitstream;
- for one or more patches in a texture video bitstream, means for encoding connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and
- means for transmitting the bitstreams to a decoder.
2. The apparatus according to claim 1, further comprising means for encoding partial information on vertex 3D positions into a bitstream, where partial information comprises one or more of the following:
- two out of three absolute coordinates in 3D space;
- relative residual 3D coordinates to 3D coordinates specified in the patch metadata;
- 2D texture coordinates corresponding to an associated texture map.
3. The apparatus according to claim 1 or 2, further comprising means for encoding information on surface normals of a patch into a bitstream.
4. The apparatus according to claim 1 or 2 or 3, further comprising means for creating a geometry patch and a mesh-patch, wherein a mesh-patch comprises texture coordinates and a geometry patch comprises a 3D coordinate value being represented in said texture coordinates.
5. The apparatus according to any of the claims 1 to 4, further comprising means for signaling the connectivity of 3D vertices represented in the patch as additional metadata in or along the V3C atlas data.
6. An apparatus for decoding comprising
- means for receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream;
- means for receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh;
- means for decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata;
- means for decoding vertex connectivity information from the bitstream; and
- means for reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
7. The apparatus according to claim 6, further comprising means for decoding information on surface normals of a patch from a bitstream.
8. The apparatus according to claim 6 or 7, further comprising means for decoding texture coordinates from a mesh-patch and means for decoding a value being represented in said texture coordinates from a geometry patch.
9. The apparatus according to any of the claims 6 to 8, further comprising means for decoding the connectivity of 3D vertices represented in the patch from metadata in or along the V3C atlas data.
10. A method for encoding, comprising:
- receiving a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces;
- subdividing the dynamic 3D mesh into subareas;
- projecting the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions;
- encoding 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream;
- for one or more patches in a texture video bitstream, encoding connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and
- transmitting the bitstreams to a decoder.
11. The method according to claim 10, further comprising encoding partial information on vertex 3D positions into a bitstream, where partial information comprises one or more of the following:
- two out of three absolute coordinates in 3D space;
- relative residual 3D coordinates to 3D coordinates specified in the patch metadata;
- 2D texture coordinates corresponding to an associated texture map.
12. The method according to claim 10 or 11, further comprising encoding information on surface normals of a patch into a bitstream.
13. The method according to claim 10 or 11 or 12, further comprising creating a geometry patch and a mesh-patch, wherein a mesh-patch comprises texture coordinates and a geometry patch comprises a 3D coordinate value being represented in said texture coordinates.
14. The method according to any of the claims 10 to 13, further comprising signaling the connectivity of 3D vertices represented in the patch as additional metadata in or along the V3C atlas data.
15. A method for decoding, comprising
- receiving an encoded V3C bitstream, comprising patch metadata and texture video bitstream;
- receiving one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh;
- decoding from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata;
- decoding vertex connectivity information from the bitstream; and
- reconstructing a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
16. The method according to claim 15, further comprising decoding information on surface normals of a patch from a bitstream.
17. The method according to claim 15 or 16, further comprising decoding texture coordinates from a mesh-patch and decoding a value being represented in said texture coordinates from a geometry patch.
18. The method according to any of the claims 15 to 17, further comprising decoding the connectivity of 3D vertices represented in the patch from metadata in or along the V3C atlas data.
19. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces;
- subdivide the dynamic 3D mesh into subareas;
- project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions;
- encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream;
- for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and
- transmit the bitstreams to a decoder.

20. The apparatus according to claim 19, further comprising computer program code configured to cause the apparatus to encode partial information on vertex 3D positions into a bitstream, where partial information comprises one or more of the following:
- two out of three absolute coordinates in 3D space;
- relative residual 3D coordinates to 3D coordinates specified in the patch metadata;
- 2D texture coordinates corresponding to an associated texture map.

21. The apparatus according to claim 19 or 20, further comprising computer program code configured to cause the apparatus to encode information on surface normals of a patch into a bitstream.

22. The apparatus according to claim 19 or 20 or 21, further comprising computer program code configured to cause the apparatus to create a geometry patch and a mesh-patch, wherein a mesh-patch comprises texture coordinates and a geometry patch comprises a 3D coordinate value being represented in said texture coordinates.

23. The apparatus according to any of the claims 19 to 22, further comprising computer program code configured to cause the apparatus to signal the connectivity of 3D vertices represented in the patch as additional metadata in or along the V3C atlas data.

24. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream;
- receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh;
- decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata;
- decode vertex connectivity information from the bitstream; and
- reconstruct a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.

25. The apparatus according to claim 24, further comprising computer program code configured to cause the apparatus to decode information on surface normals of a patch from a bitstream.

26. The apparatus according to claim 24 or 25, further comprising computer program code configured to cause the apparatus to decode texture coordinates from a mesh-patch and to decode a value being represented in said texture coordinates from a geometry patch.

27. The apparatus according to any of the claims 24 to 26, further comprising computer program code configured to cause the apparatus to decode the connectivity of 3D vertices represented in the patch from metadata in or along the V3C atlas data.

28. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
- receive a dynamic 3D mesh representing a 3D object, and being formed of vertices, edges and faces;
- subdivide the dynamic 3D mesh into subareas;
- project the subareas onto 2D planes, wherein a pixel position on a 2D plane with a projected 3D vertex comprises a value for enabling determining 3D vertex positions;
- encode 3D vertex positions, texture video bitstream and patch metadata to a V3C bitstream;
- for one or more patches in a texture video bitstream, encode connectivity of 3D vertices represented in the patch, using an algorithm for compressing a mesh, into respective bitstreams; and
- transmit the bitstreams to a decoder.
29. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
- receive an encoded V3C bitstream, comprising patch metadata and texture video bitstream;
- receive one or more bitstreams, each bitstream corresponding to a patch in a texture video bitstream, the bitstreams being encoded using an algorithm for compressing a mesh;
- decode from the received bitstreams 3D vertex positions, texture video bitstream and information from a patch metadata;
- decode vertex connectivity information from the bitstream; and
- reconstruct a mesh according to decoded vertices position, vertex connectivity information and texture mapping information.
30. The computer program product according to claim 28 or 29, wherein the computer program product is embodied on a non-transitory computer readable medium.
PCT/FI2023/050045 2022-01-27 2023-01-19 A method, an apparatus and a computer program product for video encoding and video decoding WO2023144445A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20225069 2022-01-27
FI20225069 2022-01-27

Publications (1)

Publication Number Publication Date
WO2023144445A1 true WO2023144445A1 (en) 2023-08-03

Family

ID=85150860

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2023/050045 WO2023144445A1 (en) 2022-01-27 2023-01-19 A method, an apparatus and a computer program product for video encoding and video decoding

Country Status (1)

Country Link
WO (1) WO2023144445A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024074961A1 (en) * 2022-10-06 2024-04-11 Sony Group Corporation Orthoatlas: texture map generation for dynamic meshes using orthographic projections

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200286261A1 (en) * 2019-03-07 2020-09-10 Samsung Electronics Co., Ltd. Mesh compression
US20210090301A1 (en) * 2019-09-24 2021-03-25 Apple Inc. Three-Dimensional Mesh Compression Using a Video Encoder
US20210295566A1 (en) * 2020-03-18 2021-09-23 Sony Corporation Projection-based mesh compression
WO2022074515A1 (en) * 2020-10-06 2022-04-14 Sony Group Corporation Video based mesh compression


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Information technology - Coded representation of immersive media - Part 5: Visual volumetric video-based coding (V3C) and video-based point cloud compression (V-PCC)", 18 June 2021 (2021-06-18), pages 1 - 331, XP082026977, Retrieved from the Internet <URL:https://api.iec.ch/harmonized/publications/download/1101252> [retrieved on 20210618] *
BOYCE JILL M. ET AL: "MPEG Immersive Video Coding Standard", PROCEEDINGS OF THE IEEE, 1 January 2021 (2021-01-01), US, pages 1 - 16, XP055824808, ISSN: 0018-9219, DOI: 10.1109/JPROC.2021.3062590 *
DANILLO GRAZIOSI (SONY) ET AL: "[V-PCC][EE2.6-related] Mesh Patch Data", no. m55368, 7 October 2020 (2020-10-07), XP030292889, Retrieved from the Internet <URL:https://dms.mpeg.expert/doc_end_user/documents/132_OnLine/wg11/m55368-v1-m55368_mesh_patch_data.zip m55368_mesh_patch_data.docx> [retrieved on 20201007] *
FARAMARZI ESMAEIL ET AL: "Mesh Coding Extensions to MPEG-I V-PCC", 2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP), 21 September 2020 (2020-09-21), pages 1 - 5, XP055837185, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/stampPDF/getPDF.jsp?tp=&arnumber=9287057&ref=aHR0cHM6Ly9zY2hvbGFyLmdvb2dsZS5jb20v> DOI: 10.1109/MMSP48831.2020.9287057 *


Similar Documents

Publication Publication Date Title
US11509933B2 (en) Method, an apparatus and a computer program product for volumetric video
US12101457B2 (en) Apparatus, a method and a computer program for volumetric video
US11711535B2 (en) Video-based point cloud compression model to world signaling information
WO2021260266A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021191495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP4399877A1 (en) An apparatus, a method and a computer program for volumetric video
US12108082B2 (en) Method, an apparatus and a computer program product for volumetric video encoding and video decoding
WO2023144445A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2021205068A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021170906A1 (en) An apparatus, a method and a computer program for volumetric video
US20220159297A1 (en) An apparatus, a method and a computer program for volumetric video
US20230171427A1 (en) Method, An Apparatus and a Computer Program Product for Video Encoding and Video Decoding
EP3987774A1 (en) An apparatus, a method and a computer program for volumetric video
WO2019211519A1 (en) A method and an apparatus for volumetric video encoding and decoding
EP4443880A1 (en) A method, an apparatus and a computer program product for encoding and decoding of volumetric media content
WO2023047021A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023144439A1 (en) A method, an apparatus and a computer program product for video coding
US20230326138A1 (en) Compression of Mesh Geometry Based on 3D Patch Contours
US20230298218A1 (en) V3C or Other Video-Based Coding Patch Correction Vector Determination, Signaling, and Usage
WO2024012765A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023001623A1 (en) V3c patch connectivity signaling for mesh compression
WO2024209129A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023175243A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2022258879A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2024170819A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23702639; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2023702639; Country of ref document: EP; Effective date: 20240827)