WO2023144439A1 - A method, an apparatus and a computer program product for video coding - Google Patents


Info

Publication number
WO2023144439A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, bitstream, bitstreams, file, geometry
Application number
PCT/FI2022/050834
Other languages
French (fr)
Inventor
Lauri Aleksi ILOLA
Lukasz Kondrad
Emre Baris Aksu
Kashyap KAMMACHI SREEDHAR
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy
Publication of WO2023144439A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/001 Model-based coding, e.g. wire frame
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81 Monomedia components thereof
    • H04N21/816 Monomedia components thereof involving special video data, e.g. 3D video
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring

Definitions

  • the present solution generally relates to video encoding and video decoding.
  • Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications.
  • Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video).
  • Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more.
  • Representation formats for such volumetric data comprise triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
  • Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
  • an apparatus comprising means for receiving media data comprising three-dimensional models being formed of meshes or point clouds; means for compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and means for storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • a method comprising: receiving media data comprising three-dimensional models being formed of meshes or point clouds; compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive media data comprising three-dimensional models being formed of meshes or point clouds; compress a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and store the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive media data comprising three-dimensional models being formed of meshes or point clouds; compress a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and store the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • a file contains bitstreams being compressed with the algorithm for mesh compression.
  • one or more of the following boxes relating to the compressed geometry bitstreams are included into the box-structured file format:
  • the algorithm is a Draco compression algorithm.
  • a file comprises one or more tracks with geometry bitstream and zero or more related texture tracks.
  • the file comprises one or more tracks with connectivity sub-bitstreams, one or more tracks with attribute sub-bitstreams, and zero or more related texture tracks.
  • the file is an ISOBMFF file.
  • timed bitstream is stored in tracks in the file.
  • non-timed bitstream is stored as items in the file.
  • the computer program product is embodied on a non-transitory computer readable medium.
  • Fig. 1 shows an example of a compression process of a volumetric video
  • Fig. 2 shows an example of a de-compression of a volumetric video
  • Fig. 3a shows an example of a volumetric media conversion at an encoder
  • Fig. 3b shows an example of a volumetric media reconstruction at a decoder
  • Fig. 4 shows an example of block to patch mapping
  • Fig. 5a shows an example of an atlas coordinate system
  • Fig. 5b shows an example of a local 3D patch coordinate system
  • Fig. 5c shows an example of a final target 3D coordinate system
  • Fig. 6 shows a V-PCC extension for mesh encoding
  • Fig. 7 shows a V-PCC extension for mesh decoding
  • Fig. 8 shows a simplified example of rendering pipeline
  • Fig. 9 shows a structure of compressed Draco bitstream
  • Fig. 10 shows an example of Draco bitstream structure
  • Fig. 11 shows an example of single stream Draco track
  • Fig. 12 shows an example of single-stream encapsulation with sub-samples
  • Fig. 13 shows an example of multi-track encapsulation of Draco bitstream
  • Fig. 14 shows an example of encapsulation of static Draco bitstream as item
  • Fig. 15 is a flowchart illustrating a method according to an embodiment.
  • Fig. 16 shows an apparatus according to an embodiment.
  • Volumetric video data represents a three-dimensional scene or object and can be used as input for AR, VR and MR applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), plus any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video).
  • Volumetric video is either generated from 3D models, i.e., CGI, or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time. Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
  • 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes.
  • Infrared, lasers, time-of-flight, and structured light are all examples of technologies that can be used to construct 3D video data.
  • Representation of the 3D data depends on how the 3D data is used.
  • Dense voxel arrays have been used to represent volumetric medical data.
  • In 3D graphics, polygonal meshes are extensively used.
  • Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multilevel surface maps.
  • Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
  • Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC).
  • the process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
  • the patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the normal at every point can be estimated.
  • An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals: (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), and (0.0, 0.0, -1.0).
  • each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
  • the initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors.
  • the final step may comprise extracting patches by applying a connected component extraction procedure.
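  • As an illustration of the clustering step, the following sketch (hypothetical helper names; simplified relative to any actual encoder implementation) assigns a point to whichever of the six oriented plane normals maximizes the dot product with the point's estimated normal:

    #include <array>
    #include <cstddef>

    struct Vec3 { float x, y, z; };

    // The six axis-aligned projection-plane normals used for initial clustering.
    static const std::array<Vec3, 6> kPlaneNormals = {{
        {1, 0, 0}, {0, 1, 0}, {0, 0, 1},
        {-1, 0, 0}, {0, -1, 0}, {0, 0, -1},
    }};

    static float Dot(const Vec3& a, const Vec3& b) {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // Returns the index of the plane whose normal is closest to the estimated
    // point normal, i.e., the one maximizing the dot product.
    std::size_t ClassifyPoint(const Vec3& pointNormal) {
        std::size_t best = 0;
        float bestDot = Dot(pointNormal, kPlaneNormals[0]);
        for (std::size_t i = 1; i < kPlaneNormals.size(); ++i) {
            const float d = Dot(pointNormal, kPlaneNormals[i]);
            if (d > bestDot) { bestDot = d; best = i; }
        }
        return best;
    }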
  • Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105.
  • the packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch.
  • T may be a user-defined parameter.
  • Parameter T may be encoded in the bitstream and sent to the decoder.
  • W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded.
  • the patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
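  • A minimal sketch of this raster-scan placement search is given below (illustrative only: patches are simplified to rectangular footprints of grid cells, and the height-doubling retry is left to the caller):

    #include <vector>

    // Occupancy grid of (cols x rows) TxT cells; true marks a used cell.
    struct Grid {
        int cols, rows;
        std::vector<bool> used;  // row-major, cols * rows entries
        bool Fits(int c0, int r0, int pc, int pr) const {
            if (c0 + pc > cols || r0 + pr > rows) return false;
            for (int r = r0; r < r0 + pr; ++r)
                for (int c = c0; c < c0 + pc; ++c)
                    if (used[r * cols + c]) return false;
            return true;
        }
        void Mark(int c0, int r0, int pc, int pr) {
            for (int r = r0; r < r0 + pr; ++r)
                for (int c = c0; c < c0 + pc; ++c)
                    used[r * cols + c] = true;
        }
    };

    // Raster-scan search for the first overlap-free location. Returns false if
    // nothing fits; the caller then doubles the grid height and retries.
    bool PlacePatch(Grid& g, int patchCols, int patchRows, int& outC, int& outR) {
        for (int r = 0; r < g.rows; ++r)
            for (int c = 0; c < g.cols; ++c)
                if (g.Fits(c, r, patchCols, patchRows)) {
                    g.Mark(c, r, patchCols, patchRows);
                    outC = c; outR = r;
                    return true;
                }
        return false;
    }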
  • the geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively.
  • the image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images.
  • each patch may be projected onto two images, referred to as layers.
  • Let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0.
  • The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
  • the generated videos may have the following characteristics:
  • the geometry video is monochromatic.
  • the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 107.
  • the image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images.
  • the occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively.
  • the occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
  • the padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • each block of TxT (e.g., 16x16) pixels is processed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
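  • The edge-block case can be sketched as an iterative dilation, as below (a simplified sketch with hypothetical names; the empty-block copy and full-block no-op cases are omitted):

    #include <cstdint>
    #include <vector>

    // One in-place dilation pass over a TxT edge block: every empty pixel with
    // at least one occupied 4-neighbor is set to the average of those
    // neighbors. Repeating until no pixel changes fills the whole block.
    bool DilateOnce(std::vector<uint16_t>& pix, std::vector<bool>& occ, int T) {
        bool changed = false;
        std::vector<bool> next = occ;
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        for (int y = 0; y < T; ++y) {
            for (int x = 0; x < T; ++x) {
                if (occ[y * T + x]) continue;
                int sum = 0, n = 0;
                for (int k = 0; k < 4; ++k) {
                    const int nx = x + dx[k], ny = y + dy[k];
                    if (nx < 0 || ny < 0 || nx >= T || ny >= T) continue;
                    if (occ[ny * T + nx]) { sum += pix[ny * T + nx]; ++n; }
                }
                if (n > 0) {
                    pix[y * T + x] = static_cast<uint16_t>(sum / n);
                    next[y * T + x] = true;
                    changed = true;
                }
            }
        }
        occ = next;
        return changed;
    }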
  • the padded geometry images and padded texture images may be provided for video compression 108.
  • the generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters.
  • the video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102.
  • the smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
  • the patch may be associated with auxiliary information being encoded/decoded for each patch as metadata.
  • the auxiliary information may comprise the index of the projection plane, the 2D bounding volume (for example, a bounding box), and the 3D location of the patch.
  • Metadata may be encoded/decoded for every patch:
  • mapping information providing for each TxT block its associated patch index may be encoded as follows:
  • Let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block.
  • the order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • One cell of the 2D grid produces a pixel during the image generation process.
  • the occupancy map compression 110 leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0).
  • the remaining blocks may be encoded as follows:
  • the occupancy map can be encoded with a precision of B0xB0 blocks.
  • the compression process may comprise one or more of the following example operations:
  • Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block.
  • A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
  • a binary information may be encoded for each TxT block to indicate whether it is full or not.
  • extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
    o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
    o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
    o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.
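  • A sketch of these two operations, sub-block occupancy derivation and run-length coding, follows (hypothetical names; only the plain horizontal traversal order is shown):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Per-sub-block occupancy: 1 if a B0xB0 sub-block of a TxT block contains
    // at least one occupied (non-padded) pixel, 0 otherwise.
    std::vector<int> SubBlockValues(const std::vector<bool>& occ, int T, int B0) {
        const int n = T / B0;
        std::vector<int> values(n * n, 0);
        for (int y = 0; y < T; ++y)
            for (int x = 0; x < T; ++x)
                if (occ[y * T + x]) values[(y / B0) * n + (x / B0)] = 1;
        return values;
    }

    // Run-length encodes the sub-block values along the chosen traversal order
    // (here simply horizontal) as (value, run length) pairs.
    std::vector<std::pair<int, int>> RunLengthEncode(const std::vector<int>& v) {
        std::vector<std::pair<int, int>> runs;
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (!runs.empty() && runs.back().first == v[i]) ++runs.back().second;
            else runs.push_back({v[i], 1});
        }
        return runs;
    }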
  • FIG. 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202.
  • the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204.
  • Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
  • the reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts.
  • the implemented approach moves boundary points to the centroid of their nearest neighbors.
  • the smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202.
  • the texture reconstruction 207 outputs a reconstructed point cloud.
  • the texture values for the texture reconstruction are directly read from the texture images.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
  • the 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
    δ(u, v) = δ0 + g(u, v)
    s(u, v) = s0 - u0 + u
    r(u, v) = r0 - v0 + v
    where g(u, v) is the luma component of the geometry image.
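  • Expressed as code, the reconstruction of one point reads as follows (a minimal sketch assuming integer coordinates; g is the decoded geometry sample at (u, v)):

    #include <cstdint>

    struct Point3 { int32_t d, s, r; };  // depth, tangential, bi-tangential

    // Reconstructs the patch-local position of the point projected to pixel
    // (u, v) from the patch 3D location (d0, s0, r0) and its 2D bounding box
    // origin (u0, v0), following the three equations above.
    Point3 ReconstructPoint(int32_t g, int32_t d0, int32_t s0, int32_t r0,
                            int32_t u0, int32_t v0, int32_t u, int32_t v) {
        Point3 p;
        p.d = d0 + g;       // depth(u, v)     = d0 + g(u, v)
        p.s = s0 - u0 + u;  // tangential s    = s0 - u0 + u
        p.r = r0 - v0 + v;  // bi-tangential r = r0 - v0 + v
        return p;
    }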
  • the texture values can be directly read from the texture images.
  • the result of the decoding process is a 3D point cloud reconstruction.
  • Visual volumetric video-based Coding (V3C) is specified in ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)).
  • V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text).
  • ISO/IEC 23090-12 will refer to this common part.
  • ISO/IEC 23090-5 will be renamed to V3C PCC, ISO/IEC 23090-12 renamed to V3C MIV.
  • V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information.
  • Such representations may include occupancy, geometry, and attribute components.
  • the occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation.
  • the geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data.
  • An example is shown in Figures 3a and 3b, where Figure 3a presents volumetric media conversion at an encoder, and where Figure 3b presents volumetric media reconstruction at a decoder side.
  • the 3D media is converted to a series of 2D representations: occupancy 301, geometry 302, and attributes 303. Additional information may also be included in the bitstream to enable inverse reconstruction.
  • An atlas 304 consists of multiple elements, called patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding volume associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.
  • Atlases may be partitioned into patch packing blocks of equal size.
  • the 2D bounding volumes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices.
  • Figure 4 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
  • Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
  • Figure 5a shows an example of a single patch 520 packed onto an atlas image 510.
  • This patch 520 is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes.
  • the projection plane is equal to the sides of an axis-aligned 3D bounding volume 530, as shown in Figure 5b.
  • the location of the bounding volume 530 in the 3D model coordinate system can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3DOffsetV, and TilePatch3DOffsetD, as illustrated in Figure 5c.
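  • A minimal sketch of this last conversion step (assumed integer coordinates; the axis permutation from (U, V, D) onto the model axes is omitted):

    #include <cstdint>

    struct Vec3i { int64_t u, v, d; };

    // Maps a point from the local 3D patch coordinate system to the 3D model
    // coordinate system by adding the signaled patch offsets.
    Vec3i LocalToModel(const Vec3i& p, int64_t offU, int64_t offV, int64_t offD) {
        return {p.u + offU,   // TilePatch3dOffsetU
                p.v + offV,   // TilePatch3DOffsetV
                p.d + offD};  // TilePatch3DOffsetD
    }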
  • Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream.
  • Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.
  • V3C patch information is contained in the atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units.
  • A NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes.
  • a NAL unit specifies a generic format for use in both packet-oriented and bitstream systems.
  • the format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
  • NAL units in the atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units.
  • nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5.
  • nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies.
  • the value of nal_layer_id shall be in the range of 0 to 62, inclusive.
  • the value of 63 may be specified in the future by ISO/IEC.
  • Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.
  • rbsp_byte[ i ] is the i-th byte of an RBSP.
  • An RBSP is specified as an ordered sequence of bytes as follows:
  • If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
  • Otherwise, the RBSP contains the SODB as follows:
    o The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, and so on, until fewer than eight bits of the SODB remain.
    o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:
      ▪ The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
      ▪ The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
  • One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.
  • Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example, the following may be considered as typical content:
  • atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to an atlas on a frame level and which are valid for one or more atlas frames.
  • the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0.
  • the data necessary for the decoding process is contained in the SODB part of the RBSP.
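  • A sketch of this extraction (a simplified helper, not a normative decoder step) is shown below; it returns the number of SODB bits in an RBSP:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Drops trailing zero bytes (e.g., cabac_zero_word padding), trailing zero
    // bits, and the rbsp_stop_one_bit; what remains is the SODB.
    std::size_t SodbBitCount(std::vector<uint8_t> rbsp) {
        while (!rbsp.empty() && rbsp.back() == 0) rbsp.pop_back();
        if (rbsp.empty()) return 0;  // an empty RBSP carries an empty SODB
        int bit = 0;                 // index of the least significant set bit
        while (((rbsp.back() >> bit) & 1) == 0) ++bit;
        return rbsp.size() * 8 - static_cast<std::size_t>(bit) - 1;
    }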
  • atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches for which the metadata syntax is described below.
  • Annex F of the V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signaled in sei_rbsp(), which is documented below.
  • Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance. Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in V3C V-PCC specification (23090-5).
  • non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5).
  • the representation of the content of the SEI message is not required to use the same syntax specified in annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are present in the bitstream are counted.
  • Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream.
  • the essential SEI messages are categorized into two types:
  • Type-A essential SEI messages: These SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
  • Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.
  • a polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling.
  • the faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes.
  • Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons, and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.
  • Polygon meshes are defined by the following elements:
  • Vertex: A position in 3D space defined as (x, y, z), along with other information such as color (r, g, b), normal vector and texture coordinates.
  • Edge: A connection between two vertices.
  • Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges.
  • a polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape, and topology.
  • Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
  • UV coordinates: Most mesh formats also support some form of UV coordinates, which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
  • Figure 6 and Figure 7 show the extensions to the V3C encoder and decoder to support mesh encoding and mesh decoding.
  • the input mesh data 610 is demultiplexed 620 into vertex coordinate and attributes data 625 and mesh connectivity 627, where the mesh connectivity comprises vertex connectivity information.
  • the vertex coordinate and attributes data 625 is coded using MPEG-I V-PCC 630 (such as shown in Figure 1 ), whereas the mesh connectivity data 627 is coded in mesh connectivity encoder 635 as auxiliary data. Both of these are multiplexed 640 to create the final compressed output bitstream 650. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal mesh connectivity encoding.
  • the input bitstream 750 is demultiplexed 740 to generate the compressed bitstreams for vertex coordinates and attributes data and mesh connectivity.
  • the vertex coordinates and attributes data are decompressed using MPEG-I V-PCC decoder 730.
  • Vertex reordering 725 is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC decoder 730 to match the vertex order at the encoder.
  • Mesh connectivity data is decompressed using mesh connectivity decoder 735.
  • the decompressed data is multiplexed 720 to generate the reconstructed mesh 710.
  • 3D graphics objects are represented in 3D space, but the end-user often consumes the content from a flat 2D screen.
  • the process of converting 3D representations into 2D images is generally referred to as “rendering”, which may require dedicated hardware support to enable real-time conversion.
  • Hardware capabilities may be exposed by 3D graphics interfaces such as OpenGL, Vulkan, DirectX or Metal.
  • the functionality offered by these interfaces can be roughly divided into two parts: the first transforms 3D coordinates into 2D screen coordinates and the second part transforms the 2D coordinates into actual colored pixels visible to the end user.
  • the general pipeline handling these transformations is however much more complex and offers programmable stages to accommodate a large variety of different rendering techniques.
  • Each step of the graphics pipeline takes as input the output of the previous step.
  • Each programmable step of the pipeline is highly parallelized and optimized to perform a specific task. This leverages the underlying hardware capabilities of the graphics processing units (GPUs), which contain thousands of parallel high-frequency processing units and shared memory. These parallel cores are programmable and run small programs called shaders to accommodate the artistic freedom.
  • the input to 3D graphics pipeline may comprise 3D models, which may consist of meshes or point clouds, which both share the same core primitive, a vertex.
  • Point clouds associate vertex specific attributes such as color for each point to generate a 3D representation.
  • Meshes construct faces by connecting multiple vertices.
  • Figure 8 shows a simplified illustration of how a 3D input model 810 is converted into a 2D representation with the graphics pipeline.
  • meshes specifically include the connectivity information between vertices.
  • the connectivity data defines how vertices are connected to form faces and larger surfaces. Materials and textures may then be assigned on the surfaces using UV-coordinates and material descriptions. Higher level of visual detail on a surface may then be sampled from the associated texture and materials.
  • the textures and materials determine the 3D object's visual appearance in combination with the lighting in the scene.
  • Draco compression is a technology for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics. It supports compressing points, connectivity information, texture coordinates, color information, normals and any other generic attributes associated with geometry.
  • Draco may compress the 3D mesh either sequentially or using an edgebreaker algorithm.
  • the input to Draco encoder can be any 3D model and the encoder compresses it into a Draco bitstream.
  • the compressed bitstream may be decoded back into the original 3D model, or if needed, transcoded into a different format.
  • the present embodiments are discussed by using Draco as an example of a 3D data compression technique. However, it is appreciated that instead of the Draco compression technique, other techniques suitable for compressing mesh files can also utilize the teachings of the present embodiments.
  • the compressed bitstream is divided into four parts: Draco header, metadata, connectivity, and attributes.
  • Figure 9 illustrates the general structure of the compressed bitstream, where three different alternative structures 910, 920, 930 for the connectivity data 903 have been separated.
  • the Draco header 901 contains high-level information about the bitstream and identifies the bitstream as a Draco bitstream.
  • the fields of the header 901 are described below:
  • the metadata part 902 allows associating per-attribute metadata or tile-level metadata with the Draco bitstream. It can, for example, be used to describe attribute names.
  • the metadata 902 consists of the following information and enables recursively adding more levels of sub-metadata.
  • the metadata 902 is represented by key-value pairs. Its decoding proceeds along the lines of the following spec-style pseudocode:
    void DecodeMetadata() {
      ParseMetadataCount();  // reads num_att_metadata
      for (i = 0; i < num_att_metadata; ++i)
        DecodeMetadataElement(att_metadata[i]);
      DecodeMetadataElement(file_metadata);
    }
    where DecodeMetadataElement() decodes the key-value entries of one metadata element and recursively calls DecodeMetadataElement(metadata.sub_metadata[i]) for each of its sub-metadata elements.
  • the structure of the connectivity part 903 depends on the encoding method. It consists of either sequential or edgebreaker information.
  • the type of connectivity information is defined in the Draco header (encoder_method).
  • the sequential connectivity header contains the following fields:
    void ParseSequentialConnectivityData() {
      num_faces
      num_points
      connectivity_method
    }
  • the connectivity header contains information such as the number of faces and points in the connectivity bitstream as well as the connectivity method which identifies if the sequential connectivity bitstream consists of compressed indices or uncompressed indices.
  • the rest of the connectivity data contains the connectivity bitstream.
  • the compressed Draco bitstream may consist of edgebreaker encoded connectivity data instead of the sequential connectivity data.
  • the connectivity header contains information as defined by the ParseEdgebreakerConnectivityData structure:
    void ParseEdgebreakerConnectivityData() {
      edgebreaker_traversal_type
      num_encoded_vertices
      num_faces
      num_attribute_data
      num_encoded_symbols
      num_encoded_split_symbols
    }
  • the header provides information such as the traversal type, which indicates the type of the edgebreaker connectivity bitstream. This can be either standard edgebreaker (0) or valence edgebreaker (2). Additionally, it contains information such as the number of encoded vertices and attributes. It also provides information on the number of encoded symbols and split symbols, which are required to decode different parts of the edgebreaker encoded connectivity bitstream. In addition to the connectivity header, the connectivity data contains the connectivity bitstream.
  • the structure of the connectivity bitstream depends on the traversal type and can include encoded split data, encoded edgebreaker symbol data, encoded start face configuration data and the attribute connectivity data. It can additionally include a valence header and context data in case the edgebreaker valence traversal type is used.
  • the second part of the attribute data comprises compressed attributes, such as positions, texture coordinates, normals, etc.
  • Each attribute type section is comprised of one or more unique components.
  • ISOBMFF allows storage of timed captured audio/visual media streams, called media tracks.
  • the metadata which describes the track is separated from the encoded bitstream itself.
  • the format provides mechanisms to access media data in a codec-agnostic fashion from file parser perspective.
  • the media data may be provided in one or more instances of MediaDataBox 'mdat' and the MovieBox 'moov' may be used to enclose the metadata for timed media.
  • both of the ‘mdat’ and ‘moov’ boxes may be required to be present.
  • the ‘moov’ box may include one or more tracks, and each track may reside in one corresponding TrackBox ‘trak’.
  • Each track is associated with a handler, identified by a four-character code, specifying the track type.
  • Video, audio, and image sequence tracks can be collectively called media tracks, and they contain an elementary media stream.
  • Other track types comprise hint tracks and timed metadata tracks.
  • Tracks comprise samples, such as audio or video frames.
  • a media sample may correspond to a coded picture or an access unit.
  • a media track refers to samples (which may also be referred to as media samples) formatted according to a media compression format (and its encapsulation to the ISO base media file format).
  • a hint track refers to hint samples, containing cookbook instructions for constructing packets for transmission over an indicated communication protocol.
  • a timed metadata track may refer to samples describing referred media and/or hint samples.
  • the 'trak' box includes in its hierarchy of boxes the SampleTableBox (also known as the sample table or the sample table box).
  • the SampleTableBox contains the SampleDescriptionBox, which gives detailed information about the coding type used, and any initialization information needed for that coding.
  • the SampleDescriptionBox contains an entry-count and as many sample entries as the entry-count indicates.
  • the format of sample entries is track-type specific but derived from generic classes (e.g., VisualSampleEntry, AudioSampleEntry). The type of sample entry form used for deriving the track-type specific sample entry format is determined by the media handler of the track.
  • SampleEntry boxes may contain “extra boxes” not explicitly defined in the box syntax of ISO/IEC 14496-12. When present, such boxes shall follow all defined fields and should follow any defined contained boxes. Decoders shall presume a sample entry box could contain extra boxes and shall continue parsing as though they are present until the containing box length is exhausted.
  • SAP Type 1 corresponds to what is known in some coding schemes as a “Closed group of pictures (GOP) random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps) and in addition the first picture in decoding order is also the first picture in presentation order.
  • SAP Type 2 corresponds to what is known in some coding schemes as a “Closed GOP random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps), for which the first picture in decoding order may not be the first picture in presentation order.
  • SAP Type 3 corresponds to what is known in some coding schemes as an "Open GOP random access point", in which there may be some pictures in decoding order that cannot be correctly decoded and that have presentation times less than that of the intra-coded picture associated with the SAP.
  • a stream access point (SAP) sample group as specified in ISOBMFF identifies samples as being of the indicated SAP type.
  • a sync sample may be defined as a sample corresponding to SAP type 1 or 2.
  • a sync sample can be regarded as a media sample that starts a new independent sequence of samples; if decoding starts at the sync sample, it and succeeding samples in decoding order can all be correctly decoded, and the resulting set of decoded samples forms the correct presentation of the media starting at the decoded sample that has the earliest composition time.
  • Sync samples can be indicated with the SyncSampleBox (for those samples whose metadata is present in a TrackBox) or within sample flags indicated or inferred for track fragment runs.
  • a MetaBox 'meta' may be used. While the name of the meta box refers to metadata, items in it can generally contain metadata or media data.
  • the meta box may reside at the top level of the file, within a MovieBox ‘moov’, and within a TrackBox ‘trak’, but at most one meta box may occur at each of the file level, movie level, or track level.
  • the meta box may be required to contain a HandlerReferenceBox ‘hdlr’ indicating the structure or format of the MetaBox ‘meta’ contents.
  • the MetaBox may list and characterize any number of items that can be referred to, and each one of them can be associated with a file name and can be uniquely identified within the file by an item identifier (item_id), which is an integer value.
  • the metadata items may be, for example, stored in the ItemDataBox 'idat' of the MetaBox or in an 'mdat' box, or reside in a separate file. If the metadata is located external to the file, then its location may be declared by the DataInformationBox 'dinf'.
  • If the metadata is formatted using eXtensible Markup Language (XML) syntax and is required to be stored directly in the MetaBox, the metadata may be encapsulated into either the XMLBox 'xml' or the BinaryXMLBox 'bxml'.
  • An item may be stored as a contiguous byte range, or it may be stored in several extents, each being a contiguous byte range. In other words, items may be stored fragmented into extents, e.g., to enable interleaving.
  • An extent is a contiguous subset of the bytes of the resource, and the resource can be formed by concatenating the extents.
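  • Resource reassembly from extents can be sketched as a straightforward concatenation (illustrative types; offsets are assumed to be resolved against the enclosing file):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Extent { std::size_t offset, length; };  // contiguous byte range

    // Forms an item's data by concatenating its extents in declaration order.
    std::vector<uint8_t> AssembleItem(const std::vector<uint8_t>& file,
                                      const std::vector<Extent>& extents) {
        std::vector<uint8_t> item;
        for (const Extent& e : extents)
            item.insert(item.end(),
                        file.begin() + static_cast<std::ptrdiff_t>(e.offset),
                        file.begin() + static_cast<std::ptrdiff_t>(e.offset + e.length));
        return item;
    }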
  • High Efficiency Image File Format is a standard developed by the Moving Picture Experts Group (MPEG) for storage of images and image sequences.
  • the standard facilitates file encapsulation of data coded according to the High Efficiency Video Coding (HEVC) standard.
  • HEIF includes features building on top of the used ISO Base Media File Format (ISOBMFF).
  • the ISOBMFF structures and features are used to a large extent in the design of HEIF.
  • the basic design for HEIF comprises that still images are stored as items and image sequences are stored as tracks.
  • the following boxes may be contained within the root-level 'meta' box and may be used as described hereinafter.
  • the handler value of the Handler box of the 'meta' box is 'pict'.
  • the resource (whether within the same file, or in an external file identified by a uniform resource identifier) containing the coded media data is resolved through the DataInformationBox 'dinf', whereas the ItemLocationBox 'iloc' stores the position and sizes of every item within the referenced file.
  • the ItemReferenceBox 'iref' documents relationships between items using typed referencing. If there is an item among a collection of items that is in some way to be considered the most important compared to others, then this item is signaled by the PrimaryItemBox 'pitm'. Apart from the boxes mentioned here, the 'meta' box is also flexible to include other boxes that may be necessary to describe items.
  • Any number of image items can be included in the same file.
  • certain relationships may be qualified between images. Examples of such relationships include indicating a cover image for a collection, providing thumbnail images for some or all of the images in the collection, and associating some or all of the images in a collection with an auxiliary image such as an alpha plane.
  • a cover image among the collection of images is indicated using the 'pitm' box.
  • a thumbnail image or an auxiliary image is linked to the primary image item using an item reference of type 'thmb' or 'auxl', respectively.
  • the ItemPropertiesBox enables the association of any item with an ordered set of item properties.
  • Item properties are small data records.
  • the ItemPropertiesBox consists of two parts: an ItemPropertyContainerBox that contains an implicitly indexed list of item properties, and one or more ItemPropertyAssociationBox(es) that associate items with item properties.
  • An item property is formatted as a box.
  • a descriptive item property may be defined as an item property that describes rather than transforms the associated item.
  • a transformative item property may be defined as an item property that transforms the reconstructed representation of the image item content.
  • An entity group is a grouping of items, which may also group tracks.
  • the entities in an entity group share a particular characteristic or have a particular relationship, as indicated by the grouping type.
  • Entity groups are indicated in GroupsListBox.
  • Entity groups specified in GroupsListBox of a file-level MetaBox refer to tracks or file-level items.
  • Entity groups specified in GroupsListBox of a movie-level MetaBox refer to movie-level items.
  • Entity groups specified in GroupsListBox of a track-level MetaBox refer to track-level items of that track.
  • GroupsListBox contains EntityToGroupBoxes, each specifying one entity group.
  • the four-character box type of EntityToGroupBox denotes a defined grouping type.
  • The Draco bitstream is widely used in the computer graphics industry and is used as an example when discussing details of the present embodiments. A Draco bitstream does not contain means for storing texture information inside the bitstream itself, which means that if a vendor wishes to distribute Draco compressed bitstreams, separate texture file(s) would be needed. This means that other standards such as glTF are used to carry Draco compressed bitstreams along with the texture file(s).
  • ISOBMFF is an MPEG systems standard which defines how compressed bitstreams can be stored in a single file and provides the requested temporal synchronization features for timed data. It does not currently support storage of Draco compressed bitstreams. Should MPEG proceed with Draco compressed bitstreams in mesh coding, their storage in ISOBMFF may become topical. Furthermore, even without the MPEG mesh compression work, there might be value in providing a single-file storage for Draco compressed bitstreams and the associated texture data.
  • the present embodiments relate to a storage of timed and non-timed Draco compressed bitstreams in ISOBMFF file along with the associated texture image(s) or video(s). New signalling is provided to indicate that a file contains Draco compressed data.
  • the file may contain:
    o one or more tracks with Draco bitstream and zero or more related texture tracks, or
    o one or more tracks with Draco connectivity sub-bitstreams, one or more tracks with Draco attribute sub-bitstreams, and zero or more related texture tracks or items.
  • Encapsulation of static data in ISOBMFF is done using items.
  • Each file can contain one or more items with full Draco bitstream and zero or more related texture items.
  • Item based storage is illustrated in Figure 14.
  • Draco bitstream comprises Draco header, Draco metadata, Draco connectivity header, Draco connectivity sub-bitstream, Draco attribute header, and one or more Draco attribute sub-bitstreams.
  • Draco bitstream can be either static or timed, which means that different mechanisms for ISOBMFF encapsulation need to be considered.
  • timed data is stored in tracks and non-timed data is stored as items. These principles are respected and thus syntax elements for both encapsulations are considered.
  • the encapsulated data is the same Draco bitstream, thus defining common syntax elements and boxes to be used in both timed and non-timed encapsulation is considered.
  • Common structures and boxes
  • 'xxxx' will be a unique four-character code identifier for the Draco header box.
  • draco_string must be equal to “DRACO”.
  • major_version indicates the major version number of the bitstream.
  • minor_version indicates the minor version number of the bitstream.
  • encoder_type indicates if the content has been encoded as point clouds or as triangular mesh.
  • encoder_type equal to 0 indicates that the content is point clouds and encoder_type equal to 1 indicates a triangular mesh.
  • encoder_method indicates the encoding method of the bitstream.
  • encoder_method 0 indicates sequential encoding and encoder_method equal to 1 indicates edgebreaker encoding method
  • flags field contains 16 bits for signaling flags, e.g., the flag indicating the presence of metadata in the bitstream uses mask 32767.
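  • A parsing sketch for these header fields is shown below (assuming the field order above, single-byte version and type fields, and a little-endian 16-bit flags field; this is an illustration, not the normative Draco parser):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct DracoHeader {
        char draco_string[5];    // "DRACO"
        uint8_t major_version;
        uint8_t minor_version;
        uint8_t encoder_type;    // 0 = point cloud, 1 = triangular mesh
        uint8_t encoder_method;  // 0 = sequential, 1 = edgebreaker
        uint16_t flags;          // e.g., metadata-present flag
    };

    // Reads the header fields from the start of a Draco bitstream; returns
    // false when the magic string does not match.
    bool ParseDracoHeader(const uint8_t* data, std::size_t size, DracoHeader* h) {
        if (size < 11) return false;
        std::memcpy(h->draco_string, data, 5);
        if (std::memcmp(h->draco_string, "DRACO", 5) != 0) return false;
        h->major_version = data[5];
        h->minor_version = data[6];
        h->encoder_type = data[7];
        h->encoder_method = data[8];
        h->flags = static_cast<uint16_t>(data[9] | (data[10] << 8));
        return true;
    }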
  • metadata_key_size provides the size of the array that holds the key for the metadata.
  • key contains the key for the metadata
  • metadata_value_size contains the size of the array that holds the value of the metadata
  • value contains the value for the metadata.
  • num_sub_metadata provides the number of sub metadata elements.
  • sub_metadata_key_size provides the size of the array that holds the key for sub metadata.
  • sub_metadata_key holds the key for the sub metadata.
  • sub_metadata contains the recursive next level of the sub metadata.
  • DracoMetadata file_metadata;
  • num_attr_metadata provides the number of attribute specific metadata structures.
  • attr_metadata_id indicates the identifier for the attribute with which the metadata is associated with.
  • attr_metadata contains the metadata for the attribute.
  • file_metadata contains file level metadata for the bitstream.
  • the compressed Draco bitstream consists of two parts: connectivity data and attribute data.
  • Connectivity defines how vertices are connected to each other to form a mesh.
  • Attribute data defines attributes that are associated with said vertices, like position, color or normals. In the bitstream this data is represented in compressed form with related headers.
  • max_num_faces contains the maximum number of encoded faces in the bitstream.
  • max_num_points contains the maximum number of encoded points in the bitstream.
  • connectivity_method indicates if the sequentially coded indices are compressed or uncompressed. connectivity_method equal to 0 indicates that the indices are compressed, whereas connectivity_method equal to 1 indicates that the indices are uncompressed.
  • for the edgebreaker-based encoding method, i.e., encoder_method equal to 1, the connectivity header contains the following fields:
  • edgebreaker_traversal_type indicates the traversal type of the edgebreaker.
  • edgebreaker_traversal_type equal to 0 indicates standard edgebreaker traversal, equal to 1 indicates predictive edgebreaker traversal and equal to 2 indicates valence-based edgebreaker traversal.
  • max_num_encoded_vertices indicates the maximum number of encoded vertices.
  • max_num_faces indicates the maximum number of encoded faces.
  • num_attribute_data indicates the number of encoded attributes.
  • num_encoded_symbols indicates the number of encoded edgebreaker symbols.
  • num_encoded_split_symbols indicates the number of encoded edgebreaker split symbols.
  • the information in the connectivity header provides a client application with an overview of the content in the bitstream.
  • the information regarding the maximum number of faces and points, as well as the connectivity type, indicates to the client what kind of resources are required to play back the content.
  • DracoAttributeHeader attribute_header(encoder_method);
  • num_attributes_decoders indicates the number of attributes in the attribute sub-bitstream.
  • att_dec_data_id indicates the decoder identifier for a given attribute decoder.
  • att_dec_decoder_type indicates the type of the attribute decoder.
  • att_dec_decoder_type equal to 0 indicates a mesh vertex attribute decoder and equal to 1 indicates a mesh corner attribute decoder.
  • att_dec_traversal_method indicates the traversal method for the encoder.
  • att_dec_traversal_method equal to 0 indicates depth-first traversal and equal to 1 indicates prediction degree-based traversal method.
• att_dec_num_attributes indicates the number of attributes to be decoded per attribute decoder.
• att_dec_att_type indicates the type of the attribute, which can be position, UV-coordinate, normal, or other per-vertex associated data.
• att_dec_data_type indicates the attribute data type, which can be a floating-point type, UINT8, or similar.
• att_dec_num_components indicates the component count for the attribute, i.e., the number of components that represent the attribute. For a UV-coordinate attribute type the component count would be two.
  • att_dec_normalized indicates if the attribute represents normalized data.
• att_dec_unique_id indicates the unique identifier of the decoded attribute.
• seq_att_dec_decoder_type indicates the type of the sequential decoder: equal to 1 indicates an integer decoder, equal to 2 indicates a quantization decoder, and equal to 3 indicates a normal decoder.
• the DracoConnectivityHeader and DracoAttributeHeader contain important information that is needed to configure the draco_decoder. For different decoder implementations it may make sense to expose this information in a place where a file parser can easily access it. This allows an implementation to fail early if the bitstream contains technologies that are not supported. The header information also provides general information about the content, such as the maximum number of vertices, which can be useful for the client to decide if the content can be rendered in real time or if adaptation is required.
• For this purpose, a DracoDecoderConfigurationRecord is defined:

    aligned(8) struct DracoDecoderConfigurationRecord() {
        unsigned int(1) connectivity_header_inband;
        unsigned int(1) attribute_header_inband;
        bit(6) reserved;
        DracoConnectivityHeader connectivity_header;
        DracoAttributeHeader attribute_header;
    }

• connectivity_header_inband indicates if connectivity header(s) are stored as part of the encoded connectivity sub-bitstream.
• connectivity_header_inband equal to 0 indicates that the connectivity header is extracted from the bitstream and is only present in the DracoDecoderConfigurationRecord.
• connectivity_header_inband equal to 1 indicates that the connectivity header is present both in the DracoDecoderConfigurationRecord and in the encoded connectivity sub-bitstream, in which case the connectivity header information in the bitstream takes precedence over the information in the DracoDecoderConfigurationRecord.
• attribute_header_inband indicates if attribute header(s) are stored as part of the encoded bitstream. The information in the attribute header is unlikely to change during the sequence, but the flag enables easy injection of attribute data in ISOBMFF by preserving the header data in the bitstream. A parsing sketch for the record is given below.
    DracoDecoderConfigurationRecord configuration;
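The following Python sketch illustrates how a file parser might read the flag byte of the DracoDecoderConfigurationRecord defined above. It is a minimal sketch assuming the bit layout given in the definition (two 1-bit flags followed by 6 reserved bits); parsing of the contained connectivity and attribute headers is omitted.

    def parse_configuration_record_flags(record: bytes):
        # First byte: connectivity_header_inband (1 bit),
        # attribute_header_inband (1 bit), reserved (6 bits, expected 0).
        first = record[0]
        connectivity_header_inband = (first >> 7) & 0x1
        attribute_header_inband = (first >> 6) & 0x1
        # DracoConnectivityHeader and DracoAttributeHeader structures follow.
        return connectivity_header_inband, attribute_header_inband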
• DracoConnectivityData and DracoAttributeData contain the non-modified sub-bitstreams for connectivity and attribute information, respectively, which means that the information from the connectivity header and the attribute header is duplicated in the data definitions. This allows presenting the non-modified bitstream directly to the decoder. Furthermore, the storage of attribute headers and connectivity headers is not mandated.
• the encoded connectivity data shall be defined as follows:

    aligned(8) struct DracoConnectivityData() {
        char(8) sequential_compressed_indices[size];
        char(8) sequential_uncompressed_indices[size];
    }
  • the connectivity_data byte array shall contain exactly one DecodeConnectivityData element as defined in Draco bitstream specification, which means that the header data is included in the array.
• sequential_compressed_indices byte array shall contain exactly one DecodeSequentialCompressedIndices as defined in the Draco bitstream specification.
• sequential_uncompressed_indices shall contain exactly one DecodedSequentialIndices as defined in the Draco bitstream specification.
  • edgebreaker_connectivity_data byte array shall contain exactly one of each DecodeTopologySplitEvents, EdgebreakerTraversalStart,
• the encoded attribute data shall accordingly be stored as follows:

    aligned(8) struct DracoAttributeData() {
        char(8) attribute_data_including_header[size];
        char(8) attribute_data_excluding_header[size];
    }
• attribute_data_including_header byte array shall contain exactly one DecodeAttributeData element as defined in the Draco bitstream specification, which means that the attribute header data is included.
• attribute_data_excluding_header byte array shall contain exactly one DecodeAttributeData element excluding the ParseAttributeDecodersData element as defined in the Draco bitstream specification. This means that the attribute data does not contain the attribute header.
• when storage of dynamic Draco bitstreams is considered, the data can be encapsulated in single-track or multi-track mode.
• Figure 11 illustrates the single-track encapsulation mode, where the Draco bitstream is encapsulated in a single track referred to as a single-stream Draco track.
• the single-stream Draco track intends to preserve the bitstream as is, but provides useful information about the bitstream for the file parser.
• DracoSampleEntry can be defined as follows:

    aligned(8) class DracoSampleEntry() extends ... {
        DracoMetadataBox draco_metadata;
        DracoConfigurationBox configuration;
    }
• the samples of the Draco track are defined as follows:

    aligned(8) class DracoSample() {
        // sample_size value is the size of the sample from the SampleSizeBox
        char draco_payload[sample_size];
    }
• draco_payload byte array contains data representing a single element of DecodeConnectivityData and DecodeAttributeData as defined in the Draco bitstream specification.
  • the structures explicitly contain the header data for connectivity and attribute.
• the sample_size information is provided by the SampleSizeBox in ISOBMFF.
• samples of the Draco track can be marked as sync samples, as there is no inter prediction between samples.
• a decoder can take the sample entry information and any sample in a track to decode a mesh or point cloud, as sketched below.
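Because every sample is a sync sample, random access reduces to reading any single sample together with the sample entry. The following sketch assumes hypothetical track and decoder objects; it is not a normative API.

    def decode_frame(track, decoder, sample_index):
        # The sample entry carries the DracoDecoderConfigurationRecord.
        config = track.sample_entry.configuration
        # The sample size comes from the SampleSizeBox; any sample is
        # independently decodable into a complete mesh or point cloud.
        payload = track.read_sample(sample_index)
        return decoder.decode(config, payload)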
  • the ISOBMFF file may carry additional texture attribute information that was compressed as video, e.g., color information corresponding to a Draco compressed mesh.
  • the video compressed texture data can be stored as samples of another video track.
  • a file format should inform a file parser how to link those tracks.
• a track group may be used. Setting the same track group identifier for the single-stream Draco track and the video track means that the video track contains texture information that can be consumed together with the Draco bitstream.
  • the single-stream Draco track may consist of sub-samples or sample groups, which allow defining byte-ranges of sample data that correspond to connectivity sub-bitstream or individual attribute sub-bitstreams.
  • This encapsulation is illustrated in Figure 12.
  • the sample entry for such encapsulation may be defined as earlier, with the following restrictions:
• DracoDecoderConfigurationRecord.attribute_header_inband must be set to 0, to indicate that the attribute headers are not part of the sample data.
• DracoDecoderConfigurationRecord.connectivity_header_inband must be set to 1.
• Sub-sample definitions or sample groups for the connectivity sub-bitstream and for individual attribute sub-bitstreams indicate where such data is located.
• Both sub-sample-based and sample-group-based signaling must utilize att_dec_unique_id, which identifies the attribute stored in each sub-sample or sample group.
• The purpose of this encapsulation mode is to enable selective access to specific attributes of the Draco bitstream. This allows a client implementation to selectively decode only the attribute information that it requires, as illustrated by the sketch below.
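A minimal sketch of such selective access is given below; the sub-sample table layout (a byte range plus the att_dec_unique_id it carries) is a hypothetical simplification of the sub-sample and sample group signaling described above.

    def extract_attribute_subbitstream(sample: bytes, subsamples, wanted_id: int):
        # subsamples: parsed sub-sample entries, each exposing offset, size and
        # the att_dec_unique_id of the attribute stored in that byte range.
        for entry in subsamples:
            if entry.att_dec_unique_id == wanted_id:
                return sample[entry.offset:entry.offset + entry.size]
        return None  # the requested attribute is not present in this sample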
• Samples that belong to the connectivity sample group or connectivity sub-samples shall use the following sample format, where the size of the sample is derived from the SampleSizeBox:

    aligned(8) class DracoConnectivitySample() {
        // size is derived from SampleSizeBox
        DracoConnectivityData connectivity_data[size];
    }
• the samples that belong to the attribute sample group or sub-samples shall use the following sample format, where the size of the sample is derived from the SampleSizeBox:

    aligned(8) class DracoAttributeSample() {
        // size is derived from SampleSizeBox
        DracoAttributeData attribute_data[size];
    }
  • Draco bitstream can be stored in multiple tracks.
  • One track is used to store connectivity data and general configurations, whereas new tracks are used to store individual attributes. This encapsulation is illustrated in Figure 13.
  • the track(s), which contain the connectivity data, shall be referred to as Draco connectivity track(s), whereas the tracks containing the attribute data shall be referred to as Draco attribute tracks.
• the Draco connectivity track shall contain a DracoSampleEntry as described earlier in this specification, and in addition, a track reference box, which contains track references to all Draco attribute tracks. A new four-character code for the track references can be defined; the present disclosure uses ‘drat’ as an example.
• the Draco attribute sample entry may be defined as follows:

    aligned(8) class DracoAttributeSampleEntry() extends ... (‘bbbb’) {
        unsigned int(...) att_dec_unique_id;
        DracoAttributeHeaderBox attribute_header; // optional
    }

where ‘bbbb’ represents a unique identifier for the Draco attribute sample entry.
  • att_dec_unique_id indicates the identifier for the attribute information in the track. Together with the attribute header information of the Draco connectivity track, it can be used to decode the attribute information in the track.
  • Draco attribute sample entry may contain an optional Draco attribute header box, which stores only attribute information related to the att_dec_unique_id.
• when the Draco bitstream contains static data, i.e., it does not change temporally, it can be stored as an item in ISOBMFF.
  • Figure 14 illustrates the encapsulation.
• when the Draco bitstream is stored as a non-timed item, a new item property is defined along with item data, item references, or entity groups.
  • Figure 14 illustrates encapsulation design using the entity box.
• the Draco item property is associated with the relevant item by using the item property association box.
  • the location of the item is indicated by the item location box.
• the item data for the Draco bitstream can be defined as follows:

    aligned(8) class DracoItemData() {
        char draco_payload[];
    }

• draco_payload byte array contains data representing a single element of DecodeConnectivityData and DecodeAttributeData as specified in the Draco bitstream specification.
• the item data may be defined using the DracoConnectivityData and DracoAttributeData definitions as proposed earlier in this disclosure, if further subdivision of and access to item data is desired:
    aligned(8) class DracoItemData() {
        DracoConnectivityData connectivity_data[connectivity_size];
        DracoAttributeData attribute_data[attribute_size];
    }

where the connectivity and attribute size and location would be provided by separate values of extent_offset and extent_length signaled in the ItemLocationBox ‘iloc’. It can be enforced that extent_count shall be equal to 2, with the first extent indicating the position of the connectivity data and the second extent indicating the position of the attribute data, as sketched below.
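The following sketch resolves the two-extent layout described above, assuming the ItemLocationBox has already been parsed into (extent_offset, extent_length) pairs.

    def split_draco_item(file_bytes: bytes, extents):
        # Enforced layout: exactly two extents; the first locates the
        # connectivity data and the second the attribute data.
        assert len(extents) == 2
        (c_off, c_len), (a_off, a_len) = extents
        connectivity_data = file_bytes[c_off:c_off + c_len]
        attribute_data = file_bytes[a_off:a_off + a_len]
        return connectivity_data, attribute_data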
  • entity grouping may be considered.
• when an entity group identifies both a Draco item and an image item, it shall be assumed that the image item contains texture data that can be consumed together with the Draco compressed object.
  • item references can be defined linking Draco item to the image item.
• a new box can be defined as described below:

    aligned(8) class ObjectTransformation() {
        float(32) position[3];
        float(32) rotation_quat[3];
        float(32) scale[3];
    }
• the object transformation box may be placed in the Draco sample entry or the Draco item property. A sketch of applying the transformation is given below.
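As a non-normative illustration, the sketch below applies the ObjectTransformation fields to a single vertex (scale, then rotate, then translate). The box stores three rotation components; reconstructing the quaternion scalar w from the unit-norm constraint is an assumption of this sketch, not something mandated by the disclosure.

    import math

    def transform_vertex(v, position, rotation_quat, scale):
        x, y, z = rotation_quat
        w = math.sqrt(max(0.0, 1.0 - (x * x + y * y + z * z)))  # assumed convention
        sx, sy, sz = v[0] * scale[0], v[1] * scale[1], v[2] * scale[2]
        # Rotate by the quaternion: v' = v + 2 * r x (r x v + w * v), r = (x, y, z).
        cx = y * sz - z * sy + w * sx
        cy = z * sx - x * sz + w * sy
        cz = x * sy - y * sx + w * sz
        rx = sx + 2.0 * (y * cz - z * cy)
        ry = sy + 2.0 * (z * cx - x * cz)
        rz = sz + 2.0 * (x * cy - y * cx)
        return (rx + position[0], ry + position[1], rz + position[2])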
  • the method generally comprises receiving 1510 media data comprising three-dimensional models being formed of meshes or point clouds; compressing 1520 a three- dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and storing 1530 the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • Each of the steps can be implemented by a respective module of a computer system.
  • An apparatus comprises means for receiving media data comprising three-dimensional models being formed of meshes or point clouds; means for compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and means for storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
  • the means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry.
  • the memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 15 according to various embodiments.
  • Figure 16 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec.
  • the electronic device may comprise an encoder or a decoder.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer.
  • the device may be also comprised as part of a head-mounted display device.
• the apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any display technology suitable for displaying an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 may be a multi-lens camera system having at least two camera sensors.
  • the camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
• the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and a UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
  • the various embodiments may provide advantages.
• the present embodiments enable distribution of Draco compressed 3D assets with ISOBMFF and leverage the functionality offered by ISOBMFF, such as temporal random access.
  • the present embodiments focus on minimal processing requirements to extract original bitstream from ISOBMFF, thus enabling efficient implementations.
  • General information about the bitstream is exposed in ISOBMFF structures that allow client applications to quickly allocate decoding resources or fail fast if a feature of Draco bitstream is not supported by the client. This can be done before any bitstream parsing is initiated.
  • a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.
  • a computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

Abstract

The embodiments relate to a method comprising receiving media data comprising three-dimensional models being formed of meshes or point clouds; compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams. The embodiments also relate to an apparatus and a computer program product for implementing the method.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO CODING
Technical Field
The present solution generally relates to video encoding and video decoding.
Background
Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications. Such data describes geometry (Shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.
Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
Summary
The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus comprising means for receiving media data comprising three-dimensional models being formed of meshes or point clouds; means for compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and means for storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
According to a second aspect, there is provided a method, comprising: receiving media data comprising three-dimensional models being formed of meshes or point clouds; compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive media data comprising three-dimensional models being formed of meshes or point clouds; compress a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and store the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive media data comprising three-dimensional models being formed of meshes or point clouds; compress a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and store the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
According to an embodiment, it is indicated in a box that a file contains bitstreams being compressed with the algorithm for mesh compression.
According to an embodiment, one or more of the following boxes relating to the compressed geometry bitstreams are included into the box-structured file format:
- decoder configuration record to indicate information to configure a decoder;
- sample entry to indicate dynamic bitstream in a single-track encapsulation;
- attribute sample entry to indicate attribute information of a track;
- item property to indicate a static data item of the compressed bitstream;
- item data to indicate a location of the static data item;
- object transformation to indicate relative positioning, orientation, and scaling between objects.
According to an embodiment, the algorithm is a Draco compression algorithm.
According to an embodiment, a file comprises one or more tracks with geometry bitstream and zero or more related texture tracks.
According to an embodiment, the file comprises one or more tracks with connectivity sub-bitstreams, one or more tracks with attribute sub-bitstreams, and zero or more related texture tracks.
According to an embodiment, the file is an ISOBMFF file.
According to an embodiment, timed bitstream is stored in tracks in the file.
According to an embodiment, non-timed bitstream is stored as items in the file. According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.
Description of the Drawings
In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of a compression process of a volumetric video;
Fig. 2 shows an example of a de-compression of a volumetric video;
Fig. 3a shows an example of a volumetric media conversion at an encoder;
Fig. 3b shows an example of a volumetric media reconstruction at a decoder;
Fig. 4 shows an example of block to patch mapping;
Fig. 5a shows an example of an atlas coordinate system;
Fig. 5b shows an example of a local 3D patch coordinate system;
Fig. 5c shows an example of a final target 3D coordinate system;
Fig. 6 shows a V-PCC extension for mesh encoding;
Fig. 7 shows a V-PCC extension for mesh decoding;
Fig. 8 shows a simplified example of rendering pipeline;
Fig. 9 shows a structure of compressed Draco bitstream;
Fig. 10 shows an example of Draco bitstream structure;
Fig. 11 shows an example of single stream Draco track;
Fig. 12 shows an example of single-stream encapsulation with sub-samples;
Fig. 13 shows an example of multi-track encapsulation of Draco bitstream;
Fig. 14 shows an example of encapsulation of static Draco bitstream as item;
Fig. 15 is a flowchart illustrating a method according to an embodiment; and
Fig. 16 shows an apparatus according to an embodiment.
Description of Example Embodiments
The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.
Volumetric video data represents a three-dimensional scene or object and can be used as input for AR, VR and MR applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), plus any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e., CGI, or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time. Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.
Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight, and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
In the following, a short reference of ISO/IEC DIS 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.
Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.
The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:
- (1.0, 0.0, 0.0),
- (0.0, 1.0, 0.0),
- (0.0, 0.0, 1.0),
- (-1.0, 0.0, 0.0),
- (0.0, -1.0, 0.0), and
- (0.0, 0.0, -1.0)
More precisely, each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).
The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
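A minimal sketch of the initial clustering step is given below: each point is assigned to the oriented plane whose normal maximizes the dot product with the point normal. The refinement and connected component extraction steps are elided.

    # The six oriented planes listed above, identified by their normals.
    PLANES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
              (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0)]

    def initial_clustering(point_normals):
        def dot(a, b):
            return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
        clusters = []
        for n in point_normals:
            scores = [dot(n, plane) for plane in PLANES]
            clusters.append(scores.index(max(scores)))  # plane with closest normal
        return clusters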
Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.
The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
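The exhaustive raster-scan search can be sketched as follows; the grid is modeled as a set of occupied cells and patches as pw x ph rectangles, which is a simplification of the actual packing.

    def find_patch_location(used, W, H, pw, ph):
        # First overlap-free position in raster scan order; H is temporarily
        # doubled when no position in the current grid fits the patch.
        while True:
            for y in range(H - ph + 1):
                for x in range(W - pw + 1):
                    cells = {(x + i, y + j) for i in range(pw) for j in range(ph)}
                    if not (cells & used):
                        used |= cells  # mark the covered grid cells as used
                        return x, y, H
            H *= 2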
The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:
• Geometry: WxH YUV420-8bit,
• Texture: WxH YUV420-8bit,
It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.
The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors. The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
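The iterative fill of an edge block described above can be sketched as follows, assuming the block contains at least one occupied pixel (full and empty blocks are handled separately).

    def pad_edge_block(block, occupied):
        # block: TxT pixel values; occupied: TxT booleans marking filled pixels.
        T = len(block)
        while any(not occupied[y][x] for y in range(T) for x in range(T)):
            for y in range(T):
                for x in range(T):
                    if occupied[y][x]:
                        continue
                    nbrs = [block[j][i]
                            for i, j in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                            if 0 <= i < T and 0 <= j < T and occupied[j][i]]
                    if nbrs:  # average of the non-empty 4-neighbors
                        block[y][x] = sum(nbrs) / len(nbrs)
                        occupied[y][x] = True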
The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding volume, for example a bounding box, 3D location of the patch.
For example, the following metadata may be encoded/decoded for every patch:
- index of the projection plane
  o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
  o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
  o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)
- 2D bounding box (u0, v0, u1, v1)
- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
  o Index 0: δ0 = x0, s0 = z0 and r0 = y0
  o Index 1: δ0 = y0, s0 = z0 and r0 = x0
  o Index 2: δ0 = z0, s0 = x0 and r0 = y0
Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:
- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.
The occupancy map compression 110 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 result in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.
The compression process may comprise one or more of the following example operations:
• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.
• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
• Binary information may be encoded for each TxT block to indicate whether it is full or not.
• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:
    ■ The binary value of the initial sub-block is encoded.
    ■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
    ■ The number of detected runs is encoded.
    ■ The length of each run, except for the last one, is also encoded.
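A minimal sketch of this run-length coding of the sub-block values, for a traversal order already chosen by the encoder:

    def run_length_encode(bits):
        # Returns the initial bit value, the number of runs, and the length of
        # each run except the last (which is implied), per the scheme above.
        runs = [1]
        for prev, cur in zip(bits, bits[1:]):
            if cur == prev:
                runs[-1] += 1
            else:
                runs.append(1)
        return bits[0], len(runs), runs[:-1]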
Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits a compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.
The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.
For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
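The reconstruction formulas above translate directly into code; in the sketch below, g is the decoded geometry image (luma plane) and the patch parameters are assumed to have been decoded from the auxiliary patch information.

    def reconstruct_point(u, v, g, d0, s0, r0, u0, v0):
        depth = d0 + g[v][u]         # delta(u, v) = delta0 + g(u, v)
        tangential = s0 - u0 + u     # s(u, v) = s0 - u0 + u
        bitangential = r0 - v0 + v   # r(u, v) = r0 - v0 + v
        return depth, tangential, bitangential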
Visual volumetric video-based Coding (V3C) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to V3C PCC, and ISO/IEC 23090-12 renamed to V3C MIV.
V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data. An example is shown in Figures 3a and 3b, where Figure 3a presents volumetric media conversion at an encoder, and where Figure 3b presents volumetric media reconstruction at a decoder side. The 3D media is converted to a series of 2D representations: occupancy 301 , geometry 302, and attributes 303. Additional information may also be included in the bitstream to enable inverse reconstruction.
Additional information that allows associating all these V3C video components, and enables the inverse reconstruction from a 2D representation back to a 3D representation is also included in a special component, referred to in this document as the atlas 304. An atlas 304 consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding volume associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.
Atlases may be partitioned into patch packing blocks of equal size. The 2D bounding volumes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices. Figure 4 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.
Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.
Figure 5a shows an example of a single patch 520 packed onto an atlas image 510. This patch 520 is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes. For an orthographic projection, the projection plane is equal to the sides of an axis-aligned 3D bounding volume 530, as shown in Figure 5b. The location of the bounding volume 530 in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3dOffsetV, and TilePatch3dOffsetD, as illustrated in Figure 5c.
Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream. Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream. V3C patch information is contained in atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units. NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.
NAL units in the atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units. The former are dedicated to carrying patch data, while the latter carry data necessary to properly parse the ACL units or any additional auxiliary data.
In the nal_unit_header() syntax nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0. rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:
The RBSP contains a string of data bits (SODB) as follows:
• If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
• Otherwise, the RBSP contains the SODB as follows: o The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain. o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows: ■ The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
■ The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
■ When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e., instances of rbsp_alignment_zero_bit) are present to result in byte alignment.
One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.
Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example, the following may be considered as typical content:
• atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a sequence level.
• atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a frame level and are valid for one or more atlas frames.
• sei_rbsp( ), used to carry SEI messages in NAL units.
• atlas_tile_group_layer_rbsp( ), used to carry patch layout information for tile groups.
When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0; a sketch of this extraction is given below. The data necessary for the decoding process is contained in the SODB part of the RBSP. atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches, for which the metadata syntax is specified in ISO/IEC 23090-5.
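A minimal sketch of locating the SODB inside an RBSP follows; trailing cabac_zero_words are all zero, so the backward scan skips over them automatically.

    def sodb_bit_length(rbsp: bytes) -> int:
        # Scan backwards for the rbsp_stop_one_bit (the last bit equal to 1);
        # the bits that precede it form the SODB.
        for bit in range(len(rbsp) * 8 - 1, -1, -1):
            byte_index, offset = divmod(bit, 8)
            if (rbsp[byte_index] >> (7 - offset)) & 1:
                return bit  # number of SODB bits before the stop bit
        return 0  # empty SODB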
Annex F of the V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signaled in sei_rbsp(), whose syntax is specified in ISO/IEC 23090-5.
Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance. Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in V3C V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are present in the bitstream are counted.
Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:
• Type-A essential SEI messages: These SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
• Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.
A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes. Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons, and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.
Polygon meshes are defined by the following elements:
• Vertex: A position in 3D space defined as (x, y, z) along with other information such as color (r, g, b), normal vector and texture coordinates.
• Edge: A connection between two vertices.
• Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges. A polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape, and topology.
• Surfaces: also called smoothing groups; useful, but not required, for grouping smooth regions.
• Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate sub-objects for skeletal animation or separate actors for non-skeletal animation.
• Materials: defined to allow different portions of the mesh to use different shaders when rendered.
• UV coordinates: Most mesh formats also support some form of UV coordinates which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh. It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).
Figure 6 and Figure 7 show the extensions to the V3C encoder and decoder to support mesh encoding and mesh decoding.
In the encoder extension, shown in Figure 6, the input mesh data 610 is demultiplexed 620 into vertex coordinate and attributes data 625 and mesh connectivity 627, where the mesh connectivity comprises vertex connectivity information. The vertex coordinate and attributes data 625 is coded using MPEG-I V-PCC 630 (such as shown in Figure 1 ), whereas the mesh connectivity data 627 is coded in mesh connectivity encoder 635 as auxiliary data. Both of these are multiplexed 640 to create the final compressed output bitstream 650. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal mesh connectivity encoding.
At the decoder, shown in Figure 7, the input bitstream 750 is demultiplexed 740 to generate the compressed bitstreams for vertex coordinates and attributes data and mesh connectivity. The vertex coordinates and attributes data are decompressed using MPEG-I V-PCC decoder 730. Vertex reordering 725 is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC decoder 730 to match the vertex order at the encoder. Mesh connectivity data is decompressed using mesh connectivity decoder 735. The decompressed data is multiplexed 720 to generate the reconstructed mesh 710.
3D graphics
In 3D graphics objects are represented in 3D space, but the end-user often consumes the content from a flat 2D screen. The process of converting 3D representations into 2D images is generally referred to as “rendering”, which may require dedicated hardware support to enable real-time conversion. Hardware capabilities may be exposed by 3D graphics interfaces such as OpenGL, Vulkan, DirectX or Metal. The functionality offered by these interfaces can be roughly divided into two parts: the first transforms 3D coordinates into 2D screen coordinates and the second part transforms the 2D coordinates into actual colored pixels visible to the end user.
The general pipeline handling these transformations is, however, much more complex and offers programmable stages to accommodate a large variety of rendering techniques. Each step of the graphics pipeline takes as input the output of the previous step. Each programmable step of the pipeline is highly parallelized and optimized to perform a specific task. This leverages the underlying hardware capabilities of graphics processing units (GPUs), which contain thousands of parallel high-frequency processing units and shared memory. These parallel cores are programmable and run small programs called shaders to accommodate the artistic freedom.
The input to a 3D graphics pipeline may comprise 3D models, which may consist of meshes or point clouds, both of which share the same core primitive, a vertex. Point clouds associate vertex-specific attributes such as color with each point to generate a 3D representation. Meshes construct faces by connecting multiple vertices. The advantage of using a mesh-based 3D representation is the watertightness of the generated surface when compared to a point cloud representation, which requires a lot of points to represent a visually solid surface. Figure 8 shows a simplified illustration of how a 3D input model 810 is converted into a 2D representation with the graphics pipeline. In comparison to point cloud -based 3D representation, meshes specifically include the connectivity information between vertices. The connectivity data defines how vertices are connected to form faces and larger surfaces. Materials and textures may then be assigned on the surfaces using UV coordinates and material descriptions. A higher level of visual detail on a surface may then be sampled from the associated texture and materials. The textures and materials determine the 3D object's visual appearance in combination with the lighting in the scene.
A technique called “Draco compression” is used for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics. It supports compressing points, connectivity information, texture coordinates, color information, normals, and any other generic attributes associated with geometry. Draco may compress the 3D mesh either sequentially or using an edgebreaker algorithm. The input to the Draco encoder can be any 3D model, and the encoder compresses it into a Draco bitstream. The compressed bitstream may be decoded back into the original 3D model or, if needed, transcoded into a different format. The present embodiments are discussed by using Draco as an example of a 3D data compression technique. However, it is appreciated that, instead of the Draco compression technique, other techniques suitable for compressing mesh files can also utilize the teachings of the present embodiments.
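For orientation, the reference Draco implementation ships command-line encoder and decoder tools. A typical round trip could be driven as sketched below; the tool names and flags follow the public Draco repository documentation and should be verified against the installed version:

    import subprocess

    # Compress an OBJ mesh into a Draco bitstream (-cl: compression level,
    # -qp: position quantization bits, per the public Draco README).
    subprocess.run(["draco_encoder", "-i", "bunny.obj", "-o", "bunny.drc",
                    "-cl", "7", "-qp", "11"], check=True)

    # Decode the Draco bitstream back into a mesh file.
    subprocess.run(["draco_decoder", "-i", "bunny.drc", "-o", "bunny_decoded.obj"],
                   check=True)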
The compressed bitstream is divided into four parts: Draco header, metadata, connectivity, and attributes. Figure 9 illustrates the general structure of the compressed bitstream, where three different alternative structures 910, 920, 930 for the connectivity data 903 have been separated.
The Draco header 901 contains high-level information about the bitstream and identifies the bitstream as a Draco bitstream. The fields of the header 901 are described below:
ParseHeader() {
    draco_string
    major_version
    minor_version
    encoder_type
    encoder_method
    flags
}

The metadata part 902 allows associating per-attribute metadata or file-level metadata with the Draco bitstream. It can be, for example, used to describe attribute names. The metadata 902 consists of the following information and enables recursively adding further levels of sub-metadata. The metadata 902 is represented by key-value pairs.

void DecodeMetadata() {
    ParseMetadataCount();
    for (i = 0; i < num_att_metadata; ++i) {
        ParseAttributeMetadataId(i);
        DecodeMetadataElement(att_metadata[i]);
    }
    DecodeMetadataElement(file_metadata);
}

void ParseMetadataCount() {
    num_att_metadata
}

void ParseAttributeMetadataId(index) {
    att_metadata_id[index]
}

void ParseMetadataElement(metadata) {
    metadata.num_entries
    for (i = 0; i < metadata.num_entries; ++i) {
        sz = metadata.key_size[i]
        metadata.key[i]
        sz = metadata.value_size[i]
        metadata.value[i]
    }
    metadata.num_sub_metadata
}

void ParseSubMetadataKey(metadata, index) {
    sz = metadata.sub_metadata_key_size[index]
    metadata.sub_metadata_key[index]
}

void DecodeMetadataElement(metadata) {
    ParseMetadataElement(metadata);
    for (i = 0; i < metadata.num_sub_metadata; ++i) {
        ParseSubMetadataKey(metadata, i);
        DecodeMetadataElement(metadata.sub_metadata[i]);
    }
}
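Read as fixed-width fields, the ParseHeader() syntax above can be turned into a small parser. The following sketch assumes the field widths listed in the header semantics (a 5-byte magic string, four 8-bit fields, and 16-bit flags) and big-endian byte order for the flags, which is an assumption made for illustration:

    import struct

    def parse_draco_header(buf):
        # 5-byte magic, two 1-byte version fields, encoder type/method, 16-bit flags.
        magic = buf[0:5]
        if magic != b"DRACO":
            raise ValueError("not a Draco bitstream")
        major, minor, encoder_type, encoder_method = struct.unpack_from("BBBB", buf, 5)
        (flags,) = struct.unpack_from(">H", buf, 9)
        return {"major_version": major, "minor_version": minor,
                "encoder_type": encoder_type,      # 0 = point cloud, 1 = triangular mesh
                "encoder_method": encoder_method,  # 0 = sequential, 1 = edgebreaker
                "flags": flags}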
The structure of the connectivity part 903 depends on the encoding method. It consists of either sequential or edgebreaker information. The type of connectivity information is defined in the Draco header (encoder_method). The sequential connectivity header contains the following fields:

void ParseSequentialConnectivityData() {
    num_faces
    num_points
    connectivity_method
}
The connectivity header contains information such as the number of faces and points in the connectivity bitstream as well as the connectivity method which identifies if the sequential connectivity bitstream consists of compressed indices or uncompressed indices. The rest of the connectivity data contains the connectivity bitstream.
Alternatively, the compressed Draco bitstream may consist of edgebreaker encoded connectivity data instead of the sequential connectivity data. In this case the connectivity header contains information as defined by the ParseEdgebreakerConnectivityData structure.

void ParseEdgebreakerConnectivityData() {
    edgebreaker_traversal_type
    num_encoded_vertices
    num_faces
    num_attribute_data
    num_encoded_symbols
    num_encoded_split_symbols
}
The header provides information such as the traversal type, which indicates the type of the edgebreaker connectivity bitstream. This can be either standard edgebreaker (0) or valence edgebreaker (2). Additionally, it contains information such as the number of encoded vertices and attributes. It also provides information on the number of encoded symbols and split symbols, which are required to decode different parts of the edgebreaker encoded connectivity bitstream. In addition to the connectivity header, the connectivity data contains the connectivity bitstream.
The structure of the connectivity bitstream depends on the traversal type and can include encoded split data, encoded edgebreaker symbol data, encoded start face configuration data, and the attribute connectivity data. It can additionally include a valence header and context data in case the edgebreaker valence traversal type is used.
The attribute data 904 contains two sections. The first part is the attribute header, which indicates how many attributes need to be decoded as well as what components each attribute consists of.

void ParseAttributeDecodersData() {
    num_attributes_decoders
    if (encoder_method == MESH_EDGEBREAKER_ENCODING) {
        for (i = 0; i < num_attributes_decoders; ++i) {
            att_dec_data_id[i]
            att_dec_decoder_type[i]
            att_dec_traversal_method[i]
        }
    }
    for (i = 0; i < num_attributes_decoders; ++i) {
        att_dec_num_attributes[i]
        for (j = 0; j < att_dec_num_attributes[i]; ++j) {
            att_dec_att_type[i][j]
            att_dec_data_type[i][j]
            att_dec_num_components[i][j]
            att_dec_normalized[i][j]
            att_dec_unique_id[i][j]
        }
        for (j = 0; j < att_dec_num_attributes[i]; ++j) {
            seq_att_dec_decoder_type[i][j]
        }
    }
}
The second part of the attribute data comprises compressed attributes, such as positions, texture coordinates, normals, etc. Each attribute type section comprises one or more unique components.
ISOBMFF
Box-structured and hierarchical file format concepts have been widely used for media storage and sharing. The most well-known file formats in this regard are the ISO Base Media File Format (ISOBMFF) and its variants such as MP4 and 3GPP file formats.
ISOBMFF allows storage of timed audio/visual media streams, called media tracks. The metadata which describes a track is separated from the encoded bitstream itself. The format provides mechanisms to access media data in a codec-agnostic fashion from the file parser perspective.
In files conforming to the ISO base media file format, the media data may be provided in one or more instances of MediaDataBox ‘mdat’ and the MovieBox ‘moov’ may be used to enclose the metadata for timed media. In some cases, for a file to be operable, both of the ‘mdat’ and ‘moov’ boxes may be required to be present. The ‘moov’ box may include one or more tracks, and each track may reside in one corresponding TrackBox ‘trak’. Each track is associated with a handler, identified by a four-character code, specifying the track type. Video, audio, and image sequence tracks can be collectively called media tracks, and they contain an elementary media stream. Other track types comprise hint tracks and timed metadata tracks.
Tracks comprise samples, such as audio or video frames. For video tracks, a media sample may correspond to a coded picture or an access unit. A media track refers to samples (which may also be referred to as media samples) formatted according to a media compression format (and its encapsulation to the ISO base media file format). A hint track refers to hint samples, containing cookbook instructions for constructing packets for transmission over an indicated communication protocol. A timed metadata track may refer to samples describing referred media and/or hint samples.
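The box structure underlying all of this is straightforward to traverse. The following minimal sketch walks the top-level boxes of an ISOBMFF file, handling the common 32-bit size, 64-bit largesize, and size-zero (to end of file) forms:

    import struct

    def iter_boxes(data, offset=0, end=None):
        # Each box starts with a 32-bit size and a four-character type.
        end = len(data) if end is None else end
        while offset + 8 <= end:
            size, boxtype = struct.unpack_from(">I4s", data, offset)
            header = 8
            if size == 1:      # 64-bit largesize follows the type field
                (size,) = struct.unpack_from(">Q", data, offset + 8)
                header = 16
            elif size == 0:    # box extends to the end of the file
                size = end - offset
            if size < header:  # malformed box; stop rather than loop forever
                break
            yield boxtype.decode("ascii", "replace"), offset + header, offset + size
            offset += size

    with open("example.mp4", "rb") as f:
        data = f.read()
    for boxtype, body_start, box_end in iter_boxes(data):
        print(boxtype, box_end - body_start)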
SampleDescriptionBox
The 'trak' box includes in its hierarchy of boxes the SampleTableBox (also known as the sample table or the sample table box). The SampleTableBox contains the SampleDescriptionBox, which gives detailed information about the coding type used, and any initialization information needed for that coding. The SampleDescriptionBox contains an entry count and as many sample entries as the entry count indicates. The format of sample entries is track-type specific but derived from generic classes (e.g., VisualSampleEntry, AudioSampleEntry). The type of sample entry form used for deriving the track-type specific sample entry format is determined by the media handler of the track.

aligned(8) abstract class SampleEntry(unsigned int(32) format) extends Box(format) {
    const unsigned int(8)[6] reserved = 0;
    unsigned int(16) data_reference_index;
}

aligned(8) class SampleDescriptionBox(unsigned int(32) handler_type) extends FullBox('stsd', version, 0) {
    int i;
    unsigned int(32) entry_count;
    for (i = 1; i <= entry_count; i++) {
        SampleEntry(); // an instance of a class derived from SampleEntry
    }
}
Derived specifications may derive sample entry classes from those defined in ISO/IEC 14496-12. SampleEntry boxes may contain “extra boxes” not explicitly defined in the box syntax of ISO/IEC 14496-12. When present, such boxes shall follow all defined fields and should follow any defined contained boxes. Decoders shall presume a sample entry box could contain extra boxes and shall continue parsing as though they are present until the containing box length is exhausted.
Sync Samples in ISOBMFF
Several types of stream access points (SAPs) have been specified. SAP Type 1 corresponds to what is known in some coding schemes as a “Closed group of pictures (GOP) random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps) and, in addition, the first picture in decoding order is also the first picture in presentation order. SAP Type 2 corresponds to what is known in some coding schemes as a “Closed GOP random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps), for which the first picture in decoding order may not be the first picture in presentation order. SAP Type 3 corresponds to what is known in some coding schemes as an “Open GOP random access point”, in which there may be some pictures in decoding order that cannot be correctly decoded and have presentation times less than that of the intra-coded picture associated with the SAP.
A stream access point (SAP) sample group as specified in ISOBMFF identifies samples as being of the indicated SAP type.
A sync sample may be defined as a sample corresponding to SAP type 1 or 2. A sync sample can be regarded as a media sample that starts a new independent sequence of samples; if decoding starts at the sync sample, it and succeeding samples in decoding order can all be correctly decoded, and the resulting set of decoded samples forms the correct presentation of the media starting at the decoded sample that has the earliest composition time. Sync samples can be indicated with the SyncSampleBox (for those samples whose metadata is present in a TrackBox) or within sample flags indicated or inferred for track fragment runs.
Items in ISOBMFF
Files conforming to the ISOBMFF may contain any non-timed objects, referred to as items, meta items, or metadata items, in a MetaBox ‘meta’. While the name of the meta box refers to metadata, items can generally contain metadata or media data. The meta box may reside at the top level of the file, within a MovieBox ‘moov’, and within a TrackBox ‘trak’, but at most one meta box may occur at each of the file level, movie level, or track level. The meta box may be required to contain a HandlerReferenceBox ‘hdlr’ indicating the structure or format of the MetaBox ‘meta’ contents. The MetaBox may list and characterize any number of items that can be referred to, and each one of them can be associated with a file name and can be uniquely identified within the file by an item identifier (item_id), which is an integer value. The metadata items may be, for example, stored in the ItemDataBox 'idat' of the MetaBox or in an 'mdat' box, or reside in a separate file. If the metadata is located external to the file, then its location may be declared by the DataInformationBox ‘dinf’. In the specific case that the metadata is formatted using Extensible Markup Language (XML) syntax and is required to be stored directly in the MetaBox, the metadata may be encapsulated into either the XMLBox ‘xml’ or the BinaryXMLBox ‘bxml’. An item may be stored as a contiguous byte range, or it may be stored in several extents, each being a contiguous byte range. In other words, items may be stored fragmented into extents, e.g., to enable interleaving. An extent is a contiguous subset of the bytes of the resource, and the resource can be formed by concatenating the extents.
High Efficiency Image File Format (HEIF) is a standard developed by the Moving Picture Experts Group (MPEG) for storage of images and image sequences. Among other things, the standard facilitates file encapsulation of data coded according to the High Efficiency Video Coding (HEVC) standard. HEIF includes features building on top of the used ISO Base Media File Format (ISOBMFF).
The ISOBMFF structures and features are used to a large extent in the design of HEIF. The basic design of HEIF comprises that still images are stored as items and image sequences are stored as tracks. In the context of HEIF, the following boxes may be contained within the root-level 'meta' box and may be used as described hereinafter. In HEIF, the handler value of the Handler box of the 'meta' box is 'pict'. The resource (whether within the same file, or in an external file identified by a uniform resource identifier) containing the coded media data is resolved through the DataInformationBox 'dinf', whereas the ItemLocationBox 'iloc' stores the position and sizes of every item within the referenced file. The ItemReferenceBox 'iref' documents relationships between items using typed referencing. If there is an item among a collection of items that is in some way to be considered the most important compared to others, then this item is signaled by the PrimaryItemBox 'pitm'. Apart from the boxes mentioned here, the 'meta' box is also flexible to include other boxes that may be necessary to describe items.
Any number of image items can be included in the same file. Given a collection of images stored by using the 'meta' box approach, certain relationships may be qualified between images. Examples of such relationships include indicating a cover image for a collection, providing thumbnail images for some or all of the images in the collection, and associating some or all of the images in a collection with an auxiliary image such as an alpha plane. A cover image among the collection of images is indicated using the 'pitm' box. A thumbnail image or an auxiliary image is linked to the primary image item using an item reference of type 'thmb' or 'auxl', respectively.
The ItemPropertiesBox enables the association of any item with an ordered set of item properties. Item properties are small data records. The ItemPropertiesBox consists of two parts: an ItemPropertyContainerBox that contains an implicitly indexed list of item properties, and one or more ItemPropertyAssociationBox(es) that associate items with item properties. An item property is formatted as a box.
A descriptive item property may be defined as an item property that describes rather than transforms the associated item. A transformative item property may be defined as an item property that transforms the reconstructed representation of the image item content.
An entity group is a grouping of items, which may also group tracks. The entities in an entity group share a particular characteristic or have a particular relationship, as indicated by the grouping type. Entity groups are indicated in GroupsListBox. Entity groups specified in the GroupsListBox of a file-level MetaBox refer to tracks or file-level items. Entity groups specified in the GroupsListBox of a movie-level MetaBox refer to movie-level items. Entity groups specified in the GroupsListBox of a track-level MetaBox refer to track-level items of that track. When GroupsListBox is present in a file-level MetaBox, there is no item_ID value in the ItemInfoBox in any file-level MetaBox that is equal to the track_ID value in any TrackHeaderBox.
GroupsListBox contains EntityToGroupBoxes, each specifying one entity group. The four-character box type of EntityToGroupBox denotes a defined grouping type.
Introduction of mesh coding into MPEG requires new data structures to support compressed formats of connectivity and attribute data. One potential compressed mesh format is the Draco bitstream, which is widely used in the computer graphics industry and is used as an example when discussing details of the present embodiments. A Draco bitstream does not contain means for storing texture information inside the bitstream itself, which means that if a vendor wishes to distribute Draco compressed bitstreams, separate texture file(s) would be needed. This means that other standards such as glTF are used to carry Draco compressed bitstreams along with the texture file(s).
ISOBMFF is an MPEG systems standard which defines how compressed bitstreams can be stored in a single file and provides the requested temporal synchronization features for timed data. It does not currently support storage of Draco compressed bitstreams. Should MPEG proceed with Draco compressed bitstreams in mesh coding, their storage in ISOBMFF may become topical. Furthermore, even without the MPEG mesh compression work, there might be value in providing single-file storage for Draco compressed bitstreams and the associated texture data.
The present embodiments relate to the storage of timed and non-timed Draco compressed bitstreams in an ISOBMFF file along with the associated texture image(s) or video(s). New signalling is provided to indicate that a file contains Draco compressed data.
Depending on the encapsulation mode regarding dynamic data, the file may contain:
- one or more tracks with Draco bitstream and zero or more related texture tracks, or
- one or more tracks with Draco connectivity sub-bitstreams, one or more tracks with Draco attribute sub-bitstreams, and zero or more related texture tracks or items.
The different designs of the present embodiments are represented in Figure 11, Figure 12 and Figure 13.
Encapsulation of static data in ISOBMFF is done using items. Each file can contain one or more items with full Draco bitstream and zero or more related texture items. Item based storage is illustrated in Figure 14.
Several new syntax elements and boxes are disclosed to enable the encapsulation of Draco bitstreams in ISOBMFF. These include:
- Draco Header Box
- Draco Metadata Box
- Draco Connectivity Header
- Draco Attribute Header
- Draco Decoder Configuration Record
- Draco Sample Entry
- Draco Attribute Sample Entry
- Draco Connectivity Data and Draco Attribute Data
- Draco Item Property
- Draco Item Data
- Object Transformation Box
For the purpose of the present disclosure, the terminology illustrated in Figure 10 is applied. Draco bitstream comprises Draco header, Draco metadata, Draco connectivity header, Draco connectivity sub-bitstream, Draco attribute header, and one or more Draco attribute sub-bitstreams.
A Draco bitstream can be either static or timed, which means that different mechanisms for ISOBMFF encapsulation need to be considered. Generally, timed data is stored in tracks and non-timed data is stored as items. These principles are respected, and thus syntax elements for both encapsulations are considered. The encapsulated data is the same Draco bitstream; thus, defining common syntax elements and boxes to be used in both timed and non-timed encapsulation is considered.

Common structures and boxes
The first part of the following disclosure focuses on defining syntax elements that can be shared between different encapsulation modes in ISOBMFF. With respect to this, DracoHeaderBox is defined as follows:

aligned(8) class DracoHeaderBox() extends FullBox('xxxx', version = 0, 0) {
    unsigned char(8) draco_string[5] = "DRACO";
    unsigned int(8) major_version;
    unsigned int(8) minor_version;
    unsigned int(8) encoder_type;
    unsigned int(8) encoder_method;
    unsigned int(16) flags;
}
‘xxxx’ will be a unique four-character code identifier for the Draco header box. draco_string must be equal to “DRACO”. major_version indicates the major version number of the bitstream. minor_version indicates the minor version number of the bitstream. encoder_type indicates if the content has been encoded as a point cloud or as a triangular mesh. encoder_type equal to 0 indicates that the content is a point cloud and encoder_type equal to 1 indicates a triangular mesh. encoder_method indicates the encoding method of the bitstream. encoder_method equal to 0 indicates the sequential encoding method and encoder_method equal to 1 indicates the edgebreaker encoding method. The flags field contains 16 bits for signaling flags; e.g., the flag indicating the presence of metadata in the bitstream uses mask 32767.
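To make the field layout concrete, the following sketch serializes a DracoHeaderBox payload with FullBox framing (size, type, 8-bit version, 24-bit flags). The ‘xxxx’ code is the placeholder used above, and the version numbers are illustrative:

    import struct

    def full_box(boxtype, payload, version=0, flags=0):
        # FullBox = 32-bit size + 4-byte type + 1-byte version + 24-bit flags + payload.
        body = bytes([version]) + flags.to_bytes(3, "big") + payload
        return struct.pack(">I", 8 + len(body)) + boxtype + body

    def draco_header_box(major=2, minor=2, encoder_type=1, encoder_method=1, flags=0):
        # Field widths follow the DracoHeaderBox definition above.
        payload = b"DRACO" + struct.pack(">BBBBH", major, minor,
                                         encoder_type, encoder_method, flags)
        return full_box(b"xxxx", payload)  # 'xxxx' is the placeholder code above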
Due to the recursive design of metadata in the Draco bitstream, a new DracoMetadataBox and DracoMetadata structure are designed to encapsulate the bitstream metadata. These are defined as follows:

aligned(8) struct DracoMetadata() {
    unsigned int(32) num_entries;
    for (i = 0; i < num_entries; ++i) {
        unsigned int(8) metadata_key_size;
        unsigned int(8) key[metadata_key_size];
        unsigned int(8) metadata_value_size;
        unsigned int(8) value[metadata_value_size];
    }
    unsigned int(32) num_sub_metadata;
    for (i = 0; i < num_sub_metadata; ++i) {
        unsigned int(8) sub_metadata_key_size;
        unsigned int(8) sub_metadata_key[sub_metadata_key_size];
        DracoMetadata sub_metadata();
    }
}

num_entries defines how many metadata key-value pairs the structure contains. metadata_key_size provides the size of the array that holds the key for the metadata. key contains the key for the metadata. metadata_value_size contains the size of the array that holds the value of the metadata. value contains the value for the metadata. num_sub_metadata provides the number of sub-metadata elements. sub_metadata_key_size provides the size of the array that holds the key for the sub-metadata. sub_metadata_key holds the key for the sub-metadata. sub_metadata contains the recursive next level of the sub-metadata.

aligned(8) class DracoMetadataBox() extends FullBox('yyyy', version = 0, 0) {
    if (draco_header.flags & 32767 == 0) {
        unsigned int(32) num_attr_metadata;
        for (i = 0; i < num_attr_metadata; ++i) {
            unsigned int(32) attr_metadata_id;
            DracoMetadata attr_metadata;
        }
        DracoMetadata file_metadata;
    }
}
‘yyyy’ will be the unique four-character code identifier for the Draco metadata box. num_attr_metadata provides the number of attribute-specific metadata structures. attr_metadata_id indicates the identifier of the attribute with which the metadata is associated. attr_metadata contains the metadata for the attribute. file_metadata contains file-level metadata for the bitstream.
In addition to the header and metadata, the compressed Draco bitstream consists of two parts: connectivity data and attribute data. Connectivity defines how vertices are connected to each other to form a mesh. Attribute data defines attributes that are associated with said vertices, like position, color or normals. In the bitstream this data is represented in compressed form with related headers.
The connectivity header depends on the type of encoder method, which can be passed as a parameter from the DracoHeaderBox, and can be represented with the structure as follows:

aligned(8) struct DracoConnectivityHeader(unsigned int(8) encoder_method) {
    if (encoder_method == 0) {
        unsigned int(32) max_num_faces;
        unsigned int(32) max_num_points;
        unsigned int(8) connectivity_method;
    } else if (encoder_method == 1) {
        unsigned int(8) edgebreaker_traversal_type;
        unsigned int(32) max_num_encoded_vertices;
        unsigned int(32) max_num_faces;
        unsigned int(8) num_attribute_data;
        unsigned int(32) num_encoded_symbols;
        unsigned int(32) num_encoded_split_symbols;
    }
}
In the case of the sequential encoding method, i.e., encoder_method equal to 0, max_num_faces contains the maximum number of encoded faces in the bitstream. max_num_points contains the maximum number of encoded points in the bitstream. connectivity_method indicates if the sequentially coded indices are compressed or uncompressed. connectivity_method equal to 0 indicates that the indices are compressed, whereas connectivity_method equal to 1 indicates that the indices are uncompressed.
In the case of the edgebreaker-based encoding method, i.e., encoder_method equal to 1, edgebreaker_traversal_type indicates the traversal type of the edgebreaker. edgebreaker_traversal_type equal to 0 indicates standard edgebreaker traversal, equal to 1 indicates predictive edgebreaker traversal, and equal to 2 indicates valence-based edgebreaker traversal. max_num_encoded_vertices indicates the maximum number of encoded vertices. max_num_faces indicates the maximum number of encoded faces. num_attribute_data indicates the number of encoded attributes. num_encoded_symbols indicates the number of encoded edgebreaker symbols. num_encoded_split_symbols indicates the number of encoded edgebreaker split symbols.
The information in the connectivity header provides the client application with an overview of the content in the bitstream. The information regarding the maximum number of faces and points, as well as the connectivity type, indicates to the client what kind of resources are required to play back the content.
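A parser for the DracoConnectivityHeader structure might look as sketched below, assuming the field widths given above and big-endian byte order (an assumption made for illustration):

    import struct

    def parse_connectivity_header(buf, encoder_method):
        # Field widths follow the DracoConnectivityHeader structure above.
        if encoder_method == 0:    # sequential
            max_num_faces, max_num_points, connectivity_method = \
                struct.unpack_from(">IIB", buf, 0)
            return {"max_num_faces": max_num_faces,
                    "max_num_points": max_num_points,
                    "connectivity_method": connectivity_method}
        elif encoder_method == 1:  # edgebreaker
            (traversal_type, max_num_encoded_vertices, max_num_faces,
             num_attribute_data, num_encoded_symbols,
             num_encoded_split_symbols) = struct.unpack_from(">BIIBII", buf, 0)
            return {"edgebreaker_traversal_type": traversal_type,
                    "max_num_encoded_vertices": max_num_encoded_vertices,
                    "max_num_faces": max_num_faces,
                    "num_attribute_data": num_attribute_data,
                    "num_encoded_symbols": num_encoded_symbols,
                    "num_encoded_split_symbols": num_encoded_split_symbols}
        raise ValueError("unknown encoder_method")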
The Draco attribute header provides information about the encoded attribute information in the bitstream. It can be defined as follows:

aligned(8) struct DracoAttributeHeader(unsigned int(8) encoder_method) {
    unsigned int(8) num_attributes_decoders;
    if (encoder_method == 1) {
        for (i = 0; i < num_attributes_decoders; ++i) {
            unsigned int(8) att_dec_data_id;
            unsigned int(8) att_dec_decoder_type;
            unsigned int(8) att_dec_traversal_method;
        }
    }
    for (i = 0; i < num_attributes_decoders; ++i) {
        unsigned int(32) att_dec_num_attributes;
        for (j = 0; j < att_dec_num_attributes; ++j) {
            unsigned int(8) att_dec_att_type;
            unsigned int(8) att_dec_data_type;
            unsigned int(8) att_dec_num_components;
            unsigned int(8) att_dec_normalized;
            unsigned int(32) att_dec_unique_id;
        }
        for (j = 0; j < att_dec_num_attributes; ++j) {
            unsigned int(8) seq_att_dec_decoder_type;
        }
    }
}

aligned(8) class DracoAttributeHeaderBox(unsigned int(8) encoder_method) extends FullBox('zzzz', version = 0, 0) {
    DracoAttributeHeader attribute_header(encoder_method);
}
num_attributes_decoders indicates the number of attributes in the attribute sub-bitstream. att_dec_data_id indicates the decoder identifier for a given attribute decoder. att_dec_decoder_type indicates the type of the attribute decoder; att_dec_decoder_type equal to 0 indicates a mesh vertex attribute decoder and equal to 1 indicates a mesh corner attribute decoder. att_dec_traversal_method indicates the traversal method for the encoder; att_dec_traversal_method equal to 0 indicates depth-first traversal and equal to 1 indicates a prediction degree-based traversal method. att_dec_num_attributes indicates the number of attributes per attribute decoder to be decoded. att_dec_att_type indicates the type of the attribute, which can be position, UV-coordinate, normal or other per-vertex associated data. att_dec_data_type indicates the attribute data type, which can be a floating point type, UINT8 or similar. att_dec_num_components indicates the component count for the attribute, i.e., the number of components that represent the attribute. For a UV-coordinate attribute type the component count would be two. att_dec_normalized indicates if the attribute represents normalized data. att_dec_unique_id indicates the unique decoded id of the attribute. seq_att_dec_decoder_type indicates the sequential decoder: equal to 1 indicates an integer decoder, equal to 2 indicates a quantization decoder and equal to 3 indicates a normal decoder.

The DracoConnectivityHeader and DracoAttributeHeader contain important information that is needed to configure the Draco decoder. For different decoder implementations it may make sense to expose this information in a place where a file parser can easily access it. This allows an implementation to fail early if the bitstream contains technologies that are not supported. The header information also provides general information about the content, such as the maximum number of vertices, which can be useful for the client to decide if the content can be rendered in real time or if adaptation is required. For this purpose, a DracoDecoderConfigurationRecord is defined.

aligned(8) struct DracoDecoderConfigurationRecord() {
    unsigned int(1) connectivity_header_inband;
    unsigned int(1) attribute_header_inband;
    bit(6) reserved;
    DracoConnectivityHeader connectivity_header;
    DracoAttributeHeader attribute_header;
}
connectivity_header_inband indicates if connectivity header(s) are stored as part of the encoded connectivity sub-bitstream. When connectivity_header_inband equals 0, the connectivity header is extracted from the bitstream and only present in the DracoDecoderConfigurationRecord. When connectivity_header_inband equals 1, the connectivity header is present both in the DracoDecoderConfigurationRecord and in the encoded connectivity sub-bitstream, in which case the connectivity header information in the bitstream takes precedence over the information in the DracoDecoderConfigurationRecord. For a track which contains a DracoDecoderConfigurationRecord this means that the number of points in the samples can be different from the number of points signaled in the configuration record, enabling encapsulation of bitstreams where the number of vertices changes during the sequence. attribute_header_inband indicates if attribute header(s) are stored as part of the encoded bitstream. The information in the attribute header is unlikely to change during the sequence, but the flag enables easy injection of attribute data in ISOBMFF by preserving the header data in the bitstream. When attribute_header_inband equals 0, the header(s) are not part of the encoded bitstream, and when attribute_header_inband equals 1, the header(s) are expected in the encoded bitstream as well as in the decoder configuration record. In the latter case, the header information in the bitstream takes precedence over the header information in the DracoDecoderConfigurationRecord. As is the practice with configuration records, DracoDecoderConfigurationRecord is wrapped in a dedicated ISOBMFF box, DracoConfigurationBox, which is defined as follows:

aligned(8) class DracoConfigurationBox() extends FullBox('zzzz', version = 0, 0) {
    DracoDecoderConfigurationRecord configuration;
}

where ‘zzzz’ is a unique four-character code identifier for the Draco configuration box. The configuration contains exactly one DracoDecoderConfigurationRecord.
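The fail-early behaviour enabled by exposing this information can be sketched as a client-side check performed before any bitstream parsing; the capability limits and the dictionary layout of the parsed record are illustrative assumptions:

    def can_play(config, max_vertices_supported=500_000, supported_traversals=(0, 2)):
        # Inspect the decoder configuration record before touching the bitstream.
        hdr = config["connectivity_header"]
        traversal = hdr.get("edgebreaker_traversal_type")
        if traversal is not None and traversal not in supported_traversals:
            return False  # unsupported edgebreaker variant -> fail early
        vertices = hdr.get("max_num_encoded_vertices", hdr.get("max_num_points", 0))
        if vertices > max_vertices_supported:
            return False  # content too heavy for real-time rendering on this client
        return True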
Structures for encoded connectivity data and attribute data are defined, for use in sample and item data definitions. Both DracoConnectivityData and DracoAttributeData contain the non-modified sub-bitstreams for connectivity and attribute information correspondingly, which means that the information from the connectivity header and attribute header is duplicated in the data definitions. This is to present the non-modified bitstream directly to the decoder with no modifications. Furthermore, the storage of attribute headers and connectivity headers is not mandated. The encoded connectivity data shall be defined as follows:

aligned(8) struct DracoConnectivityData() {
    // size is made available through ISOBMFF functionality
    if (DracoDecoderConfigurationRecord.connectivity_header_inband == 1) {
        char(8) connectivity_data[size];
    } else if (DracoHeader.encoder_method == 0) {
        if (DracoConnectivityHeader.connectivity_method == 0) {
            char(8) sequential_compressed_indices[size];
        } else if (DracoConnectivityHeader.connectivity_method == 1) {
            char(8) sequential_uncompressed_indices[size];
        }
    } else if (DracoHeader.encoder_method == 1) {
        char(8) edgebreaker_connectivity_data[size];
    }
}
The connectivity_data byte array shall contain exactly one DecodeConnectivityData element as defined in the Draco bitstream specification, which means that the header data is included in the array. The sequential_compressed_indices byte array shall contain exactly one DecodeSequentialCompressedIndices element as defined in the Draco bitstream specification. sequential_uncompressed_indices shall contain exactly one DecodedSequentialIndices element as defined in the Draco bitstream specification. The edgebreaker_connectivity_data byte array shall contain exactly one of each of the DecodeTopologySplitEvents, EdgebreakerTraversalStart and DecodeEdgeBreakerConnectivity syntax elements, as defined in the Draco bitstream specification, in this explicit order.
The encoded attribute data shall accordingly be stored as follows:

aligned(8) struct DracoAttributeData() {
    // size is made available through ISOBMFF functionality
    if (attribute_header_inband == 1) {
        char(8) attribute_data_including_header[size];
    } else if (attribute_header_inband == 0) {
        char(8) attribute_data_excluding_header[size];
    }
}
The attribute_data_including_header byte array shall contain exactly one DecodeAttributeData element as defined in the Draco bitstream specification, which means that the attribute header data is included. The attribute_data_excluding_header byte array shall contain exactly one DecodeAttributeData element excluding the ParseAttributeDecodersData element, as defined in the Draco bitstream specification. This means that the attribute data does not contain the attribute header.
Storage of Draco bitstream in a track
According to an embodiment, when storage of dynamic Draco bitstreams is considered, the data can be encapsulated in single-track or multi-track mode. Figure 11 illustrates the single-track encapsulation mode, where the Draco bitstream is encapsulated in a single track referred to as a single-stream Draco track. The single-stream Draco track intends to preserve the bitstream as is, but provide useful information about the bitstream for the file parser.
A single-stream Draco track requires defining a new sample entry and sample format. The DracoSampleEntry can be defined as follows:

aligned(8) class DracoSampleEntry() extends VolumetricVisualSampleEntry('aaaa', version = 0, 0) {
    DracoHeaderBox draco_header;
    DracoMetadataBox draco_metadata;
    DracoConfigurationBox configuration;
    // optional boxes
}

where ‘aaaa’ represents a unique four-character code identifier for the Draco sample entry. A single-stream Draco track requires that the connectivity and attribute headers are stored along with the bitstream. The header information is also copied into the Draco configuration record, which sets the following restrictions for the sample entry:
• DracoConfigurationRecord.connectivity_header_inband shall be equal to 1
• DracoConfigurationRecord.attribute_header_inband shall be equal to 1
The samples of a Draco track are defined as follows:

aligned(8) class DracoSample() {
    // sample_size value is the size of the sample from the SampleSizeBox
    char draco_payload[sample_size];
}
The draco_payload byte array contains data representing a single element of DecodeConnectivityData and DecodeAttributeData as defined in the Draco bitstream specification. The structures explicitly contain the header data for connectivity and attributes. The sample_size information is provided by the SampleSizeBox in ISOBMFF.
All samples in a Draco track can be marked as sync samples, as there is no inter prediction between samples. A decoder can take the sample entry information and any sample in the track to decode a mesh or point cloud.
The ISOBMFF file may carry additional texture attribute information that was compressed as video, e.g., color information corresponding to a Draco compressed mesh. The video compressed texture data can be stored as samples of another video track. In such a scenario, the file format should inform a file parser how to link those tracks. For this purpose, a track group may be used. Setting the same track group identification for the single-stream Draco track and the video track means that the video track contains texture information that can be consumed together with the Draco bitstream.
According to another embodiment, the single-stream Draco track may consist of sub-samples or sample groups, which allow defining byte-ranges of sample data that correspond to connectivity sub-bitstream or individual attribute sub-bitstreams. This encapsulation is illustrated in Figure 12. The sample entry for such encapsulation may be defined as earlier, with the following restrictions:
• DracoConfigurationRecord.attribute_header_inband must be set to 0, to indicate that the attribute headers are not part of the sample data.
• DracoConfigurationRecord.connectivity_header_inband must be set to 1.
Sub-sample definitions or sample groups for the connectivity sub-bitstream and for individual attribute sub-bitstreams indicate where such data is located. Both sub-sample-based and sample-group-based signaling must utilize att_dec_unique_id, which indicates the identification of the attribute stored in each sub-sample or sample group.

The benefit of this encapsulation mode is to enable selective access to specific attributes of the Draco bitstream. This allows a client implementation to selectively decode only the attribute information that it requires.
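Selective access could then proceed along the following lines; the sub-sample bookkeeping is shown as plain tuples rather than actual box parsing, so the structure is illustrative:

    def extract_attribute(sample_bytes, subsample_map, wanted_unique_id):
        # subsample_map: list of (att_dec_unique_id, offset, size) entries for one
        # sample, as would be derived from SubSampleInformationBox signalling.
        for unique_id, offset, size in subsample_map:
            if unique_id == wanted_unique_id:
                return sample_bytes[offset:offset + size]
        return None  # attribute not present; the client can skip decoding it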
Samples that belong to a connectivity sample group or connectivity sub-samples shall use the following sample format:

aligned(8) class DracoConnectivitySample() {
    // in case of sample groups, the size is derived from SampleSizeBox
    // in case of sub-samples, the size of the sample is derived from the
    // subsample_size field in SubSampleInformationBox
    DracoConnectivityData connectivity_data[size];
}
The samples that belong to an attribute sample group or sub-samples shall use the following sample format:

aligned(8) class DracoAttributeSample() {
    // in case of sample groups, the size is derived from SampleSizeBox
    // in case of sub-samples, the size of the sample is derived from the
    // subsample_size field in SubSampleInformationBox
    DracoAttributeData attribute_data[size];
}
According to another embodiment, Draco bitstream can be stored in multiple tracks. One track is used to store connectivity data and general configurations, whereas new tracks are used to store individual attributes. This encapsulation is illustrated in Figure 13.
The track(s) which contain the connectivity data shall be referred to as Draco connectivity track(s), whereas the tracks containing the attribute data shall be referred to as Draco attribute tracks. The Draco connectivity track shall contain a DracoSampleEntry as described earlier in this specification and, in addition, a track reference box which contains track references to all Draco attribute tracks. A new four-character code for the track references can be defined. For that, the present disclosure uses ‘drat’ as an example.
Each Draco attribute track shall contain a Draco attribute sample entry, which can be defined as follows:

aligned(8) class DracoAttributeSampleEntry() extends VolumetricVisualSampleEntry('bbbb', version = 0, 0) {
    DracoAttributeConfigurationBox configuration;
    DracoAttributeHeaderBox attribute_header; // optional
}

where ‘bbbb’ represents a unique identifier for the Draco attribute sample entry. att_dec_unique_id indicates the identifier for the attribute information in the track. Together with the attribute header information of the Draco connectivity track, it can be used to decode the attribute information in the track.
Draco attribute sample entry may contain an optional Draco attribute header box, which stores only attribute information related to the att_dec_unique_id.
Storage of Draco bitstream as item
When a Draco bitstream contains static data, i.e., data that does not change temporally, it can be stored as an item in ISOBMFF. Figure 14 illustrates the encapsulation.
When a Draco bitstream is stored as a non-timed item, a new item property is defined along with item data, item references or entity groups. Figure 14 illustrates the encapsulation design using the entity box. The item property can be defined as follows:

aligned(8) class DracoItemProperty extends FullItemProperty('cccc', version = 0, flags) {
    DracoHeaderBox draco_header;
    DracoMetadataBox draco_metadata;
    DracoConfigurationBox configuration;
    // Optional boxes
}

where ‘cccc’ represents a unique four-character code identifier for the Draco item property.
The Draco item property is associated with the relevant item using the item property association box. The location of the item is indicated by the item location box. The item data for a Draco bitstream can be defined as follows:

aligned(8) class DracoItemData() {
    // item_size value is equal to the sum of the extent_length values of
    // each extent of the item, as specified in the ItemLocationBox
    char(8) draco_payload[item_size];
}

The draco_payload byte array contains data representing a single element of DecodeConnectivityData and DecodeAttributeData as specified in the Draco bitstream specification.
Alternatively, the item data may be defined using the DracoConnectivityData and DracoAttributeData definitions proposed earlier in this disclosure, if further subdivision of and access to the item data is desired.

aligned(8) class DracoItemData() {
    DracoConnectivityData connectivity_data[connectivity_size];
    DracoAttributeData attribute_data[attribute_size];
}

where the connectivity and attribute size and location would be provided by separate values of extent_offset and extent_length signaled in the ItemLocationBox ‘iloc’. It can be enforced that extent_count shall be equal to 2 and that the first extent indicates the position of the connectivity data and the second extent indicates the position of the attribute data.
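With extent_count fixed to 2, a reader could resolve the two byte ranges as sketched below, assuming the extents are available as (offset, length) pairs taken from the ‘iloc’ box:

    def split_draco_item(file_bytes, extents):
        # extents: [(extent_offset, extent_length), ...] from 'iloc'; exactly two expected.
        assert len(extents) == 2, "connectivity and attribute extents required"
        (c_off, c_len), (a_off, a_len) = extents
        connectivity_data = file_bytes[c_off:c_off + c_len]
        attribute_data = file_bytes[a_off:a_off + a_len]
        return connectivity_data, attribute_data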
To link Draco bitstream with optional image items, entity grouping may be considered. When an entity group identifies both Draco item and image item, it shall be assumed that the image item contains texture data that can be consumed together with the Draco compressed object. Alternatively, item references can be defined linking Draco item to the image item.
Additional functionality
When a file contains multiple Draco items or tracks, it may be beneficial to provide object transformations for each object, describing the relative positioning, orientation, and scaling between the objects. For this purpose, a new box can be defined as described below:

aligned(8) class ObjectTransformation() {
    float(32) position[3];
    float(32) rotation_quat[3];
    float(32) scale[3];
}

aligned(8) class ObjectTransformationBox() extends FullBox('yyyy', version = 0, 0) {
    ObjectTransformation transformation;
}
Object transformation box may be placed in Draco sample entry or Draco item property.
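For illustration, the transformation triple maps to a conventional 4x4 model matrix as sketched below. Since ObjectTransformation stores only three rotation floats, the sketch assumes they are the (x, y, z) part of a unit quaternion with w reconstructed, which is an assumption rather than something defined above:

    import math

    def model_matrix(position, rotation_quat_xyz, scale):
        # Reconstruct w for a unit quaternion from its (x, y, z) part — an
        # assumption, since ObjectTransformation stores three rotation floats.
        x, y, z = rotation_quat_xyz
        w = math.sqrt(max(0.0, 1.0 - (x*x + y*y + z*z)))
        # Rotation matrix from the quaternion.
        r = [
            [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
            [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
            [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
        ]
        sx, sy, sz = scale
        tx, ty, tz = position
        # Rotation times scale in the upper 3x3, translation in the last column.
        return [
            [r[0][0]*sx, r[0][1]*sy, r[0][2]*sz, tx],
            [r[1][0]*sx, r[1][1]*sy, r[1][2]*sz, ty],
            [r[2][0]*sx, r[2][1]*sy, r[2][2]*sz, tz],
            [0.0,        0.0,        0.0,        1.0],
        ]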
The method according to an embodiment is shown in Figure 15. The method generally comprises receiving 1510 media data comprising three-dimensional models being formed of meshes or point clouds; compressing 1520 a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and storing 1530 the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams. Each of the steps can be implemented by a respective module of a computer system.
An apparatus according to an embodiment comprises means for receiving media data comprising three-dimensional models being formed of meshes or point clouds; means for compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and means for storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 15 according to various embodiments.
An example of an apparatus is disclosed with reference to Figure 16. Figure 16 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
The various embodiments may provide advantages. For example, the present embodiments enable distribution of Draco compressed 3D assets with ISOBMFF and leverage the functionality offered by ISOBMFF, such as temporal random access. The present embodiments focus on minimal processing requirements to extract the original bitstream from ISOBMFF, thus enabling efficient implementations. General information about the bitstream is exposed in ISOBMFF structures that allow client applications to quickly allocate decoding resources or fail fast if a feature of the Draco bitstream is not supported by the client. This can be done before any bitstream parsing is initiated.
The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.
A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as, defined in the appended claims.

Claims:
1. An apparatus comprising:
- means for receiving media data comprising three-dimensional models being formed of meshes or point clouds;
- means for compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and
- means for storing the compressed geometry bitstreams in a box- structured file format with associated texture bitstreams.
2. The apparatus according to claim 1, further comprising means for indicating in a box that a file contains bitstreams being compressed with the algorithm for mesh compression.
3. The apparatus according to claim 1 or 2, further comprising means for including one or more of the following boxes relating to the compressed geometry bitstreams into the box-structured file format:
- decoder configuration record to indicate information to configure a decoder;
- sample entry to indicate dynamic bitstream in a single-track encapsulation;
- attribute sample entry to indicate attribute information of a track;
- item property to indicate a static data item of the compressed bitstream;
- item data to indicate a location of the static data item;
- object transformation to indicate relative positioning, orientation, and scaling between objects.
4. The apparatus according to claim 1 or 2 or 3, wherein the algorithm is a Draco compression algorithm.
5. The apparatus according to any of the claims 1 to 4, wherein a file comprises one or more tracks with geometry bitstream and zero or more related texture tracks.
6. The apparatus according to any of the claims 1 to 4, wherein the file comprises one or more tracks with connectivity sub-bitstreams, one or more tracks with attribute sub-bitstreams, and zero or more related texture tracks.
7. The apparatus according to any of the claims 1 to 6, wherein the file is an ISOBMFF file.
8. The apparatus according to any of the claims 1 to 7, wherein timed bitstream is stored in tracks in the file.
9. The apparatus according to any of the claims 1 to 8, wherein non-timed bitstream is stored as items in the file.
10. A method, comprising:
- receiving media data comprising three-dimensional models being formed of meshes or point clouds;
- compressing a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and
- storing the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
11. The method according to claim 10, further comprising indicating in a box that a file contains bitstreams being compressed with the algorithm for mesh compression.

12. The method according to claim 10 or 11, wherein a file comprises one or more tracks with geometry bitstream and zero or more related texture tracks.

13. The method according to claim 10 or 11, wherein the file comprises one or more tracks with connectivity sub-bitstreams, one or more tracks with attribute sub-bitstreams, and zero or more related texture tracks.

14. The method according to any of the claims 10 to 13, wherein timed bitstream is stored in tracks in the file and wherein non-timed bitstream is stored as items in the file.

15. An apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive media data comprising three-dimensional models being formed of meshes or point clouds;
- compress a three-dimensional model with an algorithm suited for a compression of meshes or point clouds to provide one or more compressed geometry bitstreams, each geometry bitstream comprising a header, metadata, connectivity data and attributes data; and
- store the compressed geometry bitstreams in a box-structured file format with associated texture bitstreams.
PCT/FI2022/050834 2022-01-27 2022-12-14 A method, an apparatus and a computer program product for video coding WO2023144439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20225066 2022-01-27
FI20225066 2022-01-27

Publications (1)

Publication Number Publication Date
WO2023144439A1 (en)

Family

ID=87470874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2022/050834 WO2023144439A1 (en) 2022-01-27 2022-12-14 A method, an apparatus and a computer program product for video coding

Country Status (1)

Country Link
WO (1) WO2023144439A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210105492A1 (en) * 2019-10-02 2021-04-08 Nokia Technologies Oy Method and apparatus for storage and signaling of sub-sample entry descriptions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A thesis submitted to the Delft University of Technology in partial fulfillment of the requirements for the degree of Master of Science in Geomatics TU Delft", 2 July 2020, TU DELFT, NL, article VAN LIEMPTE JORDI: " CityJSON: does (file) size matter?", pages: 1 - 138, XP093083363 *
"INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC 1/SC 29/WG 2 MPEG TECHNICAL REQUIREMENTS - 136th meeting", 1 October 2021, ISO, article MPEG: "CfP for Dynamic Mesh Coding", pages: 1 - 38, XP093083359 *

Similar Documents

Publication Publication Date Title
US20220116659A1 (en) A method, an apparatus and a computer program product for volumetric video
WO2020012073A1 (en) Method and apparatus for storage and signaling of compressed point clouds
CN112019857A (en) Method and apparatus for storage and signaling of compressed point clouds
EP4131961A1 (en) Device for transmitting point cloud data, method for transmitting point cloud data, device for receiving point cloud data, and method for receiving point cloud data
CN114930813B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
CN114747219A (en) Method and apparatus for storing and signaling sub-sample entry descriptions
US20230050860A1 (en) An apparatus, a method and a computer program for volumetric video
WO2020070379A1 (en) Method and apparatus for storage and signaling of compressed point clouds
CN115398890B (en) Point cloud data transmitting device, point cloud data transmitting method, point cloud data receiving device and point cloud data receiving method
US11711535B2 (en) Video-based point cloud compression model to world signaling information
WO2021260266A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021191495A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023144445A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US20230129875A1 (en) A method, an apparatus and a computer program product for volumetric video encoding and video decoding
WO2023144439A1 (en) A method, an apparatus and a computer program product for video coding
WO2021205068A1 (en) A method, an apparatus and a computer program product for volumetric video coding
WO2021191500A1 (en) An apparatus, a method and a computer program for volumetric video
US20230171427A1 (en) Method, An Apparatus and a Computer Program Product for Video Encoding and Video Decoding
US20220292763A1 (en) Dynamic Re-Lighting of Volumetric Video
WO2023175243A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2023047021A2 (en) A method, an apparatus and a computer program product for video encoding and video decoding
US11974026B2 (en) Apparatus, a method and a computer program for volumetric video
EP3873095A1 (en) An apparatus, a method and a computer program for omnidirectional video
WO2023041838A1 (en) An apparatus, a method and a computer program for volumetric video
WO2023002315A1 (en) Patch creation and signaling for v3c dynamic mesh compression

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22923690

Country of ref document: EP

Kind code of ref document: A1