WO2021176133A1 - An apparatus, a method and a computer program for volumetric video - Google Patents

An apparatus, a method and a computer program for volumetric video

Publication number
WO2021176133A1
WO2021176133A1 PCT/FI2021/050110 FI2021050110W WO2021176133A1 WO 2021176133 A1 WO2021176133 A1 WO 2021176133A1 FI 2021050110 W FI2021050110 W FI 2021050110W WO 2021176133 A1 WO2021176133 A1 WO 2021176133A1
Authority
WO
WIPO (PCT)
Prior art keywords
bitstream
video
volumetric video
components
encoded
Prior art date
Application number
PCT/FI2021/050110
Other languages
French (fr)
Inventor
Lauri Aleksi ILOLA
Sebastian Schwarz
Lukasz Kondrad
Emre Baris Aksu
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2021176133A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30 Image reproducers
    • H04N13/388 Volumetric displays, i.e. systems where the image is built up from picture elements distributed through a volume
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46 Embedding additional information in the video signal during the compression process
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Definitions

  • the present invention relates to an apparatus, a method and a computer program for volumetric video compression.
  • BACKGROUND [0002]
  • 3DoF+ 3 degrees-of-freedom and greater
  • Volumetric video compression typically segments the 3D content into a set of two dimensional (2D) patches containing color and geometry data, which can then be compressed using a standard 2D video compression format.
  • volumetric video compression is currently typically being explored and standardized in the MPEG-I Point Cloud Compression (PCC) and 3DoF+ efforts, for example.
  • PCC MPEG-I Point Cloud Compression
  • 3DoF+ volumetric video compression can use 3D scene segmentation to generate views that can be packed into atlases and efficiently encoded using existing 2D compression technologies such as H.265 or H.264.
  • a standard metadata format may need to be defined that efficiently describes information required for view synthesis.
  • the current video-based point cloud compression (V-PCC) specification defines similar metadata structures while setting limits to patch packing strategies by defining shared patch layouts for all components (color, depth, etc.) of volumetric video.
  • The V-PCC (23090-5) and MIV (23090-12) specifications do not accommodate the flexibility to define different patch layouts per video encoded component, and the encapsulation of V-PCC (23090-10) does not consider that each video encoded component can have its own metadata (atlas/patch layout) information. As such, signalling and storage to enable separation of patch layouts are yet to be explored. Therefore, there is also an ongoing need for approaches that apply different packing methods to each component of volumetric video and define an accompanying metadata format that supports view reconstruction by client devices.
  • SUMMARY [0006] Now, an improved method and technical equipment implementing the method has been invented, by which the above problems are alleviated.
  • a new 3VC specific SEI (supplementary enhancement information) message for V-PCC bitstream may be used, such as a separate_atlas_component().
  • the SEI message may be inserted in a NAL stream (also referred to as the atlas bitstream), signalling which component the following or preceding NAL units are applied to.
  • the SEI message may be defined as a prefix or a suffix.
  • NAL units are applied to all related video encoded components. [0009] This kind of design may provide flexibility to signal per component NAL units, which may enable signalling different patch layouts and parameter sets for each video encoded component.
  • the new SEI message may contain at least a component_type attribute as well as an attribute_type attribute.
  • Default value for component type could be assigned to indicate that NAL units are applied to all video encoded components.
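The scoping behaviour described above can be sketched in Python. This is only an illustration under stated assumptions: the `NalUnit` and `SeparateAtlasComponentSei` classes, the numeric component type values, the `ALL_COMPONENTS` default of 0, and the rule that a prefix SEI message stays in effect until the next one are all hypothetical and are not taken from the V-PCC syntax.

```python
from dataclasses import dataclass

ALL_COMPONENTS = 0  # hypothetical default: NAL units apply to every component


@dataclass(frozen=True)
class SeparateAtlasComponentSei:
    """Hypothetical in-memory form of a separate_atlas_component() message."""
    component_type: int
    attribute_type: int = 0


@dataclass(frozen=True)
class NalUnit:
    """Hypothetical atlas NAL unit; only a label is kept for this sketch."""
    name: str


def scope_nal_units(stream):
    """Pair each NAL unit with the component type declared by the most
    recent prefix SEI message (ALL_COMPONENTS if none has been seen)."""
    current = ALL_COMPONENTS
    scoped = []
    for item in stream:
        if isinstance(item, SeparateAtlasComponentSei):
            current = item.component_type  # assumed to persist until replaced
        else:
            scoped.append((item.name, current))
    return scoped


stream = [
    NalUnit("atlas_sps"),
    SeparateAtlasComponentSei(component_type=2),
    NalUnit("patch_layout_geometry"),
    SeparateAtlasComponentSei(component_type=3, attribute_type=1),
    NalUnit("patch_layout_texture"),
]
scoped = scope_nal_units(stream)
```

In this sketch the first NAL unit precedes any SEI message, so it falls back to the default and applies to all components, while the two patch layouts are each tied to a single component.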
  • Patch layouts may be signalled in separate tracks of timed metadata per video encoded component describing patch layout.
  • Each layer of an atlas contains a different patch layout. Each video component or group of video components is assigned to a different layer of an atlas (distinguished by nuh_layer_id). Therefore, the mapping of an atlas layer to a video component or group of video components may be signalled.
  • V-PCC parameter set level V-PCC unit type of VPCC_VPS
  • All the parameter sets have an extensions mechanism that can be utilized to provide such information.
  • An advantage of some embodiments is to improve compression of 3VC V-PCC and 3VC MIV by enabling the application of different packing strategies for different video encoded component types.
  • the patch layout separation provides flexibility to develop and carry novel packing strategies that are not yet known, thus future proofing the related technologies.
  • a method comprising obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
  • An apparatus comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a presentation comprising volumetric video content; obtain two or more components of the volumetric video content; pack the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signal, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
  • An apparatus comprises means for: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
  • a method comprises receiving a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; depacking the two or more components of the volumetric video content from separate patches by using the settings and patch information.
  • Fig.1a shows an encoder and decoder for encoding and decoding omnidirectional video content according to OMAF standard
  • Fig.1b shows an example of image stitching, projection and region-wise packing
  • Fig.1c shows an example of a process of forming a monoscopic equirectangular panorama picture
  • Figs.2a and 2b show a compression and a decompression process for 3D volumetric video
  • Figs.3a and 3b show an example of a point cloud frame and a projection of points to a corresponding plane of a point cloud bounding box
  • Fig.4 illustrates compression of metadata for volumetric video scene as a video compression pipeline
  • Fig.5 shows a flow chart for signal
  • Fig.8 shows a schematic diagram of an example multimedia communication system within which various embodiments may be implemented.
  • DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS [0032] In the following, several embodiments of the invention will be described in the context of omnidirectional video coding and point cloud compressed (PCC) objects. It is to be noted, however, that the invention is not limited to PCC objects.
  • Volumetric video data represents a three-dimensional scene or object and can be used as input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g.
  • Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxels.
  • Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.
  • Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.
  • Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used.
  • Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold.
  • Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.
  • the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance.
  • Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities. [0036] Alternatively, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one or more geometries.
  • volumetric video compression can often generate an array of patches by decomposing the point cloud data into a plurality of patches.
  • the patches are mapped to a 2D grid and, in some instances, an occupancy map is generated from any of a variety of attributes (such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like), where occupied pixels are pixels which have valid attribute values, e.g., depth values and/or color values.
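As a rough illustration of the occupancy map described above, the following Python sketch marks a pixel as occupied when it holds a valid depth value. Using `None` for an empty pixel and a nested list for the 2D grid are assumptions made for this example only.

```python
def occupancy_map(depth):
    """Return a 0/1 grid marking the pixels that hold a valid depth value.

    `depth` is a list of rows; None marks a pixel that no patch covers
    (an assumption of this sketch, not a format defined by the source).
    """
    return [[0 if value is None else 1 for value in row] for row in depth]


depth = [
    [None, 12, 13],
    [None, None, 14],
]
occ = occupancy_map(depth)
```

A decoder could use such a grid to skip the padding pixels between patches when reconstructing geometry and texture.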
  • Geometry images, texture images and/or the like may then be generated for subsequent storage and/or transmission.
  • the compressed images may thereafter be decompressed and the geometry and texture may be reconstructed, such that the image may then be viewed.
  • a 3D surface is projected onto a 2D grid.
  • the 2D grid has a finite resolution.
  • two or more points of the 3D surface may be projected on the same 2D pixel location.
  • the image generation process exploits the 3D to 2D mapping to store the geometry and texture of the point cloud as images.
  • each patch is projected onto two images, referred to as layers (or maps).
  • the first geometry layer is encoded as it is and the second geometry layer is encoded as a delta to the first layer. Texture frames may be generated similarly, but both texture layer 1 and layer 2 may be encoded as separated texture frames.
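The two-layer idea can be sketched in Python as follows. This is a simplified illustration, not the codec's actual procedure: the choice of nearest depth for the first layer and farthest depth for the second, the `(x, y, depth)` point format, and all function names are assumptions.

```python
def project_two_layers(points, width, height):
    """Project (x, y, depth) points onto a width x height grid.

    When several points land on the same pixel, layer 0 keeps the
    nearest depth and layer 1 keeps the farthest depth.
    """
    near = [[None] * width for _ in range(height)]
    far = [[None] * width for _ in range(height)]
    for x, y, d in points:
        if near[y][x] is None or d < near[y][x]:
            near[y][x] = d
        if far[y][x] is None or d > far[y][x]:
            far[y][x] = d
    return near, far


def delta_encode(near, far):
    """Express the second layer as a per-pixel difference to the first."""
    return [[None if n is None else f - n for n, f in zip(nrow, frow)]
            for nrow, frow in zip(near, far)]


def delta_decode(near, delta):
    """Recover the second layer from the first layer and the delta."""
    return [[None if n is None else n + d for n, d in zip(nrow, drow)]
            for nrow, drow in zip(near, delta)]


# Two points share pixel (0, 0); one point occupies pixel (1, 0).
points = [(0, 0, 5), (0, 0, 8), (1, 0, 3)]
near, far = project_two_layers(points, width=2, height=1)
delta = delta_encode(near, far)
restored = delta_decode(near, delta)
```

Because neighbouring surfaces are usually close together, the delta values tend to be small, which is what makes coding the second layer relative to the first attractive.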
  • one approach involves absolute coding with reconstruction correction.
  • Another approach to retain the high frequency features involves geometry-based point interpolation.
  • the compression efficiency of geometry images is improved by replacing some of the geometry information explicitly encoded using geometry images by a point interpolation algorithm.
  • In contrast to traditional 2D cameras, which capture a relatively narrow field of view, three-dimensional (3D) cameras are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as a 360-degree field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions).
  • available media file format standards include International Standards Organization (ISO) base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file format for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and High Efficiency Video Coding standard (HEVC or H.265/HEVC).
  • omnidirectional may refer to media content that may have greater spatial extent than a field-of-view of a device rendering the content.
  • Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than 360 degree-view in the horizontal direction and/or 180 degree-view in the vertical direction.
  • a panoramic image covering a 360-degree field-of-view horizontally and a 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using the equirectangular projection (ERP).
  • the horizontal coordinate may be considered equivalent to a longitude
  • the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied.
  • panoramic content with a 360-degree horizontal field-of-view, but with less than a 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane.
  • panoramic content may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of an equirectangular projection format.
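The equirectangular relationship between picture coordinates and sphere angles can be sketched in Python. The sign conventions here are assumptions for illustration (longitude increasing from left to right, latitude decreasing from top to bottom, angles taken at pixel centres), as are the function and parameter names; the `h_fov` and `v_fov` arguments model the reduced-coverage special cases.

```python
def erp_pixel_to_angles(col, row, width, height, h_fov=360.0, v_fov=180.0):
    """Map a pixel centre of an equirectangular picture to
    (longitude, latitude) in degrees.

    The mapping is linear in both directions: the column index is
    proportional to longitude and the row index to latitude.
    """
    lon = (col + 0.5) / width * h_fov - h_fov / 2.0
    lat = v_fov / 2.0 - (row + 0.5) / height * v_fov
    return lon, lat


# Full sphere: 360 x 180 picture, one pixel per degree.
centre = erp_pixel_to_angles(180, 90, 360, 180)

# Reduced vertical coverage: the polar areas are simply not present.
top_row = erp_pixel_to_angles(0, 0, 360, 120, v_fov=120.0)
```

With a 120-degree vertical field-of-view the first row sits just below 60 degrees of latitude rather than just below the pole, while the horizontal mapping is unchanged.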
  • Immersive multimedia consumption, such as omnidirectional content consumption, is more complex for the end user than the consumption of 2D content. This is due to the higher degree of freedom available to the end user. The freedom also results in more uncertainty.
  • the MPEG Omnidirectional Media Format (OMAF; ISO/IEC 23090-2) v1 standardized the omnidirectional streaming of single 3DoF (3 Degrees of Freedom) content, where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll).
  • a viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user.
  • a current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s).
  • a video rendered by an application on a head-mounted display renders a portion of the 360-degree video, which is referred to as a viewport.
  • a viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display.
  • a viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV).
  • the 360-degree space may be divided into a discrete set of viewports, each separated by a given distance (e.g., expressed in degrees), so that the omnidirectional space can be imagined as a map of overlapping viewports, and the viewport is switched discretely as the user changes his/her orientation while watching content with a head-mounted display (HMD).
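A minimal Python sketch of discrete viewport switching, restricted to the horizontal direction. The function name, the use of a single yaw angle, and the requirement that the spacing divides 360 evenly are assumptions made for this illustration.

```python
def nearest_viewport(yaw_deg, step_deg):
    """Return the centre (degrees, in [0, 360)) of the discrete viewport
    closest to the user's yaw.

    Viewport centres are assumed to sit at multiples of step_deg, and
    step_deg is assumed to divide 360 evenly.
    """
    index = round((yaw_deg % 360.0) / step_deg)
    return (index * step_deg) % 360.0


selected = nearest_viewport(47.0, 30.0)
wrapped = nearest_viewport(350.0, 30.0)
```

As the user turns, the selected centre stays constant until the orientation crosses the midpoint between two neighbouring centres, which is what makes the switching discrete.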
  • Fig.1a illustrates the OMAF system architecture.
  • the system can be situated in a video camera, or in a network server, for example.
  • an omnidirectional media (A) is acquired. If the OMAF system is part of the video source, the omnidirectional media (A) is acquired from the camera means. If the OMAF system is in a network server, the omnidirectional media (A) is acquired from a video source over network.
  • a real-world audio-visual scene (A) may be captured 120 by audio sensors as well as a set of cameras or a camera device with multiple lenses and sensors. The acquisition results in a set of digital image/video (Bi) and audio (Ba) signals.
  • the cameras/lenses may cover all directions around the center point of the camera set or camera device, thus the name of 360-degree video.
  • Audio can be captured using many different microphone configurations and stored as several different content formats, including channel-based signals, static or dynamic (i.e. moving through the 3D scene) object signals, and scene-based signals (e.g., Higher Order Ambisonics).
  • the channel-based signals may conform to one of the loudspeaker layouts defined in CICP (Coding-Independent Code Points).
  • the loudspeaker layout signals of the rendered immersive audio program may be binauralized for presentation via headphones.
  • the images (Bi) of the same time instance are stitched, projected, and mapped 121 onto a packed picture (D).
  • the input images of one time instance may be stitched to generate a projected picture representing one view.
  • An example of image stitching, projection, and region-wise packing process for monoscopic content is illustrated with Fig.1b.
  • Input images (Bi) are stitched and projected onto a three-dimensional projection structure that may for example be a unit sphere.
  • the projection structure may be considered to comprise one or more surfaces, such as plane(s) or part(s) thereof.
  • a projection structure may be defined as a three-dimensional structure consisting of one or more surface(s) on which the captured VR image/video content is projected, and from which a respective projected picture can be formed.
  • the image data on the projection structure is further arranged onto a two-dimensional projected picture (C).
  • projection may be defined as a process by which a set of input images are projected onto a projected picture.
  • representation formats of the projected picture including for example an equirectangular projection (ERP) format and a cube map projection (CMP) format. It may be considered that the projected picture covers the entire sphere.
  • a region-wise packing is then applied to map the projected picture (C) onto a packed picture (D). If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding.
  • regions of the projected picture (C) are mapped onto a packed picture (D) by indicating the location, shape, and size of each region in the packed picture, and the packed picture (D) is given as input to image/video encoding.
  • region-wise packing may be defined as a process by which a projected picture is mapped to a packed picture.
  • packed picture may be defined as a picture that results from region-wise packing of a projected picture.
  • Both views (CL, CR) can be mapped onto the same packed picture (D), and encoded by a traditional 2D video encoder.
  • each view of the projected picture can be mapped to its own packed picture, in which case the image stitching, projection, and region-wise packing is performed as illustrated in Fig.1b.
  • a sequence of packed pictures of either the left view or the right view can be independently coded or, when using a multiview video encoder, predicted from the other view.
  • An example of the image stitching, projection, and region-wise packing process for stereoscopic content, where both views are mapped onto the same packed picture as shown in Fig. 1a, is described next in more detail.
  • Input images (Bi) are stitched and projected onto two three-dimensional projection structures, one for each eye.
  • the image data on each projection structure is further arranged onto a two-dimensional projected picture (CL for left eye, CR for right eye), which covers the entire sphere.
  • Frame packing is applied to pack the left view picture and right view picture onto the same projected picture.
  • region-wise packing is then applied to pack the projected picture onto a packed picture, and the packed picture (D) is given as input to image/video encoding. If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding.
  • the image stitching, projection, and region-wise packing process can be carried out multiple times for the same source images to create different versions of the same content, e.g. for different orientations of the projection structure. Similarly, the region-wise packing process can be performed multiple times from the same projected picture to create more than one sequence of packed pictures to be encoded.
  • 360-degree panoramic content i.e., images and video
  • the vertical field-of-view may vary and can be e.g. 180 degrees.
  • A panoramic image covering a 360-degree field-of-view horizontally and a 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection (ERP).
  • the horizontal coordinate may be considered equivalent to a longitude
  • the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied.
  • the process of forming a monoscopic equirectangular panorama picture is illustrated in Fig.1c.
  • a set of input images 111 such as fisheye images of a camera array or a camera device with multiple lenses and sensors, is stitched 112 onto a spherical image 113.
  • the spherical image is further projected 114 onto a cylinder 115 (without the top and bottom faces).
  • the cylinder is unfolded 116 to form a two-dimensional projected picture 117.
  • one or more of the presented steps may be merged; for example, the input images may be directly projected onto a cylinder without an intermediate projection onto a sphere.
  • the projection structure for equirectangular panorama may be considered to be a cylinder that comprises a single surface.
  • 360-degree content can be mapped onto different types of solid geometrical structures, such as polyhedron (i.e.
  • panoramic content with 360-degree horizontal field-of-view but with less than 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane.
  • a panoramic image may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of the equirectangular projection format.
  • a coordinate system may be defined through orthogonal coordinate axes, such as X (lateral), Y (vertical, pointing upwards), and Z (back-to-front axis, pointing outwards). Rotations around the axes may be defined and may be referred to as yaw, pitch, and roll. Yaw may be defined to rotate around the Y axis, pitch around the X axis, and roll around the Z axis.
  • Rotations may be defined to be extrinsic, i.e., around the X, Y, and Z fixed reference axes.
  • the angles may be defined to increase clockwise when looking from the origin towards the positive end of an axis.
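The axis and angle conventions above can be illustrated with a small Python sketch. An angle that increases clockwise when looking from the origin towards the positive end of an axis corresponds to the usual right-handed rotation matrices. The composition order (roll first, then pitch, then yaw, all about the fixed axes) and the helper names are assumptions of this sketch, since the text does not state an order.

```python
import math


def rot_x(deg):
    """Pitch: rotation about the fixed X (lateral) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(deg):
    """Yaw: rotation about the fixed Y (vertical) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def rot_z(deg):
    """Roll: rotation about the fixed Z (back-to-front) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m, v):
    """Apply a 3x3 matrix to a 3-component vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def rotation(yaw, pitch, roll):
    """Extrinsic rotation: roll, then pitch, then yaw (assumed order)."""
    return matmul(rot_y(yaw), matmul(rot_x(pitch), rot_z(roll)))


# A 90-degree yaw turns the outward-pointing Z axis onto the X axis.
turned = apply(rotation(90.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

Because the rotations are extrinsic, the matrix of the rotation applied first appears rightmost in the product.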
  • the coordinate system specified can be used for defining the sphere coordinates, which may be referred to as azimuth (φ) and elevation (θ).
  • Global coordinate axes may be defined as coordinate axes, e.g. according to the coordinate system as discussed above, that are associated with audio, video, and images representing the same acquisition position and intended to be rendered together.
  • the origin of the global coordinate axes is usually the same as the center point of a device or rig used for omnidirectional audio/video acquisition as well as the position of the observer's head in the three-dimensional space in which the audio and video tracks are located.
  • the playback may be recommended to be started using the orientation (0, 0) in (azimuth, elevation) relative to the global coordinate axes.
  • the projection structure may be rotated relative to the global coordinate axes. The rotation may be performed for example to achieve better compression performance based on the spatial and temporal activity of the content at certain spherical parts. Alternatively or additionally, the rotation may be performed to adjust the rendering orientation for already encoded content.
  • the horizon of the encoded content may be adjusted afterwards by indicating that the projection structure is rotated relative to the global coordinate axes.
  • the projection orientation may be indicated as yaw, pitch, and roll angles that define the orientation of the projection structure relative to the global coordinate axes.
  • the projection orientation may be included e.g. in a box in a sample entry of an ISOBMFF track for omnidirectional video.
  • In the cube map projection format, spherical video is projected onto the six faces (a.k.a. sides) of a cube.
  • the cube map may be generated e.g. by first rendering the spherical scene six times from a viewpoint, with the views defined by a 90 degree view frustum representing each cube face.
  • the cube sides may be frame-packed into the same frame or each cube side may be treated individually (e.g. in encoding).
  • a cube map can be stereoscopic.
  • a stereoscopic cube map can e.g. be reached by re-projecting each view of a stereoscopic panorama to the cube map format.
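A short Python sketch of how a viewing direction can be assigned to a cube face. With a 90-degree view frustum per face, a direction belongs to the face whose axis has the largest absolute component. The face labels and the tie-breaking order are assumptions of this sketch, and the direction vector is assumed to be non-zero.

```python
def cube_face(x, y, z):
    """Return the label of the cube face hit by direction (x, y, z).

    The face is chosen by the dominant axis of the direction; ties are
    resolved in the order X, Y, Z (an arbitrary choice for this sketch).
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+X" if x > 0 else "-X"
    if ay >= az:
        return "+Y" if y > 0 else "-Y"
    return "+Z" if z > 0 else "-Z"


face = cube_face(0.2, -0.9, 0.1)
```

Rendering the scene six times, once per face label, would produce the six pictures that are then frame-packed or coded individually.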
  • Region-wise packing information may be encoded as metadata in or along the bitstream.
  • the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g. from a projected picture to a packed picture, as described earlier.
  • Rectangular region-wise packing metadata may be described as follows: [0068] For each region, the metadata defines a rectangle in a projected picture, the respective rectangle in the packed picture, and an optional transformation of rotation by 90, 180, or 270 degrees and/or horizontal and/or vertical mirroring. Rectangles may, for example, be indicated by the locations of the top-left corner and the bottom-right corner.
  • the mapping may comprise resampling. As the sizes of the respective rectangles can differ in the projected and packed pictures, the mechanism infers region-wise resampling.
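The rectangular region-wise packing metadata can be sketched in Python as a pair of rectangles plus an optional transform. This sketch covers only scaling (resampling) and horizontal mirroring; rotation is left out for brevity. The class names, the exclusive right and bottom edges, and the pixel-centre convention are assumptions, not the signalled syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


@dataclass(frozen=True)
class Region:
    """One packed region and the projected-picture area it came from."""
    projected: Rect
    packed: Rect
    mirror_horizontal: bool = False


def packed_to_projected(region, px, py):
    """Map a sample position in the packed picture to the matching
    (possibly fractional) position in the projected picture."""
    pk, pr = region.packed, region.projected
    # Normalised position of the sample centre inside the packed rectangle.
    u = (px - pk.left + 0.5) / pk.width
    v = (py - pk.top + 0.5) / pk.height
    if region.mirror_horizontal:
        u = 1.0 - u
    # Differing rectangle sizes imply region-wise resampling.
    return pr.left + u * pr.width - 0.5, pr.top + v * pr.height - 0.5


# A 400-sample-wide band packed into 200 samples (2:1 horizontal down-sampling).
band = Region(projected=Rect(0, 0, 400, 100), packed=Rect(0, 0, 200, 100))
source = packed_to_projected(band, 10, 20)
```

The fractional result indicates where the renderer would need to interpolate in the projected picture, which is the resampling implied by the differing rectangle sizes.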
  • region-wise packing provides signalling for the following usage scenarios:
    1) Additional compression for viewport-independent projections is achieved by densifying sampling of different regions to achieve more uniformity across the sphere. For example, the top and bottom parts of ERP are oversampled, and region-wise packing can be applied to down-sample them horizontally.
    2) Arranging the faces of plane-based projection formats, such as cube map projection, in an adaptive manner.
    3) Generating viewport-dependent bitstreams that use viewport-independent projection formats. For example, regions of ERP or faces of CMP can have different sampling densities and the underlying projection structure can have different orientations.
    4) Indicating regions of the packed pictures represented by an extractor track.
  • a guard band may be defined as an area in a packed picture that is not rendered but may be used to improve the rendered part of the packed picture to avoid or mitigate visual artifacts such as seams.
  • the OMAF allows the omission of image stitching, projection, and region-wise packing, so that the image/video data is encoded in its captured format. In this case, images (D) are considered the same as images (Bi) and a limited number of fisheye images per time instance are encoded.
  • the stitching process is not needed, since the captured signals are inherently immersive and omnidirectional.
  • the stitched images (D) are encoded 206 as coded images (Ei) or a coded video bitstream (Ev).
  • the captured audio (Ba) is encoded 122 as an audio bitstream (Ea).
  • the coded images, video, and/or audio are then composed 124 into a media file for file playback (F) or a sequence of an initialization segment and media segments for streaming (Fs), according to a particular media container file format.
  • the media container file format is the ISO base media file format.
  • the file encapsulator 124 also includes metadata into the file or the segments, such as projection and region-wise packing information assisting in rendering the decoded packed pictures.
  • the metadata in the file may include: - the projection format of the projected picture, - fisheye video parameters, - the area of the spherical surface covered by the packed picture, - the orientation of the projection structure corresponding to the projected picture relative to the global coordinate axes, - region-wise packing information, and - region-wise quality ranking (optional).
  • Region-wise packing information may be encoded as metadata in or along the bitstream, for example as region-wise packing SEI message(s) and/or as region-wise packing boxes in a file containing the bitstream.
  • the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g.
  • the region-wise mapping information may for example comprise for each mapped region a source rectangle (a.k.a. projected region) in the projected picture and a destination rectangle (a.k.a. packed region) in the packed picture, where samples within the source rectangle are mapped to the destination rectangle and rectangles may for example be indicated by the locations of the top-left corner and the bottom-right corner.
  • the mapping may comprise resampling.
  • the packing information may comprise one or more of the following: the orientation of the three-dimensional projection structure relative to a coordinate system, indication which projection format is used, region-wise quality ranking indicating the picture quality ranking between regions and/or first and second spatial region sequences, one or more transformation operations, such as rotation by 90, 180, or 270 degrees, horizontal mirroring, and vertical mirroring.
  • the semantics of packing information may be specified in a manner that they are indicative for each sample location within packed regions of a decoded picture which is the respective spherical coordinate location.
  • the segments (Fs) may be delivered 125 using a delivery mechanism to a player.
  • the file that the file encapsulator outputs (F) is identical to the file that the file decapsulator inputs (F').
  • a file decapsulator 126 processes the file (F') or the received segments (F's) and extracts the coded bitstreams (E'a, E'v, and/or E'i) and parses the metadata.
  • the audio, video, and/or images are then decoded 128 into decoded signals (B'a for audio, and D' for images/video).
  • the decoded packed pictures (D') are projected 129 onto the screen of a head-mounted display or any other display device 130 based on the current viewing orientation or viewport and the projection, spherical coverage, projection structure orientation, and region-wise packing metadata parsed from the file.
  • decoded audio (B'a) is rendered 129, e.g. through headphones 131, according to the current viewing orientation.
  • the current viewing orientation is determined by the head tracking and possibly also eye tracking functionality 127.
  • the renderer 129 may also be used by the video and audio decoders 128 for decoding optimization.
  • the process described above is applicable to both live and on-demand use cases.
  • a video rendered by an application on a HMD or on another display device renders a portion of the 360-degree video. This portion may be defined as a viewport.
  • a viewport may be understood as a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. According to another definition, a viewport may be defined as a part of the spherical video that is currently displayed. A viewport may be characterized by horizontal and vertical fields of view (FOV or FoV).
  • a viewpoint may be defined as the point or space from which the user views the scene; it usually corresponds to a camera position. Slight head motion does not imply a different viewpoint.
  • a viewing position may be defined as the position within a viewing space from which the user views the scene.
  • a viewing space may be defined as a 3D space of viewing positions within which rendering of image and video is enabled and VR experience is valid.
  • An omnidirectional image may be divided into several regions called tiles.
  • the tiles may have been encoded as motion constrained tiles with different quality/resolution.
  • a client apparatus may request the regions/tiles corresponding to a current viewport of the user with high resolution/quality.
  • the term omnidirectional may refer to media content that has greater spatial extent than a field-of-view of a device rendering the content.
  • Omnidirectional content may for example cover substantially 360 degrees in horizontal dimension and substantially 180 degrees in vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in horizontal direction and/or 180 degree view in vertical direction.
  • the client e.g.
  • the player may request the whole 360-degree video/image either with uniform quality, which means a viewport independent delivery, or such that the quality of the video/image in a viewport of the user is higher than the quality of the video/image in the non-viewport part of the scene, which means a viewport dependent delivery.
  • the (requested) 360-degree video may be encoded at different bitrates. Each encoded bitstream may be stored with, for example, ISOBMFF and then segmented based on MPEG-DASH.
  • the whole 360-degree video may be delivered to the client/player uniformly at the same quality.
  • the (requested) 360-degree video may be divided into several regions/tiles and encoded as, for example, motion constrained tiles.
  • Each encoded tiled bitstream may be stored with, for example, ISOBMFF and then segmented based on MPEG-DASH.
  • the regions/tiles corresponding to the user's viewport may be delivered at high quality/resolution, whereas other parts of 360-degree video which are not within the user's viewport may be delivered at a lower quality/resolution.
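  • The viewport dependent delivery described above can be illustrated with a short Python sketch in which a client decides, per tile, whether to request the high or the low quality version. The function names, the representation of tiles as yaw ranges in degrees and the overlap test are assumptions for illustration; only the horizontal direction is considered.

```python
def angular_distance(a, b):
    """Smallest absolute difference between two yaw angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_tile_qualities(tiles, viewport_yaw, viewport_fov):
    """Return 'high' for tiles overlapping the viewport and 'low' otherwise.

    tiles maps a tile id to a (start_yaw, end_yaw) range with start < end.
    """
    qualities = {}
    for tile_id, (start, end) in tiles.items():
        centre = (start + end) / 2.0
        width = end - start
        # Two angular intervals overlap when their centres are closer
        # than the sum of their half-widths.
        overlaps = angular_distance(centre, viewport_yaw) < (width + viewport_fov) / 2.0
        qualities[tile_id] = "high" if overlaps else "low"
    return qualities
```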
  • a tile track may be defined as a track that contains sequences of one or more motion-constrained tile sets of a coded bitstream.
  • Decoding of a tile track without the other tile tracks of the bitstream may require a specialized decoder, which may be e.g. required to skip absent tiles in the decoding process.
  • An HEVC tile track specified in ISO/IEC 14496-15 enables storage of one or more temporal motion-constrained tile sets as a track.
  • the sample entry type 'hvt1' is used.
  • the sample entry type 'lht1' is used.
  • a sample of a tile track consists of one or more complete tiles in one or more complete slice segments.
  • a tile track is independent from any other tile track that includes VCL NAL units of the same layer as this tile track.
  • a tile track has a 'tbas' track reference to a tile base track.
  • the tile base track does not include VCL NAL units.
  • a tile base track indicates the tile ordering using a 'sabt' track reference to the tile tracks.
  • An HEVC coded picture corresponding to a sample in the tile base track can be reconstructed by collecting the coded data from the tile-aligned samples of the tracks indicated by the 'sabt' track reference in the order of the track references.
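  • As a simplified sketch of the reconstruction described above, the following Python collects the tile-aligned samples of the tile tracks in the order of the 'sabt' track references. Tracks are modelled as plain lists of byte strings and the function name is hypothetical; an actual file reader would additionally handle NAL unit length fields and sample timing.

```python
def reconstruct_sample(base_sample, sabt_refs, tile_tracks, index):
    """Rebuild the coded picture for sample number `index`.

    base_sample: bytes of the tile base track sample (no VCL NAL units).
    sabt_refs:   tile track ids in the order given by the 'sabt' reference.
    tile_tracks: dict mapping a track id to its list of samples (bytes).
    """
    parts = [base_sample]
    for track_id in sabt_refs:
        # Take the time-aligned sample of each referenced tile track.
        parts.append(tile_tracks[track_id][index])
    return b"".join(parts)
```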
  • a constructed tile set track is a tile set track, e.g. a track according to ISOBMFF, containing constructors that, when executed, result in a tile set bitstream.
  • a constructor is a set of instructions that, when executed, results in a valid piece of sample data according to the underlying sample format.
  • An extractor is a constructor that, when executed, copies the sample data of an indicated byte range of an indicated sample of an indicated track. Inclusion by reference may be defined as an extractor or alike that, when executed, copies the sample data of an indicated byte range of an indicated sample of an indicated track.
  • bitstream ⁇ is a tile set ⁇ track
  • optionB ⁇ illustrates alternatives, i.e.
  • a full-picture-compliant tile set track can be played as with any full-picture track using the parsing and decoding process of full-picture tracks.
  • a full-picture-compliant bitstream can be decoded as with any full-picture bitstream using the decoding process of full-picture bitstreams.
  • a full-picture track is a track representing an original bitstream (including all its tiles).
  • a tile set bitstream is a bitstream that contains a tile set of an original bitstream but not representing the entire original bitstream.
  • a tile set track is a track representing a tile set of an original bitstream but not representing the entire original bitstream.
  • a full-picture-compliant tile set track may comprise extractors as defined for HEVC.
  • An extractor may, for example, be an in-line constructor including a slice segment header and a sample constructor extracting coded video data for a tile set from a referenced full-picture track.
  • An in-line constructor is a constructor that, when executed, returns the sample data that it contains.
  • an in-line constructor may comprise a set of instructions for rewriting a new slice header. The phrase in-line may be used to indicate coded data that is included in the sample of a track.
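  • The constructor concept can be sketched in Python as follows. The two classes and the `execute` function are hypothetical names used for illustration: an in-line constructor returns the data it contains, while a sample constructor copies an indicated byte range of an indicated sample of an indicated track.

```python
from dataclasses import dataclass


@dataclass
class InlineConstructor:
    data: bytes  # e.g. a rewritten slice segment header carried in-line


@dataclass
class SampleConstructor:
    track_id: int      # indicated track
    sample_index: int  # indicated sample of that track
    offset: int        # start of the indicated byte range
    length: int        # length of the indicated byte range


def execute(constructors, tracks):
    """Execute constructors in order and return the resulting sample data."""
    out = bytearray()
    for c in constructors:
        if isinstance(c, InlineConstructor):
            out += c.data
        else:
            sample = tracks[c.track_id][c.sample_index]
            out += sample[c.offset:c.offset + c.length]
    return bytes(out)
```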
  • a NAL-unit-like structure refers to a structure with the properties of a NAL unit except that start code emulation prevention is not performed.
  • a pre-constructed tile set track is a tile set track containing the sample data in-line.
  • Video codec may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • a video encoder and/or a video decoder may also be separate from each other, i.e. need not form a codec. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
  • a video encoder may be used to encode an image sequence, as defined subsequently, and a video decoder may be used to decode a coded image sequence.
  • a video encoder or an intra coding part of a video encoder or an image encoder may be used to encode an image, and a video decoder or an intra decoding part of a video decoder or an image decoder may be used to decode a coded image.
  • indicating along the bitstream may be defined to refer to out-of-band transmission, signalling, or storage in a manner that the out-of-band data is associated with the bitstream.
  • the phrase decoding along the bitstream or alike may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signalling, or storage) that is associated with the bitstream.
  • an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream.
  • the term 'circuitry' may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • the term 'circuitry' also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portions of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • the term 'circuitry' also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device or other computing or network device.
  • a "computer-readable transmission medium", which refers to an electromagnetic signal.
  • certain example embodiments generally relate to encoding of volumetric video for compression and a definition of metadata structures and compression methods for individual volumetric video components. For example, a character or scene captured with a set of depth cameras, or a synthetically modelled and animated 3D scene are examples of 3D content that can be encoded as volumetric video.
  • Approaches for volumetric video compression often include segmenting the 3D content into a set of 2D patches containing attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like, which can then be compressed using a standard 2D video compression format.
  • attributes e.g., color and geometry data
  • Volumetric video compression is currently being explored and standardized as part of the MPEG-I Point Cloud Compression (PCC) and 3DoF+ efforts, for example.
  • Some current V-PCC specifications define similar metadata structures while setting limits to patch packing strategies by defining shared patch layouts for all components (attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like) of volumetric video.
  • some structures for a metadata format can enable application of different atlas packing methods for different components of 3D video, thus resulting in significantly smaller atlas sizes and overall bitrates.
  • associated methods can be applied for individual volumetric video components.
  • V-PCC standardization in MPEG defines many structures that an example embodiment of the disclosed approaches can leverage.
  • patch packing is a bin packing problem, and many optimizations exist for coming up with an optimal patch layout (e.g., sprite texture packing).
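  • As an illustration of treating patch packing as a bin packing problem, the following Python sketch implements a simple shelf heuristic: patches are sorted by decreasing height and placed left to right on horizontal shelves. This is one well-known heuristic chosen for brevity, not a method prescribed by the description, and it does not guarantee an optimal layout; the function name and the (width, height) patch representation are assumptions.

```python
def shelf_pack(patches, atlas_width):
    """Place patches (width, height) into an atlas of fixed width.

    Returns (positions, atlas_height), where positions[i] is the (x, y)
    of the top-left corner of patches[i].
    """
    # Tallest patches first, so that each shelf wastes little vertical space.
    order = sorted(range(len(patches)), key=lambda i: -patches[i][1])
    positions = [None] * len(patches)
    x = y = shelf_height = 0
    for i in order:
        w, h = patches[i]
        if w > atlas_width:
            raise ValueError("patch is wider than the atlas")
        if x + w > atlas_width:
            # The current shelf is full: open a new shelf below it.
            y += shelf_height
            x = shelf_height = 0
        positions[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return positions, y + shelf_height
```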
  • a pipeline for 3DoF+ delivery can leverage a level of temporal coherency, which allows for maintaining constant patch layouts for atlases over an entire group of pictures (GoP), typically an intra period. This approach may enable more efficient compression for individual volumetric video components while requiring less frequent metadata updates, among other benefits.
  • a method can include separating patch layouts of different types of volumetric video component (depth, texture, roughness, normals, etc.).
  • using the structures in the corresponding metadata as means for storing atlas, frame, or tile characteristic information allows for the selection of one or more packing strategies based on the characteristics of those components. Since any particular packing strategy may not always work for all components, for example when the packing strategy is only applicable for a single component, such an approach can pick a suitable, preferable, best, optimal or other such packing strategy for a particular volumetric video section, frame, atlas, and/or tile individually. In some embodiments, such an approach can yield a single set of metadata for a plurality of frames while the video atlas contains per-frame data.
  • a volumetric video preparation and packing approach can take advantage of different metadata for different tile characteristics (e.g., color tiles, geometry tiles, etc.), and can employ different packing strategies for different characteristics depending on the content.
  • the V-PCC specification is quite flexible, and as such its encapsulation in ISOBMFF can be done in several ways, e.g. using single-track containers or multi-track containers.
  • a multi-track ISOBMFF V-PCC container as shown in Fig. 6 is envisioned, where V-PCC units in a V-PCC elementary stream are mapped to individual tracks within the container file based on their types.
  • V-PCC track is a track carrying the volumetric visual information in the V-PCC bitstream, which includes atlas sub-bitstream (the patch information, sequence parameter sets, SEI messages).
  • V-PCC component tracks are restricted video scheme tracks which carry 2D video encoded data for the occupancy map, geometry, and attribute sub-bitstreams of the V-PCC bitstream.
  • Tracks belonging to the same V-PCC sequence are time-aligned. Samples that contribute to the same point cloud frame across the different video-encoded V-PCC component tracks and the V-PCC track shall have the same presentation time.
  • V-PCC patch parameter sets used for such samples shall have a decoding time equal or prior to the composition time of the point cloud frame.
  • all tracks belonging to the same V-PCC sequence shall have the same implied or explicit edit lists.
  • compression methods can be applied that only work for a first attribute, such as a depth component, without adversely affecting a second attribute, such as a color quality.
  • compression methods can be applied that only work for the color quality without adversely affecting the depth component, and/or for other attributes.
  • volumetric video compression can be carried out, generally, in a compression pipeline.
  • individual packing strategies can be applied for different components of the volumetric video in the context of that pipeline.
  • at least some of the tiles comprising an image or the image itself can be packed by way of a first approach into a video stream while the metadata corresponding to the tiles or the image is packed via a second approach into a metadata stream.
  • a group of pictures can be split into frames and each frame can be subdivided into tiles based on an attribute such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like.
  • a portion of the tiles of a frame can be considered static tiles in that the tile characteristic remains unchanged or changes only within a predetermined range or variance between frames within the GoP.
  • tiles that are static between a plurality of the frames within the GoP can be stored as a single instance of the tile and the associated frames to which the single instance corresponds can be stored in the metadata stream.
  • Such approaches may lead to a reduction in computational complexity of encoding/decoding, decreased transmission bandwidth, and decreased storage requirements when deploying the decoded volumetric video for viewing.
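  • The storage of static tiles as a single instance can be sketched in Python as follows. Frames are modelled as dictionaries from a tile id to the tile payload, and a tile is treated as static only when its payload is byte-identical between frames; this is a simplifying assumption, since the description also allows changes within a predetermined range. The function name and the tuple layout of the metadata are hypothetical.

```python
def pack_static_tiles(gop):
    """Store each distinct tile once and reference it from per-frame metadata.

    gop: list of frames, each a dict mapping tile id -> payload (bytes).
    Returns (video_stream, metadata) where metadata holds tuples of
    (frame number, tile id, index into video_stream).
    """
    video_stream = []
    index_of = {}
    metadata = []
    for frame_no, frame in enumerate(gop):
        for tile_id, payload in sorted(frame.items()):
            key = (tile_id, payload)
            if key not in index_of:
                # First occurrence: store the tile payload once.
                index_of[key] = len(video_stream)
                video_stream.append(payload)
            # Every frame records which stored instance it uses.
            metadata.append((frame_no, tile_id, index_of[key]))
    return video_stream, metadata
```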
  • particular metadata format structures can be used to support the particular packing methods described herein.
  • In the following, some background information related to visual volumetric video-based coding (3VC) is provided.
  • At the highest level, 3VC metadata is carried in vpcc_units, which consist of header and payload pairs.
  • Table 1 General V-PCC unit syntax
  • Table 2 V-PCC unit header syntax
  • Table 3 V-PCC unit payload syntax
  • 3VC metadata is contained in atlas_sub_bitstream(), which may contain a sequence of NAL units including header and payload data.
  • nal_unit_header() is used to define how to process the payload data.
  • NumBytesInNalUnit specifies the size of the NAL unit in bytes. This value is required for decoding of the NAL unit.
  • Some form of demarcation of NAL unit boundaries is necessary to enable inference of NumBytesInNalUnit.
  • One such demarcation method is specified in Annex C (23090-5) for the sample stream format.
  • 3VC atlas coding layer (ACL) is specified to efficiently represent the content of the patch data.
  • the NAL is specified to format that data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes.
  • a NAL unit specifies a generic format for use in both packet-oriented and bitstream systems.
  • nal_unit_header() syntax: nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 73 of 23090-5.
  • nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies.
  • nal_layer_id shall be in the range of 0 to 62, inclusive.
  • the value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of the current version of 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.
  • rbsp_byte[ i ] is the i-th byte of an RBSP.
  • An RBSP is specified as an ordered sequence of bytes as follows: The RBSP contains a string of data bits (SODB) as follows:
    o If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
    o Otherwise, the RBSP contains the SODB as follows:
      - the first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
      - the rbsp_trailing_bits( ) syntax structure is present after the SODB as follows: the first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any), and the next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
  • the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0.
  • the data necessary for the decoding process is contained in the SODB part of the RBSP.
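  • The extraction of the SODB from the RBSP described above can be sketched in Python as follows. The function name and the representation of the SODB as a string of '0' and '1' characters are choices made for readability in this illustration, not part of any specification.

```python
def extract_sodb_bits(rbsp: bytes) -> str:
    """Return the SODB of an RBSP as a string of '0'/'1' characters."""
    if not rbsp:
        return ""  # an empty RBSP carries an empty SODB
    # Concatenate the bits of all RBSP bytes, most significant bit first.
    bits = "".join(f"{byte:08b}" for byte in rbsp)
    # rbsp_stop_one_bit is the last bit equal to 1; any bits after it are 0.
    stop = bits.rfind("1")
    if stop < 0:
        raise ValueError("no rbsp_stop_one_bit found")
    # Discard the stop bit and the zero bits that follow it.
    return bits[:stop]
```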
  • Tables 6 to 10 describe some of the most relevant RBSP syntaxes.
  • Table 6 Atlas tile group layer RBSP syntax
  • Table 7 Atlas tile group header syntax
  • Table 8 General atlas tile group data unit syntax
  • Table 9 Patch information data syntax
  • Table 10 Patch data unit syntax
  • Annex F of the 3VC V-PCC specification (23090-5) describes different SEI messages that have been defined for 3VC MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. 3VC SEI messages are signalled in sei_rbsp(), which is documented in Table 11 below.
  • Table 11: Patch data unit syntax
  • Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.
  • When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted.
  • Essential SEI messages are an integral part of the V-PCC bitstream and should not be removed from the bitstream.
  • the essential SEI messages are categorized into two types: Type-A essential SEI messages and Type-B essential SEI messages.
  • Type-A essential SEI messages contain information required to check bitstream conformance and for output timing decoder conformance. Every V-PCC decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.
  • V-PCC decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.
  • a new 3VC SEI message may be used to signal separate atlas settings for different encoded video components. Such a new SEI message may require reserving a new id in 3VC specification.
  • the SEI message may be signalled in an atlas_sub_bitstream() which contains NAL units that describe atlas related metadata.
  • the SEI message may be inserted in the bitstream to signal that specific NAL units should only be applied on a specific component of the volumetric video stream.
  • the signalling may be done flexibly before or after any relevant NAL units using SEI prefix and suffix functionality as described in 23090-5.
  • This design introduces minimal changes to 3VC bitstream and focuses on reusing syntax elements which have been designed to describe shared patch layout.
  • the decoder should be able to interpret this SEI message, if it is present in the bitstream, because rendering of 3VC content may fail otherwise as incorrect patch data structure is expected.
  • the parser will parse atlas_sub_bitstream() as usual and process NAL units. Only when it encounters the new SEI message will it apply the following or preceding NAL unit(s) only to a specific component or attribute of the atlas.
  • the SEI message may be used to apply to atlas_sequence_parameter_set_rbsp( ), atlas_frame_parameter_set_rbsp() or other NAL unit level parameter to enable signalling of different size video components or other atlas sequence and frame level settings.
  • the SEI message may be applied to other NAL units like atlas_tile_group_layer_rbsp() to signal different atlas patch layouts per video encoded component or attribute id.
  • the SEI message may be applied to other SEI messages found in atlas_sub_bitstream() to signal application of other SEI messages per video encoded component or attribute id.
  • the SEI message may be used for any other type of NAL units found in atlas_sub_bitstream() to signal different values per video encoded component or attribute type.
  • the presence of this new SEI message allows signalling shared patch layouts by default.
  • NAL units for shared components may be signalled simply by not using the new SEI message.
  • these settings are applied to all video encoded components, and only if the new SEI message is encountered are the settings applied to the specific component indicated by the SEI message itself.
  • Structurally, the new SEI message may be described by the following Table 12, in accordance with an embodiment:
  • the attribute component_type signals which atlas component the following or preceding NAL units should be applied to. Only values greater than 1 may be considered valid.
  • the attribute attribute_index should only be processed if the attribute component_type equals 4. This attribute allows signalling of different patch layouts for different attribute types.
  • component_type value may be used to indicate that the attribute_index should be processed.
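  • The parsing behaviour described above can be sketched in Python under simplifying assumptions. NAL units are modelled as dictionaries, the SEI message is assumed to act as a prefix that scopes exactly the next NAL unit, and the key names and the `SHARED` marker are hypothetical; suffix signalling and scopes covering several NAL units are not modelled.

```python
SHARED = "shared"  # hypothetical marker: the NAL unit applies to all components


def assign_nal_units(units):
    """Return (NAL unit name, target) pairs in bitstream order.

    A unit is either {'kind': 'sei', 'component_type': int,
    'attribute_index': int (optional)} or {'kind': 'nal', 'name': str}.
    """
    assignments = []
    target = SHARED
    for unit in units:
        if unit["kind"] == "sei":
            # The SEI message names the component the next NAL unit applies to.
            target = (unit["component_type"], unit.get("attribute_index"))
        else:
            assignments.append((unit["name"], target))
            # Assumption of this sketch: the scope ends after one NAL unit.
            target = SHARED
    return assignments
```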
  • the signalling does not need to change and all metadata may be stored. The design will maintain compatibility with a design where atlas_sub_bitstream() is stored in a single track per atlas.
  • the separate_atlas_component() SEI message may be stored as sample auxiliary information in the ISOBMFF storage structure and be associated with the samples that are relevant to the applied atlas component. Such an approach may assure that the SEI message is distributed with the samples and such SEI messages may be inserted at the beginning of the relevant sample media data, before being fed to the V-PCC decoder.
  • the component type and attribute id may be included in a NAL unit header.
  • One benefit of signalling component type and attribute id inside NAL unit header is that it maintains ability to store layout data of different components inside the same atlas_sub_bitstream() which may be stored in a single ISOBMFF track. From storage point of view the encapsulation of said track does not have to change.
  • Processing of atlas_sub_bitstream() will follow principles as explained for SEI message above.
  • a default value for the component_type attribute in a NAL unit header may be used to signal that the NAL unit payload is applicable to all atlas components.
  • a default value may be used to signal shared atlas payloads, for example depth and occupancy components.
  • a NAL unit with component_type equal to 4 may additionally be signalled to provide a different patch layout for the texture component. Different settings for different attributes may be signalled using the attribute_id field in the NAL unit header.
  • the NAL unit header may have the following changes, highlighted with a grey background in Table 13 (as well as in the other succeeding Tables): Table 13
  • the attributes have the following specification:
  • nal_component_type shall signal the atlas component to which the NAL payloads should be applied. Only values greater than 1 may be considered valid.
  • signalling component_type with value 0 indicates that the NAL unit payload will be applicable to all components.
  • nal_attribute_index shall only be processed if the component_type equals 4 (or another predetermined value, as was mentioned above). This attribute allows signalling of different patch layouts for different attribute types.
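  • the semantics above can be sketched as a small predicate. This is only an illustration: the function name is hypothetical, and the numeric values for occupancy (2) and geometry (3) are assumptions borrowed from the V-PCC unit type numbering. Only the value 0 (all components) and the value 4 (attribute) are stated in the text.

```python
from typing import Optional

# Only 0 (all components) and 4 (attribute) come from the text.
ALL_COMPONENTS = 0
OCCUPANCY = 2  # assumption, not stated in the text
GEOMETRY = 3   # assumption, not stated in the text
ATTRIBUTE = 4


def payload_applies(nal_component_type: int, nal_attribute_index: int,
                    target_type: int,
                    target_attribute: Optional[int] = None) -> bool:
    """Return True if a NAL unit payload applies to the target atlas component."""
    if nal_component_type == ALL_COMPONENTS:
        return True  # default value: payload is shared by every component
    if nal_component_type != target_type:
        return False
    if nal_component_type == ATTRIBUTE:
        # nal_attribute_index is only examined for attribute components
        return nal_attribute_index == target_attribute
    return True
```

A reader of the atlas sub-bitstream could call such a predicate once per NAL unit to decide which component layouts the payload updates.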
  • signalling different component layouts using the syntax structure vpcc_unit_header() will be described.
  • component type and attribute id may be signalled in vpcc_unit_header, which allows storing separate metadata tracks for each video encoded component and attribute.
  • vpcc_unit_header() definition in the current 3VC specification does not support signalling different metadata for different component or attribute types.
  • the Table 12 below illustrates changes for vpcc_unit_header() structure, highlighted with grey background.
  • vuh_component_type shall signal the atlas component to which the unit payload should be applied. Only values greater than 1 may be considered valid. Signalling component_type with value 0 indicates that the vpcc unit payload will be applicable to all components.
  • nal_attribute_index shall only be processed if component_type equals 4. This attribute allows signalling of different patch layouts for different attribute types.
  • In the following, signalling mapping of atlas layer to video component or group of video components will be described, in accordance with an embodiment.
  • nal_layer_id of an atlas NAL unit should be in the range of 0 to 62, inclusive. However, it should be noted that another valid value range may also be defined for the nal_layer_id of an atlas NAL unit.
  • an atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id.
  • mapping between the video components and nuh_layer_id is provided in V-PCC parameter set.
  • Table 15 illustrates an example syntax for the signalling of mapping of atlas layer to video component or group of video components.
  • In Table 15, vps_atlas_count_minus1 is a syntax element defined in vpcc_parameter_set( ).
  • alm_occupancy_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy of atlas with index j.
  • alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry of atlas with index j.
  • ai_attribute_count is a syntax element defined in attribute_information().
  • alm_attribute_to_atlas_layer_id[ j ][ i ] indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute with index i of atlas with index j.
  • vps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported.
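  • the mapping described above can be pictured as a per-atlas lookup from video component to atlas layer id. The sketch below is a hypothetical in-memory model, not the bitstream syntax of Table 15: the class and method names are assumptions, and the 0 to 62 range check follows the range given for the layer id earlier in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AtlasLayerMapping:
    """Layer ids (nuh_layer_id) carrying patch information for one atlas."""
    occupancy_layer: int
    geometry_layer: int
    attribute_layers: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        layers = (self.occupancy_layer, self.geometry_layer, *self.attribute_layers)
        for layer in layers:
            if not 0 <= layer <= 62:
                raise ValueError(f"layer id out of range: {layer}")

    def layer_for(self, component: str, attribute_index: int = 0) -> int:
        """Return the atlas layer id that carries patch data for a component."""
        if component == "occupancy":
            return self.occupancy_layer
        if component == "geometry":
            return self.geometry_layer
        if component == "attribute":
            return self.attribute_layers[attribute_index]
        raise ValueError(f"unknown component: {component}")


# Occupancy and geometry share layer 0; two attributes use their own layers.
mapping = AtlasLayerMapping(occupancy_layer=0, geometry_layer=0,
                            attribute_layers=[1, 2])
```

Grouping several components onto one layer id, as in the example instance, models a shared patch layout for that group.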
  • atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id. The mapping between the video components and nuh_layer_id is provided in atlas sequence parameter set.
  • In Table 17, alm_occupancy_to_atlas_layer_id indicates the atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy video sub-bitstream.
  • alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry video sub-bitstream.
  • ai_attribute_count is syntax element defined in attribute_information() for each atlas.
  • alm_attribute_to_atlas_layer_id[ i ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute video sub-bitstream with index i of this atlas.
  • asps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported.
  • atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id.
  • the mapping between the video components and nuh_layer_id is provided in an atlas frame parameter set as is illustrated in the following Table 19, in accordance with an embodiment.
  • In Table 19, alm_occupancy_to_atlas_layer_id indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy video sub-bitstream.
  • alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry video sub-bitstream.
  • ai_attribute_count is a syntax element defined in attribute_information() for each atlas.
  • alm_attribute_to_atlas_layer_id[ i ] indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute video sub-bitstream with index i of this atlas.
  • afps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported.
  • the vpcc_header for each component or attribute may be stored in a sample entry of said track and sub_atlas_bitstream() of said units may be stored in samples of individual tracks.
  • atlas_layer_mapping() information may be signalled using a separate timed metadata track, where the samples of this timed metadata track are aligned with the related video tracks.
  • Such a timed media track may be associated with the relevant 3VC media tracks’ DASH representations via the @associationId and @associationType attributes.
  • in Fig.4, the basic compression of a volumetric video scene is illustrated as a video compression pipeline 100.
  • each frame of an input 3D scene 101 can be processed separately, and the resulting per-frame atlas and metadata are then stored into separate video and metadata streams, respectively.
  • the frames of the input 3D scene 101 can be processed using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • the input 3D scene 101 is converted, at Input Conversion 102 into a canonical representation for processing.
  • each frame of the input 3D scene 101 is converted at Input Conversion 102 into a collection of 3D samples of a scene geometry, at a specified internal processing resolution. Depending on the input 3D scene 101, this may involve, e.g., voxelizing a mesh model, or down-sampling a high resolution point cloud with very fine details into the processing resolution.
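  • as an illustration of down-sampling into a processing resolution, the following minimal sketch snaps points to a voxel grid and keeps one averaged sample per occupied voxel. The function name, the averaging rule and the `voxel_size` parameter are assumptions made for this example; the text does not prescribe a particular method.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def voxel_downsample(points: List[Point], voxel_size: float) -> List[Point]:
    """Keep one averaged sample per occupied voxel of edge length voxel_size."""
    voxels: Dict[Tuple[int, int, int], List[Point]] = {}
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels.setdefault(key, []).append((x, y, z))
    # Average the points of each voxel, coordinate by coordinate.
    return [tuple(sum(c) / len(pts) for c in zip(*pts))
            for pts in voxels.values()]
```

Choosing a larger `voxel_size` lowers the internal processing resolution and the number of 3D samples handed to the later pipeline stages.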
  • the internal representation resulting from the Input Conversion 102 is a point cloud representing some or all aspects of the 3D input scene 101.
  • the aspects of the 3D input scene 101 can include but are not limited to attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like of the 3D scene 101.
  • the input 3D scene 101 can be converted into, for example, a canonical representation using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • a View Optimizer 103 creates, from the internal point cloud format resulting from the Input Conversion 102, a segmentation of the 3D scene 101 optimized for a specified viewing constraint (e.g., the viewing volume).
  • the View Optimizer 103 process can involve creating view-tiles that have sufficient coverage and resolution for representing the original input 3D scene 101 while incurring a minimal quality degradation within the given viewing constraints.
  • the View Optimizer 103 can make use of at least a 3D position of points in the internal point cloud of the 3D scene 101.
  • additional attributes such as a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like may also or alternatively be considered.
  • the View Optimizer 103 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • View-tile Metadata 104 can be defined that describes each tile of the frame (e.g., tile geometry, material, color, depth, etc.).
  • the resulting view-tiles can then be pre-rendered in a View-tile Rendering 105.
  • View-tile Rendering 105 can include resampling the input point cloud into one or more 2D tile projections, and/or calling an external renderer, e.g. a path tracing renderer, to render views of the original input 3D scene 101.
  • the tiles can be defined, characterized, and/or converted to metadata using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • volumetric video component atlases depth, texture, roughness, normals, etc.
  • using the structures in the corresponding metadata as means for storing atlas, frame, or tile characteristic information allows for the selection of one or more packing strategies based on the characteristics of those components. Since any particular packing strategy may not always work for all components, for example when the packing strategy is only applicable for a single component, such an approach can pick a suitable, preferable, best, optimal or other such packing strategy for a particular volumetric video section, frame, atlas, and/or tile individually.
  • such an approach can yield a single set of metadata for a plurality of frames while the video atlas contains per-frame data.
  • a volumetric video preparation and packing approach can take advantage of different metadata for different tile characteristics (e.g., color tiles, geometry tiles, etc.), and can employ different packing strategies for different characteristics depending on the content.
  • the rendered tiles can then be input into an Atlas Packer 106.
  • the Atlas Packer 106 can produce an optimal 2D layout of the rendered view-tiles.
  • the Atlas Packer 106 can pack the pre-rendered tiles into video frames.
  • additional metadata may be required to unpack and re-render the packed tiles.
  • such additional metadata can be generated by the Atlas Packer 106.
  • the Atlas Packer 106 can carry out alternative or additional processing procedures such as down-sampling of certain tiles, re-fragmentation of tiles, padding, dilation and the like.
  • the Atlas Packer 106 can be configured to pack the scene into an atlas format that minimizes unused pixels.
  • the Atlas Packer 106 can provide guards for artifacts that might occur in a compression stage.
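  • the text does not specify a packing algorithm. As one possible illustration of packing pre-rendered tiles into a frame, the sketch below uses a simple shelf heuristic: tiles are sorted by height and placed left to right, and a new row (shelf) is opened when the atlas width is exhausted. The function name and the tile representation are assumptions; a production packer would typically also handle padding, guard bands and tile rotation.

```python
from typing import Dict, Tuple


def shelf_pack(tiles: Dict[str, Tuple[int, int]],
               atlas_width: int) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Place tiles (name -> (width, height)) on shelves.

    Returns the top-left position of every tile and the used atlas height.
    """
    positions: Dict[str, Tuple[int, int]] = {}
    x = y = shelf_height = 0
    # Tallest tiles first, so each shelf wastes little vertical space.
    for name, (w, h) in sorted(tiles.items(), key=lambda kv: -kv[1][1]):
        if w > atlas_width:
            raise ValueError(f"tile {name} is wider than the atlas")
        if x + w > atlas_width:  # current shelf is full, open a new one
            y += shelf_height
            x = 0
            shelf_height = 0
        positions[name] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return positions, y + shelf_height
```

The returned positions are the kind of per-tile layout information that would have to accompany the atlas as metadata so that the tiles can later be unpacked.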
  • the packed atlases can then be piped to Video Compression 107 to generate a final compressed representation of the 3D scene 101.
  • the final compressed representation of the 3D scene 101 can include compressed view-tiles 108 and corresponding view-tile metadata 104.
  • the Atlas Packer 106 and/or the Video Compression 107 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • the pipeline 100 can include processes for content delivery and view (e.g., real-time viewing).
  • the compressed video frames (compressed view-tiles 108) and the view-tile metadata 104 can be used for View Synthesis 109 of novel views of the 3D scene 101.
  • the view-tile metadata 104 can contain some or all of the necessary information for View Synthesis 109 (a view synthesizer) to employ any suitable rendering method or combination of rendering methods, such as point cloud rendering, mesh rendering, or ray-casting, to reconstruct a view of the scene from any given 3D viewpoint (assuming the originally specified viewing constraints).
  • the View Synthesis 109 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7).
  • the Atlas Packer 106 receives as an input at least a list of views and pre-rendered tiles representing those views for at least depth and color components.
  • the Atlas Packer 106 is not limited to color and/or depth components only.
  • volumetric video components such as a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like can be processed using the same or a similar approach.
  • the Atlas Packer 106 can process some or all components of the volumetric video in parallel, leveraging dependencies between them and outputting one or more individual atlases for each component packed with pre-rendered tiles.
  • while the Atlas Packer 106 may leverage dependencies between different volumetric video components, it may also apply compression methods to each component individually, resulting in different patch layouts and sizes for each component.
  • an apparatus can be configured to carry out some or all portions of any of the methods described herein.
  • the apparatus may be embodied by any of a wide variety of devices including, for example, a video codec.
  • a video codec includes an encoder that transforms input video into a compressed representation suited for storage and/or transmission and/or a decoder that can decompress the compressed video representation so as to result in a viewable form of a video.
  • the encoder discards some information from the original video sequence so as to represent the video in a more compact form, such as at a lower bit rate.
  • the apparatus may, instead, be embodied by any of a wide variety of computing devices including, for example, a video encoder, a video decoder, a computer workstation, a server or the like, or by any of various mobile computing devices, such as a mobile terminal, e.g., a smartphone, a tablet computer, a video game player, etc.
  • the apparatus may be embodied by an image capture system configured to capture the images that comprise the volumetric video data.
  • the apparatus 10 of an example embodiment is depicted in Fig.7 and includes, is associated with, or is otherwise in communication with processing circuitry 12, a memory 14 and a communication interface 16.
  • the processing circuitry may be in communication with the memory device via a bus for passing information among components of the apparatus.
  • the memory device may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories.
  • the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry).
  • the memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present disclosure.
  • the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry.
  • the apparatus 10 may, in some embodiments, be embodied in various computing devices as described above.
  • the apparatus may be embodied as a chip or chip set.
  • the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard).
  • the structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon.
  • the apparatus may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single “system on a chip.”
  • a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
  • the processing circuitry 12 may be embodied in a number of different ways.
  • the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
  • the processing circuitry may include one or more processing cores configured to perform independently.
  • a multi-core processing circuitry may enable multiprocessing within a single physical package.
  • the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
  • the processing circuitry 12 may be configured to execute instructions stored in the memory device 14 or otherwise accessible to the processing circuitry.
  • the processing circuitry may be configured to execute hard coded functionality.
  • the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present disclosure while configured accordingly.
  • the processing circuitry may be specifically configured hardware for conducting the operations described herein.
  • when the processing circuitry is embodied as an executor of instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed.
  • the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment of the present invention by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein.
  • the processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry.
  • the communication interface 16 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including visual content in the form of video or image files, one or more audio tracks or the like.
  • the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication.
  • the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.
  • An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture.
  • An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture.
  • a geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture.
  • Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format.
  • Terms texture image, texture picture and texture component picture may be used interchangeably.
  • Terms geometry image, geometry picture and geometry component picture may be used interchangeably.
  • a specific type of a geometry image is a depth image. Embodiments described in relation to a geometry image equally apply to a depth image, and embodiments described in relation to a depth image equally apply to a geometry image.
  • Terms attribute image, attribute picture and attribute component picture may be used interchangeably.
  • a geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding.
  • Figs.2a and 2b illustrate an overview of exemplified compression/decompression processes of Point Cloud Coding (PCC) according to the MPEG standard.
  • V-PCC MPEG Video-Based Point Cloud Coding
  • MPEG N18892, a.k.a. ISO/IEC JTC 1/SC 29/WG 11 “V-PCC Codec Description”
  • V-PCC Codec Description discloses a projection-based approach for dynamic point cloud compression.
  • V-PCC video-based point cloud compression
  • the patch generation process 202 decomposes the point cloud frame 200 by converting 3D samples to 2D samples on a given projection plane using a strategy that provides the best compression.
  • the patch generation process 202 aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error.
  • the geometry image generation 206 and the texture image generation 208 are configured to generate geometry images and texture images.
  • the image generation process exploits the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch is projected onto two images, referred to as layers.
  • Let H(u,v) be the set of points of the current patch that get projected to the same pixel (u, v).
  • the first layer also called the near layer, stores the point of H(u,v) with the lowest depth D0.
  • the second layer, referred to as the far layer, captures the point of H(u,v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
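  • the two-layer selection described above can be sketched for a single pixel as follows. The function name is hypothetical, and the depths of the points projected to the pixel are assumed to be available as a non-empty list of numbers.

```python
from typing import Sequence, Tuple


def near_far_depths(depths: Sequence[int],
                    surface_thickness: int) -> Tuple[int, int]:
    """Return (D0, D1) for the points of H(u, v) projected to one pixel.

    D0 is the lowest depth (near layer). D1 is the highest depth that
    still lies within [D0, D0 + surface_thickness] (far layer).
    """
    d0 = min(depths)
    d1 = max(d for d in depths if d <= d0 + surface_thickness)
    return d0, d1
```

For example, with depths 10, 12, 13 and 20 and a surface thickness of 4, the near layer stores 10 and the far layer stores 13, since 20 falls outside the interval. When only one point projects to the pixel, both layers hold the same depth.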
  • the generated videos have the following characteristics: geometry: WxH YUV420-8bit, where the geometry video is monochromatic, and texture: WxH YUV420-8bit, where the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
  • the geometry images and the texture images may be provided to image padding 212.
  • the image padding 212 may also receive as an input an occupancy map (OM) 210 to be used with the geometry images and texture images.
  • the padding process aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression.
  • V-PCC uses a simple padding strategy, which proceeds as follows:
    • Each block of TxT (e.g., 16x16) pixels is processed independently.
    • If the block is empty (i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order.
    • If the block is full (i.e., no empty pixels), nothing is done.
    • If the block has both empty and filled pixels (i.e. a so-called edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
  • The padded geometry images and padded texture images may be provided for video compression 214.
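  • the padding steps above can be sketched for a single-channel image as follows. Several details are assumptions made for this example and are not stated in the text: an empty block copies the last column of its left neighbour when one exists and otherwise the last row of the block above, edge blocks use 4-connected neighbours inside the same block, averages are rounded down to integers, and the image dimensions are multiples of T.

```python
from typing import List


def pad_image(img: List[List[int]], occ: List[List[bool]],
              T: int) -> List[List[int]]:
    """Return a padded copy of img; occ marks the occupied pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for by in range(0, h, T):
        for bx in range(0, w, T):
            cells = [(y, x) for y in range(by, by + T)
                     for x in range(bx, bx + T)]
            filled = {c: occ[c[0]][c[1]] for c in cells}
            count = sum(filled.values())
            if count == 0:
                # Empty block: replicate the border of an earlier block.
                for y, x in cells:
                    if bx > 0:
                        out[y][x] = out[y][bx - 1]
                    elif by > 0:
                        out[y][x] = out[by - 1][x]
            elif count < T * T:
                # Edge block: grow filled pixels into the empty ones.
                while not all(filled.values()):
                    updates = {}
                    for (y, x), ok in filled.items():
                        if ok:
                            continue
                        around = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        vals = [out[ny][nx] for ny, nx in around
                                if filled.get((ny, nx), False)]
                        if vals:
                            updates[(y, x)] = sum(vals) // len(vals)
                    for (y, x), value in updates.items():
                        out[y][x] = value
                        filled[(y, x)] = True
            # Full blocks are left untouched.
    return out
```

Each pass of the inner loop fills every empty pixel that touches an already filled pixel, so an edge block is completed after a bounded number of passes.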
  • the generated images/layers are stored as video frames and compressed using a video codec, such as High Efficiency Video Coding (HEVC) codec.
  • the video compression 214 also generates reconstructed geometry images to be provided for smoothing 216, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 202.
  • the smoothed geometry may be provided to texture image generation 208 to adapt the texture images.
  • in the auxiliary patch information compression 218, the following metadata is encoded/decoded for every patch:
    • Index of the projection plane
      o Index 0 for the normal planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
      o Index 1 for the normal planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
      o Index 2 for the normal planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0).
    • 2D bounding box (u0, v0, u1, v1)
    • 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bi-tangential shift r0.
  • mapping information providing for each TxT block its associated patch index is encoded as follows: For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes.
  • L is called the list of candidate patches.
  • the empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
  • Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
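  • the construction of the candidate list L and of the position J can be sketched as follows. The function names are hypothetical. The sketch assumes that bounding boxes are given in block units with inclusive corners, that patches are numbered from 1 in coding order, and that the special index 0 is appended at the end of every list; the text does not state where index 0 is placed. The arithmetic coding of J itself is not shown.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (u0, v0, u1, v1) in block units, inclusive


def candidate_list(block: Tuple[int, int], boxes: List[Box]) -> List[int]:
    """Ordered list L of patch indexes whose bounding box contains the block."""
    bx, by = block
    candidates = [index for index, (u0, v0, u1, v1) in enumerate(boxes, start=1)
                  if u0 <= bx <= u1 and v0 <= by <= v1]
    candidates.append(0)  # empty space, candidate for every block (assumed last)
    return candidates


def position_to_encode(block: Tuple[int, int], boxes: List[Box],
                       patch_index: int) -> int:
    """Position J of patch index I in the candidate list L of the block."""
    return candidate_list(block, boxes).index(patch_index)
```

Because most blocks are covered by only a few bounding boxes, J takes values from a much smaller range than I, which is what makes it cheaper to encode.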
  • the occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud.
  • one cell of the 2D grid produces a pixel during the image generation process.
  • when considering an occupancy map as an image, it may be considered to comprise occupancy patches.
  • Occupancy patches may be considered to have block-aligned edges according to the auxiliary information described in the previous section.
  • An occupancy patch hence comprises occupancy information for the corresponding texture and geometry patches.
  • the occupancy map compression leverages the auxiliary information described in previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0).
  • the remaining blocks are encoded as follows.
  • the occupancy map could be encoded with a precision of B0xB0 blocks.
  • the generated binary image covers only a single colour plane.
  • the list of candidates is sorted in the reverse order of the patches. For each block:
    o If the list of candidates has one index, then nothing is encoded.
    o Otherwise, the index of the patch in this list is arithmetically encoded.
  • the point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box.
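  • the text breaks off before giving the expressions for P. A minimal sketch, assuming the usual V-PCC formulation in which g(u, v) is the luma value of the geometry image at the pixel, expresses P by its depth, tangential shift and bi-tangential shift. The function name and the dictionary keys are hypothetical, and the final mapping of these three values onto x, y and z (which depends on the projection plane index) is not shown.

```python
from typing import Dict, Tuple


def reconstruct_point(u: int, v: int, g: int,
                      patch: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (depth, tangential shift, bi-tangential shift) of point P.

    patch holds the patch depth offset 'delta0', the shifts 's0' and 'r0',
    and the top-left corner ('u0', 'v0') of the 2D bounding box.
    """
    depth = patch["delta0"] + g                    # delta(u, v) = delta0 + g(u, v)
    tangential = patch["s0"] - patch["u0"] + u     # s(u, v) = s0 - u0 + u
    bitangential = patch["r0"] - patch["v0"] + v   # r(u, v) = r0 - v0 + v
    return depth, tangential, bitangential
```

In this form, the 2D bounding box corner converts atlas pixel coordinates into patch-local offsets, and the per-patch 3D location moves the result back into the coordinate frame of the point cloud.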
  • the smoothing procedure 216 aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors.
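  • the smoothing step can be sketched as follows, assuming that the indices of the boundary points are already known and that the number of neighbours k is a free parameter. Both are assumptions, as are the function name and the brute-force neighbour search, which is only suitable for small inputs. The moved point itself is excluded from its own neighbourhood.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def smooth_boundary(points: List[Point], boundary: Sequence[int],
                    k: int) -> List[Point]:
    """Move each boundary point to the centroid of its k nearest neighbours."""
    out = list(points)
    for i in boundary:
        p = points[i]
        others = [q for j, q in enumerate(points) if j != i]
        neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
        if not neighbours:
            continue  # nothing to average against
        out[i] = tuple(sum(c) / len(neighbours) for c in zip(*neighbours))
    return out
```

Distances are always measured on the original point positions, so the result does not depend on the order in which boundary points are processed.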
  • a multiplexer 220 may receive a compressed geometry video and a compressed texture video from the video compression 214, entropy compression 222, and optionally a compressed auxiliary patch information from auxiliary patch-info compression 218. The multiplexer 220 uses the received data to produce a compressed bitstream.
  • Figure 2b illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC).
  • a de-multiplexer 250 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 252. In addition, the de-multiplexer 250 transmits compressed occupancy map to occupancy map decompression 254.
  • the de-multiplexer 250 may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 256.
  • Decompressed geometry video from the video decompression 252 is delivered to geometry reconstruction 258, as are the decompressed occupancy map and decompressed auxiliary patch information.
  • the point cloud geometry reconstruction 258 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.
  • the reconstructed geometry image may be provided for smoothing 260, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors.
  • a V-PCC bitstream containing coded point cloud sequence (CPCS)
  • VPS V-PCC parameter set
  • a V-PCC bitstream can be stored in ISOBMFF container according to ISO/IEC 23090-10.
  • Single-track container is utilized in the case of simple ISOBMFF encapsulation of a V-PCC encoded bitstream. In this case, a V-PCC bitstream is directly stored as a single track without further processing.
  • Single-track should use sample entry type of 'vpe1' or 'vpeg'.
  • Under the 'vpe1' sample entry, all atlas parameter sets (as defined in ISO/IEC 23090-5) are stored in the setupUnit of sample entry. Under the 'vpeg' sample entry, the atlas parameter sets may be present in setupUnit array of sample entry, or in the elementary stream.
  • Multi-track container maps V-PCC units of a V-PCC elementary stream to individual tracks within the container file based on their types.
  • V-PCC track is a track carrying the volumetric visual information in the V-PCC bitstream, which includes the atlas sub-bitstream and the atlas sequence parameter sets.
  • V-PCC component tracks are restricted video scheme tracks which carry 2D video encoded data for the occupancy map, geometry, and attribute sub-bitstreams of the V-PCC bitstream. In a multi-track container, the V-PCC track should use a sample entry type of 'vpc1' or 'vpcg'.
  • V-PCC provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms. The projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene.
  • V-PCC provides for efficient delivery of a compressed 3D point cloud object which can be viewed with six degrees of freedom (6DoF).
  • the embodiments relating to the encoding aspects may be implemented in an apparatus comprising means for: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate atlases to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the bitstream, separate atlas settings for the two or more different encoded components.
  • the embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a presentation comprising volumetric video content; obtain two or more components of the volumetric video content; pack the two or more components of the volumetric video content into separate atlases to obtain two or more different encoded components of the volumetric video content; and signal, in or along the bitstream, separate atlas settings for the two or more different encoded components.
  • the embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for: receiving a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the bitstream, separate atlas settings for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; depacking the two or more components of the volumetric video content from separate atlases.
  • the embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receive, from or along the bitstream, separate atlas settings for the two or more different encoded components; decode, from the bitstream, the two or more components of a volumetric video content; depack the two or more components of the volumetric video content from separate atlases.
  • Such apparatuses may comprise e.g.
  • Fig.5 shows a flow chart for signaling overlay content according to an embodiment.
  • a presentation comprising volumetric video content is obtained.
  • two or more components of the volumetric video content are obtained.
  • the two or more components of the volumetric video content are packed 504 into separate atlases video encoded bitstreams to obtain two or more different encoded components of the volumetric video content.
  • separate atlas settings and patch information for the two or more different encoded components are signaled 506.
  • Fig.8 shows a schematic diagram of an example multimedia communication system within which various embodiments may be implemented.
  • FIG 8 is a graphical representation of an example multimedia communication system within which various embodiments may be implemented.
  • a data source 1510 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats.
  • An encoder 1520 may include or be connected with pre-processing, such as data format conversion and/or filtering of the source signal.
  • the encoder 1520 encodes the source signal into a coded media bitstream. It should be noted that a bitstream to be decoded may be received directly or indirectly from a remote device located within virtually any type of network. Additionally, the bitstream may be received from local hardware or software.
  • the encoder 1520 may be capable of encoding more than one media type, such as audio and video, or more than one encoder 1520 may be required to code different media types of the source signal.
  • the encoder 1520 may also get synthetically produced input, such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media.
  • only processing of one coded media bitstream of one media type is considered to simplify the description.
  • typically real-time broadcast services comprise several streams (typically at least one audio, video and text sub-titling stream).
  • the system may include many encoders, but in the figure only one encoder 1520 is represented to simplify the description without a lack of generality.
  • the coded media bitstream may be transferred to a storage 1530.
  • the storage 1530 may comprise any type of mass memory to store the coded media bitstream.
  • the format of the coded media bitstream in the storage 1530 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file, or the coded media bitstream may be encapsulated into a Segment format suitable for DASH (or a similar streaming system) and stored as a sequence of Segments.
  • a file generator (not shown in the figure) may be used to store the one or more media bitstreams in the file and create file format metadata, which may also be stored in the file.
  • the encoder 1520 or the storage 1530 may comprise the file generator, or the file generator is operationally attached to either the encoder 1520 or the storage 1530.
  • Some systems operate “live”, i.e. omit storage and transfer coded media bitstream from the encoder 1520 directly to the sender 1540. The coded media bitstream may then be transferred to the sender 1540, also referred to as the server, on a need basis.
  • the format used in the transmission may be an elementary self-contained bitstream format, a packet stream format, a Segment format suitable for DASH (or a similar streaming system), or one or more coded media bitstreams may be encapsulated into a container file.
  • the encoder 1520, the storage 1530, and the server 1540 may reside in the same physical device or they may be included in separate devices.
  • the encoder 1520 and server 1540 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently, but rather buffered for small periods of time in the content encoder 1520 and/or in the server 1540 to smooth out variations in processing delay, transfer delay, and coded media bitrate.
  • the server 1540 sends the coded media bitstream using a communication protocol stack.
  • the stack may include but is not limited to one or more of Real-Time Transport Protocol (RTP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), Transmission Control Protocol (TCP), and Internet Protocol (IP).
  • RTP Real-Time Transport Protocol
  • UDP User Datagram Protocol
  • HTTP Hypertext Transfer Protocol
  • TCP Transmission Control Protocol
  • IP Internet Protocol
  • the server 1540 encapsulates the coded media bitstream into packets.
  • the sender 1540 may comprise or be operationally attached to a "sending file parser" (not shown in the figure).
  • a sending file parser locates appropriate parts of the coded media bitstream to be conveyed over the communication protocol.
  • the sending file parser may also help in creating the correct format for the communication protocol, such as packet headers and payloads.
  • the multimedia container file may contain encapsulation instructions, such as hint tracks in the ISOBMFF, for encapsulation of the at least one of the contained media bitstream on the communication protocol.
  • the server 1540 may or may not be connected to a gateway 1550 through a communication network, which may e.g. be a combination of a CDN, the Internet and/or one or more access networks.
  • the gateway may also or alternatively be referred to as a middle-box.
  • the gateway may be an edge server (of a CDN) or a web proxy. It is noted that the system may generally comprise any number of gateways or the like, but for the sake of simplicity, the following description only considers one gateway 1550.
  • the gateway 1550 may perform different types of functions, such as translation of a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking of data streams, and manipulation of data stream according to the downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded stream according to prevailing downlink network conditions.
  • the gateway 1550 may be a server entity in various embodiments.
  • the system includes one or more receivers 1560, typically capable of receiving, demodulating, and decapsulating the transmitted signal into a coded media bitstream.
  • the coded media bitstream may be transferred to a recording storage 1570.
  • the recording storage 1570 may comprise any type of mass memory to store the coded media bitstream.
  • the recording storage 1570 may alternatively or additionally comprise computation memory, such as random access memory.
  • the format of the coded media bitstream in the recording storage 1570 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. If there are multiple coded media bitstreams, such as an audio stream and a video stream, associated with each other, a container file is typically used and the receiver 1560 comprises or is attached to a container file generator producing a container file from input streams. Some systems operate “live,” i.e. omit the recording storage 1570 and transfer coded media bitstream from the receiver 1560 directly to the decoder 1580.
  • the coded media bitstream may be transferred from the recording storage 1570 to the decoder 1580. If there are many coded media bitstreams, such as an audio stream and a video stream, associated with each other and encapsulated into a container file or a single media bitstream is encapsulated in a container file e.g. for easier access, a file parser (not shown in the figure) is used to decapsulate each coded media bitstream from the container file.
  • the recording storage 1570 or a decoder 1580 may comprise the file parser, or the file parser is attached to either recording storage 1570 or the decoder 1580. It should also be noted that the system may include many decoders, but here only one decoder 1580 is discussed to simplify the description without a lack of generality.
  • the coded media bitstream may be processed further by a decoder 1580, whose output is one or more uncompressed media streams.
  • a renderer 1590 may reproduce the uncompressed media streams with a loudspeaker or a display, for example.
  • the receiver 1560, recording storage 1570, decoder 1580, and renderer 1590 may reside in the same physical device or they may be included in separate devices.
  • a sender 1540 and/or a gateway 1550 may be configured to perform switching between different representations e.g. for switching between different viewports of 360-degree video content, view switching, bitrate adaptation and/or fast start-up, and/or a sender 1540 and/or a gateway 1550 may be configured to select the transmitted representation(s). Switching between different representations may take place for multiple reasons, such as to respond to requests of the receiver 1560 or prevailing conditions, such as throughput, of the network over which the bitstream is conveyed. In other words, the receiver 1560 may initiate switching between representations.
  • a request from the receiver can be, e.g., a request for a Segment or a Subsegment from a different representation than earlier, a request for a change of transmitted scalability layers and/or sub-layers, or a change of a rendering device having different capabilities compared to the previous one.
  • a request for a Segment may be an HTTP GET request.
  • a request for a Subsegment may be an HTTP GET request with a byte range.
  • bitrate adjustment or bitrate adaptation may be used for example for providing so-called fast start-up in streaming services, where the bitrate of the transmitted stream is lower than the channel bitrate after starting or random-accessing the streaming in order to start playback immediately and to achieve a buffer occupancy level that tolerates occasional packet delays and/or retransmissions.
  • Bitrate adaptation may include multiple representation or layer up-switching and representation or layer down-switching operations taking place in various orders.
  • a decoder 1580 may be configured to perform switching between different representations e.g. for switching between different viewports of 360-degree video content, viewpoint switching, bitrate adaptation and/or fast start-up, and/or a decoder 1580 may be configured to select the transmitted representation(s).
  • the decoder may comprise means for requesting at least one decoder reset picture of the second representation for carrying out bitrate adaptation between the first representation and a third representation.
  • Faster decoding operation might be needed for example if the device including the decoder 1580 is multi-tasking and uses computing resources for other purposes than decoding the video bitstream.
  • faster decoding operation might be needed when content is played back at a faster pace than the normal playback speed, e.g. twice or three times faster than conventional real-time playback rate.
  • said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP).
  • MPD Media Presentation Description
  • SDP IETF Session Description Protocol
  • said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
  • the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
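The multi-track container mapping described in the list above, where V-PCC units are mapped to individual tracks based on their types, can be illustrated with a minimal Python sketch. The unit type labels, the track names and the `VpccUnit` class are hypothetical and chosen only for illustration; the sketch does not write an actual ISOBMFF file, it only groups payloads per track.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VpccUnit:
    unit_type: str   # hypothetical labels: "atlas", "occupancy", "geometry", "attribute"
    payload: bytes


# Assumed mapping: atlas data goes to the V-PCC track, each video encoded
# sub-bitstream goes to its own component track.
TRACK_FOR_UNIT_TYPE = {
    "atlas": "vpcc_track",
    "occupancy": "occupancy_track",
    "geometry": "geometry_track",
    "attribute": "attribute_track",
}


def map_units_to_tracks(units: List[VpccUnit]) -> Dict[str, List[bytes]]:
    """Group unit payloads by destination track, preserving stream order."""
    tracks: Dict[str, List[bytes]] = defaultdict(list)
    for unit in units:
        tracks[TRACK_FOR_UNIT_TYPE[unit.unit_type]].append(unit.payload)
    return dict(tracks)


if __name__ == "__main__":
    stream = [
        VpccUnit("atlas", b"a0"),
        VpccUnit("geometry", b"g0"),
        VpccUnit("attribute", b"t0"),
        VpccUnit("atlas", b"a1"),
    ]
    for track, samples in map_units_to_tracks(stream).items():
        print(track, samples)
```

In the single-track case the same stream would simply be stored in order as one track, without this grouping step.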

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

There are disclosed methods, apparatuses and computer program products for volumetric video compression. In accordance with an embodiment, the method comprises obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate atlases video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate atlas settings and patch information for the two or more different encoded components.

Description

AN APPARATUS, A METHOD AND A COMPUTER PROGRAM FOR VOLUMETRIC VIDEO

TECHNICAL FIELD

[0001] The present invention relates to an apparatus, a method and a computer program for volumetric video compression.

BACKGROUND

[0002] In the field of volumetric video compression and 3 degrees-of-freedom and greater (3DoF+) video, a character or a scene captured with a set of depth cameras or synthetically modelled and animated as a three dimensional (3D) scene, can be encoded as a volumetric video. Volumetric video compression typically segments the 3D content into a set of two dimensional (2D) patches containing color and geometry data, which can then be compressed using a standard 2D video compression format. Thus, color and geometry data can be considered as components of volumetric video. Volumetric video compression is currently typically being explored and standardized in the MPEG-I Point Cloud Compression (PCC) and 3DoF+ efforts, for example.

[0003] A proposed approach for 3DoF+ volumetric video compression can use 3D scene segmentation to generate views that can be packed into atlases and efficiently encoded using existing 2D compression technologies such as H.265 or H.264. For the end-user to consume such content, a standard metadata format may need to be defined that efficiently describes information required for view synthesis. The current video-based point cloud compression (V-PCC) specification defines similar metadata structures while setting limits to patch packing strategies by defining shared patch layouts for all components (color, depth, etc.) of volumetric video.

[0004] Thus, there is an ongoing need for metadata formats and structures that enable application of different patch packing methods for different components of 3D video that result in significantly smaller atlas sizes and overall bitrates. V-PCC standardization in MPEG defines many structures that such an approach can leverage.
However, as the separation of patch layouts for color and depth components is not supported, the application of packing strategies for individual components is limited under currently available approaches. Patch packing is a bin packing problem, and many optimizations exist for coming up with an optimal atlas layout (e.g., sprite texture packing).

[0005] The V-PCC (23090-5) and MIV (23090-12) specifications do not accommodate the flexibility to define different patch layouts per video encoded component, and the encapsulation of V-PCC (23090-10) does not consider that each video encoded component can have its own metadata (atlas/patch layout) information. As such, signalling and storage to enable separation of patch layouts is yet to be explored. Therefore, there is also an ongoing need for approaches for applying different packing methods for each component of volumetric video that define an accompanying metadata format that supports view reconstruction by client devices.

SUMMARY

[0006] Now, an improved method and technical equipment implementing the method has been invented, by which the above problems are alleviated. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program, or a signal stored therein, which are characterized by what is stated in the independent claims. Various details of the embodiments are disclosed in the dependent claims and in the corresponding images and description.

[0007] The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

[0008] According to an embodiment, there is provided description of signalling related to metadata to separate patch layout for different video encoded components using a V-PCC compatible design.
A new 3VC specific SEI (supplementary enhancement information) message for V-PCC bitstream may be used, such as a separate_atlas_component(). The SEI message may be inserted in a NAL stream (also referred to as the atlas bitstream), signalling which component the following or preceding NAL units are applied to. The SEI message may be defined as a prefix or a suffix. In accordance with an embodiment, if the SEI message does not exist in an atlas_sub_bitstream(), NAL units are applied to all related video encoded components.

[0009] This kind of design may provide flexibility to signal per component NAL units, which may enable signalling different patch layouts and parameter sets for each video encoded component.

[0010] The new SEI message may contain at least a component_type attribute as well as an attribute_type attribute.

[0011] Adding an indication in a NAL unit header() of which video encoded component each NAL unit should be applied to allows flexibility for signalling different patch layouts.

[0012] A default value for component type could be assigned to indicate that NAL units are applied to all video encoded components.

[0013] Patch layouts may be signalled in separate tracks of timed metadata per video encoded component describing patch layout.

[0014] Each layer of an atlas contains a different patch layout. Each video component or group of video components is assigned to a different layer of an atlas (distinguished by nuh_layer_id). Therefore, the mapping of atlas layer to video component or group of video components may be signalled.

[0015] The linkage of atlas nuh_layer_id and video component can be done on V-PCC parameter set level (V-PCC unit type of VPCC_VPS), on atlas sequence parameter level or on atlas sequence parameter level. All the parameter sets have an extension mechanism that can be utilized to provide such information.
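The scoping behaviour of the separate_atlas_component() SEI message described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the normative parsing process: the SEI message is treated as a prefix that applies to all following NAL units until the next such message, the `NalUnit` class and the `kind` labels are hypothetical, and only the component_type and attribute_type fields named in the text are modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical label marking the SEI message in this sketch.
SEI_SEPARATE_ATLAS_COMPONENT = "sei_separate_atlas_component"
# Default scope: NAL units are applied to all video encoded components.
ALL_COMPONENTS = ("all", None)

Scope = Tuple[Optional[str], Optional[str]]


@dataclass
class NalUnit:
    kind: str
    payload: bytes = b""
    component_type: Optional[str] = None   # set on the SEI message only
    attribute_type: Optional[str] = None   # set on the SEI message only


def assign_components(nal_stream: List[NalUnit]) -> List[Tuple[Scope, bytes]]:
    """Pair each non-SEI NAL unit with the component scope it applies to."""
    scope: Scope = ALL_COMPONENTS
    assigned = []
    for nal in nal_stream:
        if nal.kind == SEI_SEPARATE_ATLAS_COMPONENT:
            # Assumption: a prefix SEI changes the scope for the units that follow.
            scope = (nal.component_type, nal.attribute_type)
            continue
        assigned.append((scope, nal.payload))
    return assigned


if __name__ == "__main__":
    stream = [
        NalUnit("atlas_params", b"shared"),
        NalUnit(SEI_SEPARATE_ATLAS_COMPONENT, component_type="geometry"),
        NalUnit("patch_data", b"geo-layout"),
        NalUnit(SEI_SEPARATE_ATLAS_COMPONENT,
                component_type="attribute", attribute_type="texture"),
        NalUnit("patch_data", b"tex-layout"),
    ]
    for scope, payload in assign_components(stream):
        print(scope, payload)
```

A stream without any such SEI message keeps the default scope throughout, which corresponds to the case where NAL units are applied to all related video encoded components.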
[0016] An advantage of some embodiments is to improve compression of 3VC V-PCC and 3VC MIV, enabling application of different packing strategies for different video encoded component types. In addition, the patch layout separation provides flexibility to develop and carry novel packing strategies that are not yet known, thus future proofing the related technologies.

[0017] According to a first aspect, there is provided a method comprising obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.

[0018] An apparatus according to a second aspect comprises at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a presentation comprising volumetric video content; obtain two or more components of the volumetric video content; pack the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signal, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
[0019] An apparatus according to a third aspect comprises means for: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.

[0020] A method according to a fourth aspect comprises receiving a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; depacking the two or more components of the volumetric video content from separate patches by using the settings and patch information.
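The encoding and decoding aspects above can be illustrated with a small Python sketch in which each component is packed into its own atlas with its own patch layout, and the per-component settings and patch information are returned alongside the packed data. This is a sketch under stated assumptions: patches are plain 2D lists of integers, a simple shelf packing heuristic stands in for whatever packing strategy an encoder would choose, no actual video encoding is performed, and all function and field names are hypothetical.

```python
from typing import Dict, List, Tuple

Patch = List[List[int]]


def pack_component(patches: List[Patch], atlas_width: int) -> Tuple[Patch, List[dict]]:
    """Shelf-pack patches left to right into one atlas; return (atlas, layout)."""
    layout = []
    x = y = shelf_height = 0
    for patch in patches:
        h, w = len(patch), len(patch[0])
        if w > atlas_width:
            raise ValueError("patch is wider than the atlas")
        if x + w > atlas_width:          # row is full: open a new shelf below
            x, y, shelf_height = 0, y + shelf_height, 0
        layout.append({"x": x, "y": y, "w": w, "h": h})
        x += w
        shelf_height = max(shelf_height, h)
    atlas = [[0] * atlas_width for _ in range(y + shelf_height)]
    for patch, pos in zip(patches, layout):
        for r, row in enumerate(patch):
            atlas[pos["y"] + r][pos["x"]:pos["x"] + pos["w"]] = row
    return atlas, layout


def depack_component(atlas: Patch, layout: List[dict]) -> List[Patch]:
    """Cut the patches back out of an atlas using the signalled layout."""
    return [[atlas[p["y"] + r][p["x"]:p["x"] + p["w"]] for r in range(p["h"])]
            for p in layout]


def encode(components: Dict[str, List[Patch]], atlas_widths: Dict[str, int]):
    """Pack every component separately; signal separate settings and patch info."""
    atlases, signalling = {}, {}
    for name, patches in components.items():
        atlas, layout = pack_component(patches, atlas_widths[name])
        atlases[name] = atlas
        signalling[name] = {"atlas_width": atlas_widths[name],
                            "atlas_height": len(atlas),
                            "patches": layout}
    return atlases, signalling


def decode(atlases, signalling) -> Dict[str, List[Patch]]:
    return {name: depack_component(atlases[name], signalling[name]["patches"])
            for name in atlases}


if __name__ == "__main__":
    components = {
        "texture": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],   # full-resolution patches
        "depth": [[[9]], [[8]]],                           # downscaled patches
    }
    atlases, signalling = encode(components, {"texture": 2, "depth": 2})
    print(signalling)
    print(decode(atlases, signalling) == components)
```

Because each component carries its own layout, the depth atlas in this example ends up smaller than the texture atlas, which is the kind of saving that a shared patch layout would not allow.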
BRIEF DESCRIPTION OF THE DRAWINGS

[0021] For a more complete understanding of the example embodiments, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:

[0022] Fig.1a shows an encoder and decoder for encoding and decoding omnidirectional video content according to OMAF standard;

[0023] Fig.1b shows an example of image stitching, projection and region-wise packing;

[0024] Fig.1c shows an example of a process of forming a monoscopic equirectangular panorama picture;

[0025] Figs.2a and 2b show a compression and a decompression process for 3D volumetric video;

[0026] Figs.3a and 3b show an example of a point cloud frame and a projection of points to a corresponding plane of a point cloud bounding box;

[0027] Fig.4 illustrates compression of metadata for volumetric video scene as a video compression pipeline;

[0028] Fig.5 shows a flow chart for signaling overlay content according to an embodiment;

[0029] Fig.6 illustrates an example of a multi-track ISOBMFF V-PCC container;

[0030] Fig.7 illustrates an apparatus of an example embodiment.

[0031] Fig.8 shows a schematic diagram of an example multimedia communication system within which various embodiments may be implemented.

DETAILED DESCRIPTION OF SOME EXAMPLE EMBODIMENTS

[0032] In the following, several embodiments of the invention will be described in the context of omnidirectional video coding and point cloud compressed (PCC) objects. It is to be noted, however, that the invention is not limited to PCC objects.

[0033] Volumetric video data represents a three-dimensional scene or object and can be used as input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. colour, opacity, reflectance, …), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video).
Volumetric video is either generated from 3D models, i.e. computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxel. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time. Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.

[0034] Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense Voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in the multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.

[0035] In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points.
If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes, voxel, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxel. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

[0036] Alternatively, a 3D scene, represented as meshes, points, and/or voxel, can be projected onto one, or more, geometries. These geometries are “unfolded” onto 2D planes with two planes per geometry, one plane for texture and one plane for depth. The 2D planes are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted along with the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format). Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with efficient temporal compression. Thus, coding efficiency is increased. Using geometry-projections instead of prior 2D-video based approaches, e.g. multiview and depth, provides better coverage of the scene or object. Thus, 3DOF+ (e.g., 6DoF) capabilities are improved. Using several geometries for individual objects further improves the coverage of the scene. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes.
The projection and reverse projection steps are of low complexity.

[0037] According to some embodiments, volumetric video compression can often generate an array of patches by decomposing the point cloud data into a plurality of patches. The patches are mapped to a 2D grid and, in some instances, an occupancy map is generated from any of a variety of attributes (such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like), where occupied pixels are pixels which have valid attribute values, e.g., depth values and/or color values. Geometry images, texture images and/or the like may then be generated for subsequent storage and/or transmission. In some embodiments, the compressed images may thereafter be decompressed and the geometry and texture may be reconstructed, such that the image may then be viewed.

[0038] In projection-based volumetric video compression, a 3D surface is projected onto a 2D grid. The 2D grid has a finite resolution. Thus, in some embodiments, two or more points of the 3D surface may be projected on the same 2D pixel location. The image generation process exploits the 3D to 2D mapping to store the geometry and texture of the point cloud as images. In order to address multiple points being projected to the same pixel, each patch is projected onto two images, referred to as layers (or maps). In some instances, the first geometry layer is encoded as it is and the second geometry layer is encoded as a delta to the first layer. Texture frames may be generated similarly, but both texture layer 1 and layer 2 may be encoded as separated texture frames.

[0039] In an effort to retain the high frequency features, one approach involves absolute coding with reconstruction correction.
Another approach to retain the high frequency features involves geometry-based point interpolation. In some embodiments, the compression efficiency of geometry images is improved by replacing some of the geometry information explicitly encoded using geometry images by a point interpolation algorithm. [0040] In contrast to traditional 2D cameras enabling to capture a relatively narrow field of view, three-dimensional (3D) cameras are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output devices, such as head-mounted displays (HMD), and other devices, allow a person to see the 360-degree visual content. [0041] Available media file format standards include International Standards Organization (ISO) base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file format for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and High Efficiency Video Coding standard (HEVC or H.265/HEVC). [0042] Some concepts, structures, and specifications of ISOBMFF are described below as an example of a container file format, based on which the embodiments may be implemented. The aspects of the invention are not limited to ISOBMFF, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized. [0043] In the following, term “omnidirectional” may refer to media content that may have greater spatial extent than a field-of-view of a device rendering the content. 
Omnidirectional content may for example cover substantially 360 degrees in the horizontal dimension and substantially 180 degrees in the vertical dimension, but omnidirectional may also refer to content covering less than a 360-degree view in the horizontal direction and/or a 180-degree view in the vertical direction. [0044] A panoramic image covering a 360-degree field-of-view horizontally and a 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using the equirectangular projection (ERP). In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. In some cases, panoramic content with a 360-degree horizontal field-of-view, but with less than a 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane. In some cases, panoramic content may have less than a 360-degree horizontal field-of-view and up to a 180-degree vertical field-of-view, while otherwise having the characteristics of an equirectangular projection format. [0045] Consumption of immersive multimedia, such as omnidirectional content, is more complex for the end user than the consumption of 2D content. This is due to the higher degree of freedom available to the end user. The freedom also results in more uncertainty. The MPEG Omnidirectional Media Format (OMAF; ISO/IEC 23090-2) v1 standardized the omnidirectional streaming of single 3DoF (3 Degrees of Freedom) content, where the viewer is located at the centre of a unit sphere and has three degrees of freedom (Yaw-Pitch-Roll).
The following standardization phase, OMAFv2, is expected to enable multiple 3DoF and 3DoF+ content consumption with user interaction and means to optimize the Viewport Dependent Streaming (VDS) operations and bandwidth management. [0046] A viewport may be defined as a region of omnidirectional image or video suitable for display and viewing by the user. A current viewport (which may sometimes be referred to simply as a viewport) may be defined as the part of the spherical video that is currently displayed and hence is viewable by the user(s). At any point of time, a video rendered by an application on a head-mounted display (HMD) shows a portion of the 360-degree video, which is referred to as a viewport. Likewise, when viewing a spatial part of the 360-degree content on a conventional display, the spatial part that is currently displayed is a viewport. A viewport is a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. A viewport may be characterized by a horizontal field-of-view (VHFoV) and a vertical field-of-view (VVFoV). [0047] The 360-degree space may be divided into a discrete set of viewports, each separated by a given distance (e.g., expressed in degrees), so that the omnidirectional space can be imagined as a map of overlapping viewports, and the viewport is switched discretely as the user changes his/her orientation while watching content with a head-mounted display (HMD). When the overlapping between viewports is reduced to zero, the viewports can be imagined as adjacent non-overlapping tiles (patches) within the 360-degree space. [0048] When streaming VR video, a subset of 360-degree video content covering the viewport (i.e., the current view orientation) may be transmitted at the best quality/resolution, while the remainder of the 360-degree video may be transmitted at a lower quality/resolution.
This is what characterizes a VDS system, as opposed to a Viewport Independent Streaming system, where the omnidirectional video is streamed at high quality in all directions. [0049] Fig. 1a illustrates the OMAF system architecture. The system can be situated in a video camera, or in a network server, for example. As shown in Fig. 1a, omnidirectional media (A) is acquired. If the OMAF system is part of the video source, the omnidirectional media (A) is acquired from the camera means. If the OMAF system is in a network server, the omnidirectional media (A) is acquired from a video source over a network. A real-world audio-visual scene (A) may be captured 120 by audio sensors as well as a set of cameras or a camera device with multiple lenses and sensors. The acquisition results in a set of digital image/video (Bi) and audio (Ba) signals. The cameras/lenses may cover all directions around the center point of the camera set or camera device, hence the name 360-degree video. [0050] Audio can be captured using many different microphone configurations and stored as several different content formats, including channel-based signals, static or dynamic (i.e. moving through the 3D scene) object signals, and scene-based signals (e.g., Higher Order Ambisonics). The channel-based signals may conform to one of the loudspeaker layouts defined in CICP (Coding-Independent Code-Points). In an omnidirectional media application, the loudspeaker layout signals of the rendered immersive audio program may be binauralized for presentation via headphones. [0051] The images (Bi) of the same time instance are stitched, projected, and mapped 121 onto a packed picture (D). [0052] For monoscopic 360-degree video, the input images of one time instance may be stitched to generate a projected picture representing one view. An example of image stitching, projection, and region-wise packing process for monoscopic content is illustrated with Fig. 1b.
Input images (Bi) are stitched and projected onto a three-dimensional projection structure that may for example be a unit sphere. The projection structure may be considered to comprise one or more surfaces, such as plane(s) or part(s) thereof. A projection structure may be defined as a three-dimensional structure consisting of one or more surface(s) on which the captured VR image/video content is projected, and from which a respective projected picture can be formed. The image data on the projection structure is further arranged onto a two-dimensional projected picture (C). The term projection may be defined as a process by which a set of input images are projected onto a projected picture. There may be a pre-defined set of representation formats of the projected picture, including for example an equirectangular projection (ERP) format and a cube map projection (CMP) format. It may be considered that the projected picture covers the entire sphere. [0053] Optionally, a region-wise packing is then applied to map the projected picture (C) onto a packed picture (D). If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding. Otherwise, regions of the projected picture (C) are mapped onto a packed picture (D) by indicating the location, shape, and size of each region in the packed picture, and the packed picture (D) is given as input to image/video encoding. The term region-wise packing may be defined as a process by which a projected picture is mapped to a packed picture. The term packed picture may be defined as a picture that results from region-wise packing of a projected picture. [0054] In the case of stereoscopic 360-degree video, as shown in an example of Fig. 1b, the input images of one time instance are stitched to generate a projected picture representing two views (CL, CR), one for each eye.
Both views (CL, CR) can be mapped onto the same packed picture (D), and encoded by a traditional 2D video encoder. Alternatively, each view of the projected picture can be mapped to its own packed picture, in which case the image stitching, projection, and region-wise packing are performed as illustrated in Fig. 1b. A sequence of packed pictures of either the left view or the right view can be independently coded or, when using a multiview video encoder, predicted from the other view. [0055] An example of image stitching, projection, and region-wise packing process for stereoscopic content where both views are mapped onto the same packed picture, as shown in Fig. 1a, is described next in more detail. Input images (Bi) are stitched and projected onto two three-dimensional projection structures, one for each eye. The image data on each projection structure is further arranged onto a two-dimensional projected picture (CL for left eye, CR for right eye), which covers the entire sphere. Frame packing is applied to pack the left view picture and right view picture onto the same projected picture. Optionally, region-wise packing is then applied to pack the projected picture onto a packed picture, and the packed picture (D) is given as input to image/video encoding. If the region-wise packing is not applied, the packed picture is identical to the projected picture, and this picture is given as input to image/video encoding. [0056] The image stitching, projection, and region-wise packing process can be carried out multiple times for the same source images to create different versions of the same content, e.g. for different orientations of the projection structure. Similarly, the region-wise packing process can be performed multiple times from the same projected picture to create more than one sequence of packed pictures to be encoded.
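As an illustration of the region-wise packing step described above, the following minimal Python sketch copies rectangular regions of a projected picture into a packed picture, using nearest-neighbour resampling when the source and destination sizes differ. The function name pack_region, the (left, top, width, height) rectangle layout, and the list-of-lists picture representation are assumptions made for this example only and are not taken from any specification.

```python
def pack_region(projected, packed, src, dst):
    """Copy the src rectangle of `projected` into the dst rectangle of `packed`.

    Rectangles are (left, top, width, height) in samples; this layout is a
    hypothetical choice for the sketch. Nearest-neighbour resampling is used
    when the two rectangles differ in size.
    """
    sx, sy, sw, sh = src
    dx, dy, dw, dh = dst
    for j in range(dh):
        # Source row that corresponds to destination row j
        src_row = sy + (j * sh) // dh
        for i in range(dw):
            # Source column that corresponds to destination column i
            src_col = sx + (i * sw) // dw
            packed[dy + j][dx + i] = projected[src_row][src_col]
    return packed


# 4x8 projected picture whose sample value encodes its (row, column) position
projected = [[10 * r + c for c in range(8)] for r in range(4)]
packed = [[0] * 6 for _ in range(4)]

# Left half is kept at full resolution; right half is down-sampled
# horizontally by a factor of two.
pack_region(projected, packed, src=(0, 0, 4, 4), dst=(0, 0, 4, 4))
pack_region(projected, packed, src=(4, 0, 4, 4), dst=(4, 0, 2, 4))
```

In this sketch the packed picture is 6 samples wide instead of 8, which mirrors the idea that less important regions may be carried at a lower sampling density.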
[0057] 360-degree panoramic content (i.e., images and video) covers horizontally the full 360-degree field-of-view around the capturing position of an imaging device. The vertical field-of-view may vary and can be e.g. 180 degrees. A panoramic image covering 360-degree field-of-view horizontally and 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection (ERP). In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. The process of forming a monoscopic equirectangular panorama picture is illustrated in Fig. 1c. A set of input images 111, such as fisheye images of a camera array or a camera device with multiple lenses and sensors, is stitched 112 onto a spherical image 113. The spherical image is further projected 114 onto a cylinder 115 (without the top and bottom faces). The cylinder is unfolded 116 to form a two-dimensional projected picture 117. In practice one or more of the presented steps may be merged; for example, the input images may be directly projected onto a cylinder without an intermediate projection onto a sphere. The projection structure for equirectangular panorama may be considered to be a cylinder that comprises a single surface. [0058] In general, 360-degree content can be mapped onto different types of solid geometrical structures, such as a polyhedron (i.e. a three-dimensional solid object containing flat polygonal faces, straight edges and sharp corners or vertices, e.g., a cube or a pyramid), cylinder (by projecting a spherical image onto the cylinder, as described above with the equirectangular projection), cylinder (directly without projecting onto a sphere first), cone, etc. and then unwrapped to a two-dimensional image plane.
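The equirectangular relationship between longitude/latitude and picture coordinates described above can be sketched in Python as follows. The function names, the choice of placing longitude -180 degrees at the left edge and latitude +90 degrees at the top edge, and the use of continuous (non-integer) picture coordinates are assumptions of this example; a particular specification may define the direction and offsets differently.

```python
def sphere_to_erp(lon_deg, lat_deg, width, height):
    """Map (longitude, latitude) in degrees to continuous ERP coordinates.

    Assumed convention: longitude -180 is the left edge, latitude +90 the
    top edge. Both axes are linear in the angle, i.e. no further
    transformation is applied.
    """
    u = (lon_deg + 180.0) / 360.0 * width
    v = (90.0 - lat_deg) / 180.0 * height
    return u, v


def erp_to_sphere(u, v, width, height):
    """Inverse of sphere_to_erp under the same assumed convention."""
    lon_deg = u / width * 360.0 - 180.0
    lat_deg = 90.0 - v / height * 180.0
    return lon_deg, lat_deg


# The point straight ahead on the equator lands in the picture centre.
centre = sphere_to_erp(0.0, 0.0, 4096, 2048)
```

Because each picture row spans the full 360 degrees regardless of latitude, rows near the top and bottom of the picture represent much smaller circles on the sphere than the equator row does, which is why the polar areas are oversampled in this format.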
[0059] In some cases panoramic content with 360-degree horizontal field-of-view but with less than 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane. In some cases a panoramic image may have less than 360-degree horizontal field-of-view and up to 180-degree vertical field-of-view, while otherwise having the characteristics of equirectangular projection format. [0060] In 360-degree systems, a coordinate system may be defined through orthogonal coordinate axes, such as X (lateral), Y (vertical, pointing upwards), and Z (back-to-front axis, pointing outwards). Rotations around the axes may be defined and may be referred to as yaw, pitch, and roll. Yaw may be defined to rotate around the Y axis, pitch around the X axis, and roll around the Z axis. Rotations may be defined to be extrinsic, i.e., around the X, Y, and Z fixed reference axes. The angles may be defined to increase clockwise when looking from the origin towards the positive end of an axis. The coordinate system specified can be used for defining the sphere coordinates, which may be referred to as azimuth (Φ) and elevation (θ). [0061] Global coordinate axes may be defined as coordinate axes, e.g. according to the coordinate system as discussed above, that are associated with audio, video, and images representing the same acquisition position and intended to be rendered together. The origin of the global coordinate axes is usually the same as the center point of a device or rig used for omnidirectional audio/video acquisition as well as the position of the observer's head in the three-dimensional space in which the audio and video tracks are located. In the absence of the initial viewpoint metadata, the playback may be recommended to be started using the orientation (0, 0) in (azimuth, elevation) relative to the global coordinate axes.
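As a rough illustration of the sphere coordinates and the yaw rotation discussed above, the following Python sketch converts an (azimuth, elevation) pair into a unit direction vector and rotates a vector around the Y axis. The helper names, the choice that (0, 0) points along the positive Z axis, and the sign of the rotation (a right-hand-rule rotation matrix) are assumptions made for this sketch; the handedness and sign conventions of an actual system need to be taken from the applicable specification.

```python
import math


def direction(azimuth_deg, elevation_deg):
    """Unit vector for (azimuth, elevation) in degrees.

    Assumed convention: (0, 0) points along +Z, elevation tilts towards +Y
    (upwards), and azimuth turns towards +X.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.sin(az)
    y = math.sin(el)
    z = math.cos(el) * math.cos(az)
    return (x, y, z)


def rotate_yaw(v, yaw_deg):
    """Rotate vector v around the fixed Y axis by yaw_deg.

    Uses the standard right-hand-rule rotation matrix about Y (assumed sign).
    """
    a = math.radians(yaw_deg)
    x, y, z = v
    return (math.cos(a) * x + math.sin(a) * z,
            y,
            -math.sin(a) * x + math.cos(a) * z)


# Default playback orientation (0, 0), then the same direction after a
# 90-degree yaw.
forward = direction(0.0, 0.0)
turned = rotate_yaw(forward, 90.0)
```

Under these assumptions, applying a yaw of a given angle to the default orientation yields the same vector as direction(angle, 0), so azimuth and yaw describe the same turn around the vertical axis.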
[0062] As mentioned above, the projection structure may be rotated relative to the global coordinate axes. The rotation may be performed for example to achieve better compression performance based on the spatial and temporal activity of the content at certain spherical parts. Alternatively or additionally, the rotation may be performed to adjust the rendering orientation for already encoded content. For example, if the horizon of the encoded content is not horizontal, it may be adjusted afterwards by indicating that the projection structure is rotated relative to the global coordinate axes. The projection orientation may be indicated as yaw, pitch, and roll angles that define the orientation of the projection structure relative to the global coordinate axes. The projection orientation may be included e.g. in a box in a sample entry of an ISOBMFF track for omnidirectional video. [0063] 360-degree panoramic content (i.e., images and video) covers horizontally (up to) the full 360-degree field-of-view around the capturing position of an imaging device. The vertical field-of-view may vary and can be e.g. 180 degrees. A panoramic image covering 360-degree field-of-view horizontally and 180-degree field-of-view vertically can be represented by a sphere that has been mapped to a two-dimensional image plane using equirectangular projection (ERP). In this case, the horizontal coordinate may be considered equivalent to a longitude, and the vertical coordinate may be considered equivalent to a latitude, with no transformation or scaling applied. In some cases panoramic content with 360-degree horizontal field-of-view but with less than 180-degree vertical field-of-view may be considered special cases of equirectangular projection, where the polar areas of the sphere have not been mapped onto the two-dimensional image plane.
In some cases panoramic content may have less than 360-degree horizontal field-of-view and up to 180-degree vertical field-of-view, while otherwise having the characteristics of equirectangular projection format. [0064] In cube map projection format, spherical video is projected onto the six faces (a.k.a. sides) of a cube. The cube map may be generated e.g. by first rendering the spherical scene six times from a viewpoint, with the views defined by a 90 degree view frustum representing each cube face. The cube sides may be frame-packed into the same frame or each cube side may be treated individually (e.g. in encoding). There are many possible orders of locating cube sides onto a frame and/or cube sides may be rotated or mirrored. The frame width and height for frame-packing may be selected to fit the cube sides "tightly" e.g. at 3x2 cube side grid, or may include unused constituent frames e.g. at 4x3 cube side grid. [0065] A cube map can be stereoscopic. A stereoscopic cube map can e.g. be obtained by re-projecting each view of a stereoscopic panorama to the cube map format. [0066] Region-wise packing information may be encoded as metadata in or along the bitstream. For example, the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g. from a projected picture to a packed picture, as described earlier. [0067] Rectangular region-wise packing metadata may be described as follows: [0068] For each region, the metadata defines a rectangle in a projected picture, the respective rectangle in the packed picture, and an optional transformation of rotation by 90, 180, or 270 degrees and/or horizontal and/or vertical mirroring. Rectangles may, for example, be indicated by the locations of the top-left corner and the bottom-right corner. The mapping may comprise resampling.
As the sizes of the respective rectangles can differ in the projected and packed pictures, the mechanism infers region-wise resampling. [0069] Among other things, region-wise packing provides signalling for the following usage scenarios: 1) Additional compression for viewport-independent projections is achieved by densifying sampling of different regions to achieve more uniformity across the sphere. For example, the top and bottom parts of ERP are oversampled, and region-wise packing can be applied to down-sample them horizontally. 2) Arranging the faces of plane-based projection formats, such as cube map projection, in an adaptive manner. 3) Generating viewport-dependent bitstreams that use viewport-independent projection formats. For example, regions of ERP or faces of CMP can have different sampling densities and the underlying projection structure can have different orientations. 4) Indicating regions of the packed pictures represented by an extractor track. This is needed when an extractor track collects tiles from bitstreams of different resolutions. [0070] A guard band may be defined as an area in a packed picture that is not rendered but may be used to improve the rendered part of the packed picture to avoid or mitigate visual artifacts such as seams. [0071] Referring again to Fig. 1a, the OMAF allows the image stitching, projection, and region-wise packing to be omitted and the image/video data to be encoded in their captured format. In this case, images (D) are considered the same as images (Bi) and a limited number of fisheye images per time instance are encoded. [0072] For audio, the stitching process is not needed, since the captured signals are inherently immersive and omnidirectional. [0073] The stitched images (D) are encoded 206 as coded images (Ei) or a coded video bitstream (Ev). The captured audio (Ba) is encoded 122 as an audio bitstream (Ea).
The coded images, video, and/or audio are then composed 124 into a media file for file playback (F) or a sequence of an initialization segment and media segments for streaming (Fs), according to a particular media container file format. In this specification, the media container file format is the ISO base media file format. The file encapsulator 124 also includes metadata into the file or the segments, such as projection and region-wise packing information assisting in rendering the decoded packed pictures. [0074] The metadata in the file may include: - the projection format of the projected picture, - fisheye video parameters, - the area of the spherical surface covered by the packed picture, - the orientation of the projection structure corresponding to the projected picture relative to the global coordinate axes, - region-wise packing information, and - region-wise quality ranking (optional). [0075] Region-wise packing information may be encoded as metadata in or along the bitstream, for example as region-wise packing SEI message(s) and/or as region-wise packing boxes in a file containing the bitstream. For example, the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed picture format, e.g. from a projected picture to a packed picture, as described earlier. The region-wise mapping information may for example comprise for each mapped region a source rectangle (a.k.a. projected region) in the projected picture and a destination rectangle (a.k.a. packed region) in the packed picture, where samples within the source rectangle are mapped to the destination rectangle and rectangles may for example be indicated by the locations of the top-left corner and the bottom-right corner. The mapping may comprise resampling.
Additionally or alternatively, the packing information may comprise one or more of the following: the orientation of the three-dimensional projection structure relative to a coordinate system, an indication of which projection format is used, region-wise quality ranking indicating the picture quality ranking between regions and/or first and second spatial region sequences, one or more transformation operations, such as rotation by 90, 180, or 270 degrees, horizontal mirroring, and vertical mirroring. The semantics of packing information may be specified in a manner that they are indicative for each sample location within packed regions of a decoded picture which is the respective spherical coordinate location. [0076] The segments (Fs) may be delivered 125 using a delivery mechanism to a player. [0077] The file that the file encapsulator outputs (F) is identical to the file that the file decapsulator inputs (F'). A file decapsulator 126 processes the file (F') or the received segments (F's) and extracts the coded bitstreams (E'a, E'v, and/or E'i) and parses the metadata. The audio, video, and/or images are then decoded 128 into decoded signals (B'a for audio, and D' for images/video). The decoded packed pictures (D') are projected 129 onto the screen of a head-mounted display or any other display device 130 based on the current viewing orientation or viewport and the projection, spherical coverage, projection structure orientation, and region-wise packing metadata parsed from the file. Likewise, decoded audio (B'a) is rendered 129, e.g. through headphones 131, according to the current viewing orientation. The current viewing orientation is determined by the head tracking and possibly also eye tracking functionality 127. Besides being used by the renderer 129 to render the appropriate part of decoded video and audio signals, the current viewing orientation may also be used by the video and audio decoders 128 for decoding optimization.
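To illustrate how a player might use the region-wise packing metadata during rendering, the following Python sketch maps a sample location inside a packed region back to the corresponding location in the projected picture. It is a simplified sketch under stated assumptions: the function name packed_to_projected, the (left, top, width, height) rectangle layout, the transform labels, and the restriction to no transform, horizontal mirroring, and rotation by 180 degrees are choices made for this example and do not reproduce the normative process of any specification.

```python
def packed_to_projected(x, y, packed_rect, projected_rect, transform="none"):
    """Return the projected-picture location of packed sample (x, y).

    Rectangles are (left, top, width, height) in samples (hypothetical
    layout). `transform` names the operation that was applied when going
    from the projected region to the packed region: "none", "mirror_h",
    or "rot180". Sample centres are taken at +0.5 within a sample.
    """
    px, py, pw, ph = packed_rect
    qx, qy, qw, qh = projected_rect

    # Normalised position of the sample centre inside the packed region
    a = (x - px + 0.5) / pw
    b = (y - py + 0.5) / ph

    # Undo the transform; both supported operations are their own inverse
    if transform == "mirror_h":
        a = 1.0 - a
    elif transform == "rot180":
        a, b = 1.0 - a, 1.0 - b
    elif transform != "none":
        raise ValueError("unsupported transform in this sketch")

    # Scale to the size of the projected region (implicit resampling)
    return (qx + a * qw, qy + b * qh)


# A packed region that is half as wide as its projected region
packed_rect = (4, 0, 2, 4)
projected_rect = (4, 0, 4, 4)
plain = packed_to_projected(4, 0, packed_rect, projected_rect)
mirrored = packed_to_projected(4, 0, packed_rect, projected_rect, "mirror_h")
```

The resulting projected-picture location could then be converted to a spherical coordinate using the projection format in use, for example the equirectangular relationship described earlier.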
[0078] The process described above is applicable to both live and on-demand use cases. [0079] At any point of time, a video rendered by an application on a HMD or on another display device renders a portion of the 360-degree video. This portion may be defined as a viewport. A viewport may be understood as a window on the 360-degree world represented in the omnidirectional video displayed via a rendering display. According to another definition, a viewport may be defined as a part of the spherical video that is currently displayed. A viewport may be characterized by horizontal and vertical field of views (FOV or FoV). [0080] A viewpoint may be defined as the point or space from which the user views the scene; it usually corresponds to a camera position. Slight head motion does not imply a different viewpoint. A viewing position may be defined as the position within a viewing space from which the user views the scene. A viewing space may be defined as a 3D space of viewing positions within which rendering of image and video is enabled and VR experience is valid. [0081] An omnidirectional image (360-degree video) may be divided into several regions called tiles. The tiles may have been encoded as motion constrained tiles with different quality/resolution. A client apparatus may request the regions/tiles corresponding to a current viewport of the user with high resolution/quality. As used herein the term omnidirectional may refer to media content that has greater spatial extent than a field-of-view of a device rendering the content. Omnidirectional content may for example cover substantially 360 degrees in horizontal dimension and substantially 180 degrees in vertical dimension, but omnidirectional may also refer to content covering less than 360 degree view in horizontal direction and/or 180 degree view in vertical direction. [0082] The client (e.g. 
the player) may request the whole 360-degree video/image either with uniform quality, which means a viewport independent delivery, or such that the quality of the video/image in a viewport of the user is higher than the quality of the video/image in the non-viewport part of the scene, which means a viewport dependent delivery. [0083] In the viewport-independent streaming the (requested) 360-degree video may be encoded at different bitrates. Each encoded bitstream may be stored with, for example, ISOBMFF and then segmented based on MPEG-DASH. The whole 360-degree video may be delivered to the client/player uniformly at the same quality. [0084] In the viewport-dependent streaming, the (requested) 360-degree video may be divided into several regions/tiles and encoded as, for example, motion constrained tiles. Each encoded tiled bitstream may be stored with, for example, ISOBMFF and then segmented based on MPEG-DASH. The regions/tiles corresponding to the user's viewport may be delivered at high quality/resolution, whereas other parts of 360-degree video which are not within the user's viewport may be delivered at a lower quality/resolution. [0085] A tile track may be defined as a track that contains sequences of one or more motion-constrained tile sets of a coded bitstream. Decoding of a tile track without the other tile tracks of the bitstream may require a specialized decoder, which may be e.g. required to skip absent tiles in the decoding process. An HEVC tile track specified in ISO/IEC 14496-15 enables storage of one or more temporal motion-constrained tile sets as a track. When a tile track contains tiles of an HEVC base layer, the sample entry type 'hvt1' is used. When a tile track contains tiles of a non-base layer, the sample entry type 'lht1' is used. A sample of a tile track consists of one or more complete tiles in one or more complete slice segments.
A tile track is independent from any other tile track that includes VCL NAL units of the same layer as this tile track. A tile track has a 'tbas' track reference to a tile base track. The tile base track does not include VCL NAL units. A tile base track indicates the tile ordering using a 'sabt' track reference to the tile tracks. An HEVC coded picture corresponding to a sample in the tile base track can be reconstructed by collecting the coded data from the tile-aligned samples of the tracks indicated by the 'sabt' track reference in the order of the track references. [0086] A constructed tile set track is a tile set track, e.g. a track according to ISOBMFF, containing constructors that, when executed, result into a tile set bitstream. [0087] A constructor is a set of instructions that, when executed, results into a valid piece of sample data according to the underlying sample format. [0088] An extractor is a constructor that, when executed, copies the sample data of an indicated byte range of an indicated sample of an indicated track. Inclusion by reference may be defined as an extractor or alike that, when executed, copies the sample data of an indicated byte range of an indicated sample of an indicated track. [0089] A full-picture-compliant tile set {track | bitstream} is a tile set {track | bitstream} that conforms to the full-picture {track | bitstream} format. Here, the notation {optionA | optionB} illustrates alternatives, i.e. either optionA or optionB, which is selected consistently in all selections. A full-picture-compliant tile set track can be played as with any full-picture track using the parsing and decoding process of full-picture tracks. A full-picture-compliant bitstream can be decoded as with any full-picture bitstream using the decoding process of full-picture bitstreams. A full-picture track is a track representing an original bitstream (including all its tiles). 
A tile set bitstream is a bitstream that contains a tile set of an original bitstream but not representing the entire original bitstream. A tile set track is a track representing a tile set of an original bitstream but not representing the entire original bitstream. [0090] A full-picture-compliant tile set track may comprise extractors as defined for HEVC. An extractor may, for example, be an in-line constructor including a slice segment header and a sample constructor extracting coded video data for a tile set from a referenced full-picture track. [0091] An in-line constructor is a constructor that, when executed, returns the sample data that it contains. For example, an in-line constructor may comprise a set of instructions for rewriting a new slice header. The phrase in-line may be used to indicate coded data that is included in the sample of a track. [0092] A full-picture track is a track representing an original bitstream (including all its tiles). [0093] A NAL-unit-like structure refers to a structure with the properties of a NAL unit except that start code emulation prevention is not performed. [0094] A pre-constructed tile set track is a tile set track containing the sample data in-line. [0095] A tile set bitstream is a bitstream that contains a tile set of an original bitstream but not representing the entire original bitstream. [0096] A tile set track is a track representing a tile set of an original bitstream but not representing the entire original bitstream. [0097] Video codec may comprise an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. A video encoder and/or a video decoder may also be separate from each other, i.e. need not form a codec. Typically, encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). 
A video encoder may be used to encode an image sequence, as defined subsequently, and a video decoder may be used to decode a coded image sequence. A video encoder or an intra coding part of a video encoder or an image encoder may be used to encode an image, and a video decoder or an intra decoding part of a video decoder or an image decoder may be used to decode a coded image. [0098] The phrase along the bitstream (e.g. indicating along the bitstream) may be defined to refer to out-of-band transmission, signalling, or storage in a manner that the out-of-band data is associated with the bitstream. The phrase decoding along the bitstream or the like may refer to decoding the referred out-of-band data (which may be obtained from out-of-band transmission, signalling, or storage) that is associated with the bitstream. For example, an indication along the bitstream may refer to metadata in a container file that encapsulates the bitstream. [0099] Some embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms "data," "content," "information," and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention. 
[0100] As used herein, the term 'circuitry' may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when needed for operation. This definition of 'circuitry' applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term 'circuitry' also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portions of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term 'circuitry' also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device or other computing or network device. [0101] As defined herein, a "computer-readable storage medium," which refers to a physical storage medium (e.g., volatile or non-volatile memory device), may be differentiated from a "computer-readable transmission medium," which refers to an electromagnetic signal. [0102] As described herein, certain example embodiments generally relate to encoding of volumetric video for compression and a definition of metadata structures and compression methods for individual volumetric video components. 
For example, a character or scene captured with a set of depth cameras, or a synthetically modelled and animated 3D scene are examples of 3D content that can be encoded as volumetric video. Approaches for volumetric video compression often include segmenting the 3D content into a set of 2D patches containing attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like, which can then be compressed using a standard 2D video compression format. Thus, attributes, e.g., color and geometry data, can be considered as components of volumetric video. Volumetric video compression is currently being explored and standardized as part of the MPEG-I Point Cloud Compression (PCC) and 3DoF+ efforts, for example. [0103] Some approaches described herein for solving 3DoF+ volumetric video compression rely on 3D scene segmentation to generate views that can be packed into atlases and efficiently encoded using existing 2D compression technologies, such as H.265 or H.264. For the end-user to consume such content, a standard metadata format needs to be defined that efficiently describes information required for view synthesis. Some current V-PCC specifications define similar metadata structures while setting limits to patch packing strategies by defining shared patch layouts for all components (attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like) of volumetric video. 
In some embodiments of the current disclosure, some structures for a metadata format can enable application of different atlas packing methods for different components of 3D video, thus resulting in significantly smaller atlas sizes and overall bitrates. In some embodiments, associated methods can be applied for individual volumetric video components. [0104] V-PCC standardization in MPEG defines many structures that an example embodiment of the disclosed approaches can leverage. However, the separation of patch layouts for attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like is not supported, thus the application of novel packing strategies for individual components is not possible. As described herein, patch packing is a bin packing problem, and many optimizations exist for coming up with an optimal patch layout (e.g., sprite texture packing). According to some embodiments of the current disclosure, different packing methods for each component of volumetric video can be defined that accompany metadata formats that support view reconstruction by client devices. [0105] In some embodiments, a pipeline for 3DoF+ delivery can leverage a level of temporal coherency, which allows for maintaining constant patch layouts for atlases over an entire group of pictures (GoP), typically an intra period. This approach may enable more efficient compression for individual volumetric video components while requiring less frequent metadata updates, among other benefits. In other words, according to some embodiments of the present disclosure, a method can include separating patch layouts of different types of volumetric video component (depth, texture, roughness, normals, etc.). 
In some embodiments, using the structures in the corresponding metadata as means for storing atlas, frame, or tile characteristic information allows for the selection of one or more packing strategies based on the characteristics of those components. Since any particular packing strategy may not always work for all components, for example when the packing strategy is only applicable for a single component, such an approach can pick a suitable, preferable, best, optimal or other such packing strategy for a particular volumetric video section, frame, atlas, and/or tile individually. In some embodiments, such an approach can yield a single set of metadata for a plurality of frames while the video atlas contains per-frame data. As such, in some embodiments, a volumetric video preparation and packing approach can take advantage of different metadata for different tile characteristics (e.g., color tiles, geometry tiles, etc.), and can employ different packing strategies for different characteristics depending on the content. [0106] V-PCC specification is quite flexible and as such its encapsulation in an ISOBMFF can be done in several ways, e.g. using single-track containers or multi-track containers. In the following description of some example embodiments a multi-track ISOBMFF V-PCC container as shown in Fig.6, is envisioned, where V-PCC units in a V-PCC elementary stream are mapped to individual tracks within the container file based on their types. There are two types of tracks in a multi-track ISOBMFF V-PCC container: V-PCC track and V-PCC component track. [0107] The V-PCC track is a track carrying the volumetric visual information in the V-PCC bitstream, which includes atlas sub-bitstream (the patch information, sequence parameter sets, SEI messages). [0108] V-PCC component tracks are restricted video scheme tracks which carry 2D video encoded data for the occupancy map, geometry, and attribute sub-bitstreams of the V-PCC bitstream. 
[0109] Tracks belonging to the same V-PCC sequence are time-aligned. Samples that contribute to the same point cloud frame across the different video-encoded V-PCC component tracks and the V-PCC track shall have the same presentation time. The V-PCC patch parameter sets used for such samples shall have a decoding time equal or prior to the composition time of the point cloud frame. In addition, all tracks belonging to the same V-PCC sequence shall have the same implied or explicit edit lists. [0110] Without wishing to be bound by any particular theory, by defining individual packing strategies for each component of volumetric video, the size of component video streams can further be reduced, thus reducing the overall bandwidth requirements for delivering volumetric video content. In some embodiments, compression methods can be applied that only work for a first attribute, such as a depth component, without adversely affecting a second attribute, such as a color quality. In some embodiments, compression methods can be applied that only work for the color quality without adversely affecting the depth component, and/or for other attributes. An example of such a method would be to down-scale "flat" depth maps while maintaining full resolution color detail. Other methods and approaches are described herein; however, any suitable compression method or combination of compression methods can be applied and is contemplated within the scope of this disclosure. [0111] In some approaches, volumetric video compression can be carried out, generally, in a compression pipeline. In some approaches, individual packing strategies can be applied for different components of the volumetric video in the context of that pipeline. By way of example only, at least some of the tiles comprising an image or the image itself can be packed by way of a first approach into a video stream while the metadata corresponding to the tiles or the image is packed via a second approach into a metadata stream. 
In some embodiments, a group of pictures (GoP) can be split into frames and each frame can be subdivided into tiles based on an attribute such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like. In some embodiments, a portion of the tiles of a frame can be considered static tiles in that the tile characteristic remains unchanged or changes only within a predetermined range or variance between frames within the GoP. In some approaches, tiles that are static between a plurality of the frames within the GoP can be stored as a single instance of the tile and the associated frames to which the single instance corresponds can be stored in the metadata stream. Such approaches may lead to a reduction in computational complexity of encoding/decoding, decreased transmission bandwidth, and decreased storage requirements when deploying the decoded volumetric video for viewing. In some embodiments, particular metadata format structures can be used to support the particular packing methods described herein. [0112] In the following, some background information related to visual volumetric video-based coding (3VC) will be provided. [0113] At the highest level, 3VC metadata is carried in vpcc_units, which consist of header and payload pairs. Table 1 below presents the syntax of the vpcc_unit structure, Table 2 the syntax of the vpcc_unit_header structure, and Table 3 the syntax of vpcc_unit_payload. Table 1: General V-PCC unit syntax
Figure imgf000024_0001
Table 2: V-PCC unit header syntax
Figure imgf000024_0002
Figure imgf000025_0001
Table 3: VPCC unit payload syntax
Figure imgf000025_0002
[0114] 3VC metadata is contained in atlas_sub_bitstream(), which may contain a sequence of NAL units including header and payload data. nal_unit_header() is used to define how to process the payload data. NumBytesInNalUnit specifies the size of the NAL unit in bytes. This value is required for decoding of the NAL unit. Some form of demarcation of NAL unit boundaries is necessary to enable inference of NumBytesInNalUnit. One such demarcation method is specified in Annex C (23090-5) for the sample stream format. [0115] The 3VC atlas coding layer (ACL) is specified to efficiently represent the content of the patch data. The NAL is specified to format that data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex C (23090-5) each NAL unit can be preceded by an additional element that specifies the size of the NAL unit. Table 4: General NAL unit syntax
Figure imgf000026_0001
Table 5: NAL unit header syntax
Figure imgf000026_0002
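The size-prefixed demarcation mentioned in paragraphs [0114] and [0115] can be illustrated with a minimal sketch. It assumes, purely for illustration, that every NAL unit is preceded by a fixed-width big-endian size field; the function name and the `size_len` parameter are hypothetical, and the actual sample stream layout is the one specified in Annex C of 23090-5.

```python
from typing import Iterator


def iter_sample_stream_nal_units(buf: bytes, size_len: int = 4) -> Iterator[bytes]:
    """Yield NAL units from a size-prefixed stream.

    Assumed layout (illustrative only): each NAL unit is preceded by a
    big-endian size field of `size_len` bytes giving NumBytesInNalUnit.
    """
    pos = 0
    while pos < len(buf):
        if pos + size_len > len(buf):
            raise ValueError("truncated size field")
        num_bytes_in_nal_unit = int.from_bytes(buf[pos:pos + size_len], "big")
        pos += size_len
        if pos + num_bytes_in_nal_unit > len(buf):
            raise ValueError("truncated NAL unit")
        yield buf[pos:pos + num_bytes_in_nal_unit]
        pos += num_bytes_in_nal_unit


# Two NAL units of 2 and 3 bytes, each preceded by a 4-byte size field.
stream = (2).to_bytes(4, "big") + b"\x01\x02" + (3).to_bytes(4, "big") + b"\xaa\xbb\xcc"
units = list(iter_sample_stream_nal_units(stream))
```

Because the size is carried explicitly, the parser needs no start codes to find unit boundaries.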
[0116] In the nal_unit_header() syntax nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 73 of 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of the current version of 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0. [0117] rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows: The RBSP contains a string of data bits (SODB) as follows: • If the SODB is empty (i.e., zero bits in length), the RBSP is also empty. • Otherwise, the RBSP contains the SODB as follows: o The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain. o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows: ▪ The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any). ▪ The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit). ▪ When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e. instances of rbsp_alignment_zero_bit) are present to result in byte alignment. o One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP. [0118] Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. 
These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example of a typical content, there are the following attributes: atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to a sequence of 3VC frames; atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to a specific frame. Can be applied for a sequence of frames as well; sei_rbsp( ), used to carry SEI messages in NAL units; and atlas_tile_group_layer_rbsp( ), used to carry patch layout information for tile groups. [0119] When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP. [0120] The following Tables 6 to 10 describe some of the most relevant RBSP syntaxes. Table 6: Atlas tile group layer RBSP syntax
Figure imgf000027_0001
Table 7: Atlas tile group header syntax
Figure imgf000027_0002
Figure imgf000028_0001
Table 8: General atlas tile group data unit syntax
Figure imgf000028_0002
Table 9: Patch information data syntax
Figure imgf000028_0003
Figure imgf000029_0001
Table 10: Patch data unit syntax
Figure imgf000029_0002
Figure imgf000030_0001
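The relationship between the SODB and the RBSP described in paragraphs [0117] and [0119] can be sketched as follows. This is a minimal illustration that represents bits as a string of '0' and '1' characters for readability; the function names are hypothetical and a real decoder would operate on a bit reader instead.

```python
def sodb_to_rbsp(sodb_bits: str) -> bytes:
    """Append rbsp_stop_one_bit and alignment zero bits to an SODB."""
    if not sodb_bits:
        return b""  # an empty SODB gives an empty RBSP
    bits = sodb_bits + "1"            # rbsp_stop_one_bit
    bits += "0" * (-len(bits) % 8)    # rbsp_alignment_zero_bit(s)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def rbsp_to_sodb(rbsp: bytes) -> str:
    """Recover the SODB by discarding the last 1 bit and all bits after it."""
    if not rbsp:
        return ""
    bits = "".join(f"{byte:08b}" for byte in rbsp)
    stop = bits.rfind("1")            # position of rbsp_stop_one_bit
    if stop < 0:
        raise ValueError("no rbsp_stop_one_bit found")
    return bits[:stop]


rbsp = sodb_to_rbsp("10110")          # 5 payload bits + stop bit + 2 zero bits
recovered = rbsp_to_sodb(rbsp)
```

Searching for the last 1 bit also skips any trailing zero-valued words that follow the trailing bits, since those contain no 1 bits.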
[0121] Annex F of the 3VC V-PCC specification (23090-5) describes different SEI messages that have been defined for 3VC MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. 3VC SEI messages are signalled in sei_rbsp(), which is documented in Table 11 below. Table 11: Patch data unit syntax
Figure imgf000030_0002
[0122] Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance. [0123] Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to a hypothetical reference decoder, HRD) by other means not specified in 3VC V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted. [0124] Essential SEI messages are an integral part of the V-PCC bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types: Type-A essential SEI messages and Type-B essential SEI messages. [0125] Type-A essential SEI messages contain information required to check bitstream conformance and for output timing decoder conformance. Every V-PCC decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance. [0126] V-PCC decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes. [0127] In the following, some embodiments regarding signalling related to metadata for separating patch layout for different video encoded components using V-PCC compatible design will be explained. 
One topic of this idea is to enable signalling and storage of different patch layouts for different video encoded components for volumetric video. In some embodiments, designs aim to maintain compatibility within 3VC with regards to the V-PCC and MIV design, but ideas outside of this limitation are also explored. [0128] In accordance with an embodiment, a new 3VC SEI message may be used to signal separate atlas settings for different encoded video components. Such a new SEI message may require reserving a new id in 3VC specification. The SEI message may be signalled in an atlas_sub_bitstream() which contains NAL units that describe atlas related metadata. The SEI message may be inserted in the bitstream to signal that specific NAL units should only be applied on a specific component of the volumetric video stream. The signalling may be done flexibly before or after any relevant NAL units using SEI prefix and suffix functionality as described in 23090-5. [0129] This design introduces minimal changes to 3VC bitstream and focuses on reusing syntax elements which have been designed to describe shared patch layout. The decoder should be able to interpret this SEI message, if it is present in the bitstream, because rendering of 3VC content may fail otherwise as incorrect patch data structure is expected. Processing-wise the parser will parse atlas_sub_bitstream() as usual and process NAL units. Only when it encounters the new SEI message will it apply following or preceding NAL unit(s) only to a specific component or attribute of the atlas. [0130] In one embodiment the SEI message may be used to apply to atlas_sequence_parameter_set_rbsp( ), atlas_frame_parameter_set_rbsp() or other NAL unit level parameter to enable signalling of different size video components or other atlas sequence and frame level settings. 
[0131] In another embodiment the SEI message may be applied to other NAL units like atlas_tile_group_layer_rbsp() to signal different atlas patch layouts per video encoded component or attribute id. [0132] In another embodiment the SEI message may be applied to other SEI messages found in atlas_sub_bitstream() to signal application of other SEI messages per video encoded component or attribute id. [0133] In another embodiment the SEI message may be used for any other type of NAL units found in atlas_sub_bitstream() to signal different values per video encoded component or attribute type. [0134] The presence of this new SEI message allows signalling shared patch layouts by default. In a bitstream NAL units for shared components may be signalled simply by not using the new SEI message. [0135] In accordance with an embodiment, by default these settings are applied to all video encoded components and only if the new SEI message is encountered the settings are applied to the specific component as indicated by the SEI message itself. [0136] Structurally the new SEI message may be described as described by the following Table 12, in accordance with an embodiment:
Figure imgf000032_0001
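The parsing behaviour described for the new SEI message can be sketched as a small stateful pass over the NAL units of atlas_sub_bitstream(). The sketch below is illustrative only: the NalUnit class, the "SEPARATE_ATLAS_SEI" label and the function name are hypothetical, only the prefix case (the SEI precedes the NAL unit it applies to) is handled, and it assumes that one SEI message scopes exactly one following NAL unit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ATTRIBUTE = 4  # component_type value for which attribute_index is processed

# None means "applies to all components" (the shared, default layout).
Target = Optional[Tuple[int, Optional[int]]]


@dataclass
class NalUnit:
    kind: str                               # e.g. "ASPS", "TILE_GROUP", "SEPARATE_ATLAS_SEI"
    component_type: Optional[int] = None    # only meaningful for the SEI
    attribute_index: Optional[int] = None   # only meaningful for the SEI


def assign_targets(units: List[NalUnit]) -> List[Tuple[str, Target]]:
    """Pair each non-SEI NAL unit with the component it applies to."""
    result: List[Tuple[str, Target]] = []
    pending: Target = None
    for unit in units:
        if unit.kind == "SEPARATE_ATLAS_SEI":
            # attribute_index is only taken into account for attribute components.
            index = unit.attribute_index if unit.component_type == ATTRIBUTE else None
            pending = (unit.component_type, index)
            continue
        result.append((unit.kind, pending))
        pending = None  # assumption: the SEI scopes a single following NAL unit
    return result


targets = assign_targets([
    NalUnit("ASPS"),
    NalUnit("SEPARATE_ATLAS_SEI", component_type=4, attribute_index=0),
    NalUnit("TILE_GROUP"),
    NalUnit("TILE_GROUP"),
])
```

NAL units that are not preceded by the SEI message keep the default target, which corresponds to the shared patch layout.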
[0137] The attribute component_type signals which atlas component the following or preceding NAL units should be applied to. Only values greater than 1 may be considered valid. [0138] The attribute attribute_index should only be processed if the attribute component_type equals 4. This attribute allows signalling of different patch layouts for different attribute types. However, it should be noted that also other component_type value than 4 may be used to indicate that the attribute_index should be processed. [0139] From file encapsulation point of view the signalling does not need to change and all metadata may be stored. The design will maintain compatibility with a design where atlas_sub_bitstream() is stored in a single track per atlas. [0140] In another embodiment, the separate_atlas_component() SEI message may be stored as sample auxiliary information in the ISOBMFF storage structure and be associated with the samples that are relevant to the applied atlas component. Such an approach may assure that the SEI message is distributed with the samples and such SEI messages may be inserted at the beginning of the relevant sample media data, before being fed to the V-PCC decoder. [0141] In the following, some examples of signalling using NAL units will be described. [0142] In one embodiment, the component type and attribute id may be included in a NAL unit header. One benefit of signalling component type and attribute id inside NAL unit header is that it maintains ability to store layout data of different components inside the same atlas_sub_bitstream() which may be stored in a single ISOBMFF track. From storage point of view the encapsulation of said track does not have to change. [0143] Processing of atlas_sub_bitstream() will follow principles as explained for SEI message above. A default value for the component_type attribute in a NAL unit header may be used to signal that the NAL unit payload is applicable to all atlas components. 
A default value may be used to signal shared atlas payloads, for example depth and occupancy components. A NAL unit with component_type equal to 4 (or another predetermined value) may be signalled in addition to provide different patch layout for texture component. Different settings for different attributes may be signalled using attribute_id field in NAL unit header. [0144] Structurally the NAL unit header may have the following changes, highlighted with a grey background in Table 13 (as well as in the other succeeding Tables): Table 13
Figure imgf000033_0001
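When the component type and attribute index are carried in the NAL unit header itself, the applicability check becomes a stateless predicate per NAL unit. The following sketch is one possible reading: the value 0 is taken to mean all components and the value 4 to mean an attribute component, as stated in the text, while the value 3 used for geometry in the usage lines is an assumption and the function name is hypothetical.

```python
from typing import Optional

ALL_COMPONENTS = 0  # NAL unit payload applies to every atlas component
ATTRIBUTE = 4       # value for which nal_attribute_index is processed
GEOMETRY = 3        # assumed value, used only in the usage lines below


def nal_applies_to(nal_component_type: int,
                   nal_attribute_index: int,
                   component_type: int,
                   attribute_index: Optional[int] = None) -> bool:
    """Return True if a NAL unit with the given header fields applies to
    the requested component (and attribute index, where relevant)."""
    if nal_component_type == ALL_COMPONENTS:
        return True
    if nal_component_type != component_type:
        return False
    if component_type == ATTRIBUTE:
        return nal_attribute_index == attribute_index
    return True


shared_ok = nal_applies_to(ALL_COMPONENTS, 0, GEOMETRY)
texture_ok = nal_applies_to(ATTRIBUTE, 1, ATTRIBUTE, attribute_index=1)
texture_other = nal_applies_to(ATTRIBUTE, 1, ATTRIBUTE, attribute_index=0)
```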
[0145] In this structure, the attributes have the following specification: [0146] nal_component_type shall signal the atlas component to which the NAL payloads should be applied. Only values greater than 1 may be considered valid. In accordance with an embodiment, signalling component_type with value 0 indicates that the NAL unit payload will be applicable to all components. [0147] nal_attribute_index shall only be processed if the component_type equals 4 (or another predetermined value, as was mentioned above). This attribute allows signalling of different patch layouts for different attribute types. [0148] In the following, signalling different component layouts using the syntax structure vpcc_unit_header() will be described. [0149] In one embodiment, component type and attribute id may be signalled in vpcc_unit_header, which allows storing separate metadata tracks for each video encoded component and attribute. The vpcc_unit_header() definition in the current 3VC specification does not support signalling different metadata for different component or attribute types. [0150] In one embodiment, vpcc units with different headers for metadata (vuh_unit_type == VPCC_AD) may be stored in the same track. This may require changes in the current 23090-10 specification. Such changes may include storing a stream of vpcc units in track samples instead of atlas_sub_bitstream() as currently specified. [0151] Table 14 below illustrates the changes to the vpcc_unit_header() structure, highlighted with a grey background. Table 14
Figure imgf000034_0001
[0152] In this structure, the attributes have the following specification: [0153] vuh_component_type shall signal the atlas component to which the unit payload should be applied to. Only values greater than 1 may be considered valid. Signalling component_type with value 0 indicates that the vpcc unit payload will be applicable to all components. [0154] nal_attribute_index shall only be processed if component_type equals 4. This attribute allows signalling of different patch layouts for different attribute types. [0155] In the following, signalling mapping of atlas layer to video component or group of video components will be described, in accordance with an embodiment. [0156] The value of nal_layer_id of an atlas NAL unit should be in the range of 0 to 62, inclusive. However, it should be noted that also other valid value range may be defined within which the value of nal_layer_id of an atlas NAL unit should be. [0157] From file encapsulation point of view the signalling does not need to change and all metadata may be stored as described in 23090-10. The design will maintain compatibility with a design where atlas_sub_bitstream() is stored in single track per atlas. [0158] In one embodiment an atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id. The mapping between the video components and nuh_layer_id is provided in V-PCC parameter set. The following Table 15 illustrates an example syntax for the signalling of mapping of atlas layer to video component or group of video components. Table 15
Figure imgf000035_0001
[0159] vps_atlas_count_minus1 is a syntax element defined in vpcc_parameter_set( ). [0160] alm_occupancy_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy of atlas with index j. [0161] alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry of atlas with index j. [0162] ai_attribute_count is a syntax element defined in attribute_information(). [0163] alm_attribute_to_atlas_layer_id[ j ][ i ] indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute with index i of atlas with index j. Table 16
Figure imgf000036_0001
Figure imgf000037_0001
Figure imgf000037_0004
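The atlas layer mapping semantics of paragraphs [0160] to [0163] amount to a lookup from an atlas index, a component and (for attributes) an attribute index to a nuh_layer_id, followed by a filter over the atlas NAL units. The sketch below is a minimal illustration; the class, the string component labels and the representation of NAL units as (layer id, payload) pairs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

OCCUPANCY, GEOMETRY, ATTRIBUTE = "occupancy", "geometry", "attribute"


@dataclass
class AtlasLayerMapping:
    occupancy_to_layer: List[int]        # alm_occupancy_to_atlas_layer_id[ j ]
    geometry_to_layer: List[int]         # alm_geometry_to_atlas_layer_id[ j ]
    attribute_to_layer: List[List[int]]  # alm_attribute_to_atlas_layer_id[ j ][ i ]

    def layer_for(self, atlas: int, component: str, attribute: int = 0) -> int:
        """Return the atlas layer id carrying patch information for a component."""
        if component == OCCUPANCY:
            return self.occupancy_to_layer[atlas]
        if component == GEOMETRY:
            return self.geometry_to_layer[atlas]
        if component == ATTRIBUTE:
            return self.attribute_to_layer[atlas][attribute]
        raise ValueError(f"unknown component: {component}")


def patch_nal_units_for(nal_units: List[Tuple[int, bytes]],
                        mapping: AtlasLayerMapping,
                        atlas: int, component: str, attribute: int = 0) -> List[bytes]:
    """Select the payloads of the NAL units whose layer id matches the mapping."""
    layer = mapping.layer_for(atlas, component, attribute)
    return [payload for layer_id, payload in nal_units if layer_id == layer]


# One atlas: occupancy and geometry share layer 0, both attributes use layer 1.
mapping = AtlasLayerMapping([0], [0], [[1, 1]])
nal_units = [(0, b"shared-layout"), (1, b"texture-layout")]
geometry_units = patch_nal_units_for(nal_units, mapping, 0, GEOMETRY)
texture_units = patch_nal_units_for(nal_units, mapping, 0, ATTRIBUTE, 1)
```

Mapping several components to the same layer id expresses a group of video components that share one patch layout.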
[0164] vps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported. [0165] In one embodiment atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id. The mapping between the video components and nuh_layer_id is provided in atlas sequence parameter set. Table 17
Figure imgf000037_0002
[0166] alm_occupancy_to_atlas_layer_id indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy video sub-bitstream. [0167] alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry video sub-bitstream. [0168] ai_attribute_count is a syntax element defined in attribute_information() for each atlas. [0169] alm_attribute_to_atlas_layer_id[ i ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute video sub-bitstream with index i of this atlas. Table 18
Figure imgf000037_0003
Figure imgf000038_0001
[0170] asps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported. [0171] In one embodiment atlas NAL sub-bitstream contains more than one layer and each layer is identified by a nuh_layer_id. The mapping between the video components and nuh_layer_id is provided in an atlas frame parameter set as is illustrated in the following Table 19, in accordance with an embodiment. Table 19
Figure imgf000039_0001
[0172] alm_occupancy_to_atlas_layer_id indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an occupancy video sub-bitstream. [0173] alm_geometry_to_atlas_layer_id[ j ] indicates atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for a geometry video sub-bitstream. [0174] ai_attribute_count is a syntax element defined in attribute_information() for each atlas. [0175] alm_attribute_to_atlas_layer_id[ i ] indicates an atlas layer id (i.e. nuh_layer_id of atlas NAL unit header) in which patch information is carried for an attribute video sub-bitstream with index i of this atlas. Table 20
Figure imgf000039_0002
Figure imgf000040_0001
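The atlas_layer_mapping() semantics described above can be sketched as a small lookup structure that returns, for each video component, the nuh_layer_id of the atlas layer carrying its patch information. This is only an illustration under stated assumptions: the class name AtlasLayerMapping, the method layer_for and the example values are hypothetical, and the meaning of the geometry index j, which the text does not define, is treated as an opaque index. Only the alm_* element names and ai_attribute_count are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AtlasLayerMapping:
    """Illustrative container for the alm_* syntax elements (not a bitstream parser)."""

    alm_occupancy_to_atlas_layer_id: int
    alm_geometry_to_atlas_layer_id: List[int] = field(default_factory=list)
    alm_attribute_to_atlas_layer_id: List[int] = field(default_factory=list)

    @property
    def ai_attribute_count(self) -> int:
        # In the text this value comes from attribute_information(); here it is
        # simply derived from the list length for illustration.
        return len(self.alm_attribute_to_atlas_layer_id)

    def layer_for(self, component: str, index: int = 0) -> int:
        """Return the nuh_layer_id of the atlas layer holding patch data for a component."""
        if component == "occupancy":
            return self.alm_occupancy_to_atlas_layer_id
        if component == "geometry":
            return self.alm_geometry_to_atlas_layer_id[index]
        if component == "attribute":
            return self.alm_attribute_to_atlas_layer_id[index]
        raise ValueError(f"unknown component type: {component}")


# Hypothetical example: occupancy and geometry share atlas layer 0,
# while two attributes use their own layers 1 and 2.
mapping = AtlasLayerMapping(
    alm_occupancy_to_atlas_layer_id=0,
    alm_geometry_to_atlas_layer_id=[0],
    alm_attribute_to_atlas_layer_id=[1, 2],
)
print(mapping.layer_for("attribute", 1))
```

A decoder-side component could use such a lookup to select which atlas layer to parse before depacking a given video component.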
[0176] afps_flexible_atlas_packing_flag indicates that flexible packing per each video component or group of components is supported. [0177] In another embodiment vpcc units with different headers for metadata (vuh_unit_type == VPCC_AD) may be stored in separate tracks. In this case the vpcc_header for each component or attribute may be stored in a sample entry of said track and sub_atlas_bitstream() of said units may be stored in samples of individual tracks. [0178] In another embodiment, atlas_layer_mapping() information may be signalled using a separate timed metadata track where samples of this timed metadata track are aligned with the related video tracks. Such a timed metadata track may be associated with the relevant 3VC media tracks’ DASH representations via the @associationId and @associationType attributes. [0179] Referring now to Fig.4, the basic compression of a volumetric video scene is illustrated as a video compression pipeline 100. Generally, each frame of an input 3D scene 101 can be processed separately, and the resulting per-frame atlas and metadata are then stored into separate video and metadata streams, respectively. In some embodiments, the frames of the input 3D scene 101 can be processed using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0180] In some embodiments, the input 3D scene 101 is converted, at Input Conversion 102, into a canonical representation for processing. According to some embodiments, each frame of the input 3D scene 101 is converted at Input Conversion 102 into a collection of 3D samples of a scene geometry, at a specified internal processing resolution. Depending on the input 3D scene 101, this may involve, e.g., voxelizing a mesh model, or down-sampling a high resolution point cloud with very fine details into the processing resolution.
In some embodiments, the internal representation resulting from the Input Conversion 102 is a point cloud representing some or all aspects of the 3D input scene 101. By way of example only, the aspects of the 3D input scene 101 can include but are not limited to attributes such as one of a color attribute, a depth attribute, a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like of the 3D scene 101. In some embodiments, the input 3D scene 101 can be converted into, for example, a canonical representation using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0181] In some embodiments, a View Optimizer 103 creates, from the internal point cloud format resulting from the Input Conversion 102, a segmentation of the 3D scene 101 optimized for a specified viewing constraint (e.g., the viewing volume). In some embodiments, the View Optimizer 103 process can involve creating view-tiles that have sufficient coverage and resolution for representing the original input 3D scene 101 while incurring a minimal quality degradation within the given viewing constraints. In some embodiments, the View Optimizer 103 can make use of at least a 3D position of points in the internal point cloud of the 3D scene 101. In some embodiments, additional attributes such as a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like may also or alternatively be considered. 
In some embodiments, the View Optimizer 103 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0182] In some embodiments, as the tiles are defined for each frame of the GoF, View-tile Metadata 104 can be defined that describes each tile of the frame (e.g., tile geometry, material, color, depth, etc.). In some embodiments, the resulting view-tiles can then be pre-rendered in a View-tile Rendering 105. In some embodiments, View-tile Rendering 105 can include resampling the input point cloud into one or more 2D tile projections, and/or calling an external renderer, e.g. a path tracing renderer, to render views of the original input 3D scene 101. In some embodiments, the tiles can be defined, characterized, and/or converted to metadata using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0183] In other words, according to some embodiments of the present disclosure, separate layouts of different types of volumetric video component atlases (depth, texture, roughness, normals, etc.) can be generated for a GoF, one or more frames of the GoF, and/or tiles of a particular frame. In some embodiments, using the structures in the corresponding metadata as means for storing atlas, frame, or tile characteristic information allows for the selection of one or more packing strategies based on the characteristics of those components. Since any particular packing strategy may not always work for all components, for example when the packing strategy is only applicable for a single component, such an approach can pick a suitable, preferable, best, optimal or other such packing strategy for a particular volumetric video section, frame, atlas, and/or tile individually.
In some embodiments, such an approach can yield a single set of metadata for a plurality of frames while the video atlas contains per-frame data. As such, in some embodiments, a volumetric video preparation and packing approach can take advantage of different metadata for different tile characteristics (e.g., color tiles, geometry tiles, etc.), and can employ different packing strategies for different characteristics depending on the content. [0184] In some embodiments, the rendered tiles can then be input into an Atlas Packer 106. In some embodiments, the Atlas Packer 106 can produce an optimal 2D layout of the rendered view-tiles. In some embodiments, the Atlas Packer 106 can pack the pre-rendered tiles into video frames. In some embodiments, additional metadata may be required to unpack and re-render the packed tiles. In some embodiments, when such metadata is required to unpack and re-render the packed tiles, such additional metadata can be generated by the Atlas Packer 106. In some embodiments, the Atlas Packer 106 can carry out alternative or additional processing procedures such as down-sampling of certain tiles, re-fragmentation of tiles, padding, dilation and the like. In some embodiments, the Atlas Packer 106 can be configured to pack the scene into an atlas format that minimizes unused pixels. In some embodiments, the Atlas Packer 106 can provide guards for artifacts that might occur in a compression stage. In some embodiments, the packed atlases can then be piped to Video Compression 107 to generate a final compressed representation of the 3D scene 101. In some embodiments, the final compressed representation of the 3D scene 101 can include compressed view-tiles 108 and corresponding view-tile metadata 104.
In some embodiments, the Atlas Packer 106 and/or the Video Compression 107 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0185] In some embodiments, after content pre-processing of the 3D scene 101 into the compressed view-tiles 108 and corresponding view-tile metadata 104, the pipeline 100 can include processes for content delivery and view (e.g., real-time viewing). In some embodiments, in order to produce the tiles in a proper format for viewing, the compressed video frames (compressed view-tiles 108) and the view-tile metadata 104 can be used for View Synthesis 109 of novel views of the 3D scene 101. In some embodiments, the view-tile metadata 104 can contain some or all of the necessary information for View Synthesis 109 (a view synthesizer) to employ any suitable rendering method or combination of rendering methods, such as point cloud rendering, mesh rendering, or ray-casting, to reconstruct a view of the scene from any given 3D viewpoint (assuming the originally specified viewing constraints). In some embodiments, by processing the 3D scene 101 according to the pipeline 100 illustrated in Fig.4, real-time 3D viewing 110 of the volumetric video can be achieved. In some embodiments, the View Synthesis 109 can be fully or partially instantiated using any suitable means, apparatus, or device, such as a codec or processing circuitry (such as discussed below with regard to Fig.7). [0186] In some embodiments, the Atlas Packer 106 receives as an input at least a list of views and pre-rendered tiles representing those views for at least depth and color components. However, the Atlas Packer 106 is not limited to color and/or depth components only.
Rather, other volumetric video components such as a geometry attribute, a reflectance attribute, a roughness attribute, a transparency attribute, a metalness attribute, a specularity attribute, a surface normals attribute, a material attribute of the volumetric video scene, and the like can be processed using the same or a similar approach. In some embodiments designated GoP by GoP, the Atlas Packer 106 can process some or all components of the volumetric video in parallel, leveraging dependencies between them and outputting one or more individual atlases for each component packed with pre-rendered tiles. [0187] In some embodiments, while the Atlas Packer 106 may leverage dependencies between different volumetric video components, it may also apply compression methods to each component individually, resulting in different patch layouts and sizes for each component. In some embodiments, therefore, different packing methods can be applied for different components individually or to all or several components separately. [0188] In some embodiments, an apparatus can be configured to carry out some or all portions of any of the methods described herein. The apparatus may be embodied by any of a wide variety of devices including, for example, a video codec. A video codec includes an encoder that transforms input video into a compressed representation suited for storage and/or transmission and/or a decoder that can decompress the compressed video representation so as to result in a viewable form of a video. Typically, the encoder discards some information from the original video sequence so as to represent the video in a more compact form, such as at a lower bit rate.
As an alternative to a video codec, the apparatus may, instead, be embodied by any of a wide variety of computing devices including, for example, a video encoder, a video decoder, a computer workstation, a server or the like, or by any of various mobile computing devices, such as a mobile terminal, e.g., a smartphone, a tablet computer, a video game player, etc. Alternatively, the apparatus may be embodied by an image capture system configured to capture the images that comprise the volumetric video data. [0189] Regardless of the video codec or other type of computing device that embodies the apparatus, the apparatus 10 of an example embodiment is depicted in Fig.7 and includes, is associated with, or is otherwise in communication with processing circuitry 12, a memory 14 and a communication interface 16. The processing circuitry may be in communication with the memory device via a bus for passing information among components of the apparatus. The memory device may be non-transitory and may include, for example, one or more volatile and/or non-volatile memories. In other words, for example, the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry). The memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present disclosure. For example, the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry. [0190] The apparatus 10 may, in some embodiments, be embodied in various computing devices as described above. 
However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single “system on a chip.” As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein. [0191] The processing circuitry 12 may be embodied in a number of different ways. For example, the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processing circuitry may include one or more processing cores configured to perform independently. A multi-core processing circuitry may enable multiprocessing within a single physical package. Additionally or alternatively, the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading. [0192] In an example embodiment, the processing circuitry 12 may be configured to execute instructions stored in the memory device 14 or otherwise accessible to the processing circuitry.
Alternatively or additionally, the processing circuitry may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present disclosure while configured accordingly. Thus, for example, when the processing circuitry is embodied as an ASIC, FPGA or the like, the processing circuitry may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processing circuitry is embodied as an executor of instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment of the present invention by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein. The processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry. [0193] The communication interface 16 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including visual content in the form of video or image files, one or more audio tracks or the like. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. 
Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms. [0194] An attribute picture may be defined as a picture that comprises additional information related to an associated texture picture. An attribute picture may for example comprise surface normal, opacity, or reflectance information for a texture picture. A geometry picture may be regarded as one type of an attribute picture, although a geometry picture may be treated as its own picture type, separate from an attribute picture. [0195] Texture picture(s) and the respective geometry picture(s), if any, and the respective attribute picture(s) may have the same or different chroma format. [0196] Terms texture image, texture picture and texture component picture may be used interchangeably. Terms geometry image, geometry picture and geometry component picture may be used interchangeably. A specific type of a geometry image is a depth image. Embodiments described in relation to a geometry image equally apply to a depth image, and embodiments described in relation to a depth image equally apply to a geometry image. Terms attribute image, attribute picture and attribute component picture may be used interchangeably. A geometry picture and/or an attribute picture may be treated as an auxiliary picture in video/image encoding and/or decoding. [0197] Figs.2a and 2b illustrate an overview of exemplified compression/ decompression processes of Point Cloud Coding (PCC) according to MPEG standard. 
MPEG Video-Based Point Cloud Coding (V-PCC) (MPEG N18892, a.k.a. ISO/IEC JTC 1/SC 29/WG 11 “V-PCC Codec Description”) discloses a projection-based approach for dynamic point cloud compression. For the sake of illustration, some of the processes related to V-PCC compression/decompression are described briefly herein. For a comprehensive description of the model, reference is made to MPEG N18892. [0198] The process starts with an input frame representing a point cloud frame 200 that is provided for patch generation 202, geometry image generation 206 and texture image generation 208. Each point cloud frame represents a dataset of points within a 3D volumetric space that has unique coordinates and attributes. An example of a point cloud frame is shown in Figure 3a. [0199] The patch generation process 202 decomposes the point cloud frame 200 by converting 3D samples to 2D samples on a given projection plane using a strategy that provides the best compression. The patch generation process 202 aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. [0200] The geometry image generation 206 and the texture image generation 208 are configured to generate geometry images and texture images. The image generation process exploits the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch is projected onto two images, referred to as layers. More precisely, let H(u,v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u,v) with the lowest depth D0.
The second layer, referred to as the far layer, captures the point of H(u,v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness.
[0201] The generated videos have the following characteristics: geometry: WxH YUV420-8bit, where the geometry video is monochromatic, and texture: WxH YUV420-8bit, where the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
[0202] The geometry images and the texture images may be provided to image padding 212. The image padding 212 may also receive as an input an occupancy map (OM) 210 to be used with the geometry images and texture images. The padding process aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. V-PCC uses a simple padding strategy, which proceeds as follows:
• Each block of TxT (e.g., 16x16) pixels is processed independently.
• If the block is empty (i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order.
• If the block is full (i.e., no empty pixels), nothing is done.
• If the block has both empty and filled pixels (i.e. a so-called edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
[0203] The padded geometry images and padded texture images may be provided for video compression 214. The generated images/layers are stored as video frames and compressed using a video codec, such as High Efficiency Video Coding (HEVC) codec. The video compression 214 also generates reconstructed geometry images to be provided for smoothing 216, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 202.
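The near and far layer selection of paragraph [0200] can be sketched as follows. This is a minimal illustration rather than the V-PCC reference implementation: the function name split_layers, the dictionary representation of H(u,v) and the sample depths are assumptions introduced here, and the surface thickness parameter is passed in as an integer.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def split_layers(
    points_per_pixel: Dict[Pixel, List[int]], surface_thickness: int
) -> Tuple[Dict[Pixel, int], Dict[Pixel, int]]:
    """Return (near, far) depth maps for one patch.

    points_per_pixel maps a pixel (u, v) to the depths of all points in H(u, v).
    """
    near: Dict[Pixel, int] = {}
    far: Dict[Pixel, int] = {}
    for pixel, depths in points_per_pixel.items():
        if not depths:
            continue  # nothing projects to this pixel
        d0 = min(depths)  # near layer: lowest depth D0
        # Far layer: highest depth inside [D0, D0 + surface_thickness].
        d1 = max(d for d in depths if d0 <= d <= d0 + surface_thickness)
        near[pixel] = d0
        far[pixel] = d1
    return near, far


# Hypothetical patch: four points fall on pixel (0, 0), one on pixel (1, 0).
h = {(0, 0): [10, 12, 13, 20], (1, 0): [7]}
near, far = split_layers(h, surface_thickness=4)
print(near, far)
```

In this example the point at depth 20 lies outside the thickness interval and is therefore stored in neither layer, which shows why the choice of the surface thickness affects how many points survive projection.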
The smoothed geometry may be provided to texture image generation 208 to adapt the texture images.
[0204] In the auxiliary patch information compression 218, the following metadata is encoded/decoded for every patch:
• Index of the projection plane
  o Index 0 for the normal planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
  o Index 1 for the normal planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
  o Index 2 for the normal planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0).
• 2D bounding box (u0, v0, u1, v1)
• 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bi-tangential shift r0. According to the chosen projection planes, (δ0, s0, r0) are computed as follows:
  o Index 0: δ0 = x0, s0 = z0 and r0 = y0
  o Index 1: δ0 = y0, s0 = z0 and r0 = x0
  o Index 2: δ0 = z0, s0 = x0 and r0 = y0
[0205] Also, mapping information providing for each TxT block its associated patch index is encoded as follows: For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches. The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks. Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
[0206] The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. Herein, one cell of the 2D grid produces a pixel during the image generation process. When considering an occupancy map as an image, it may be considered to comprise occupancy patches.
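The block-to-patch mapping of paragraph [0205] can be illustrated with a short sketch that builds the candidate list L for a block and returns the position J that would be passed to the arithmetic coder. Several details are assumptions made for illustration only: the function names, bounding boxes expressed in block units with exclusive upper bounds, patches numbered from 1 in bounding-box coding order, and the special index 0 placed first in every list (the text says it is added to every list but not where). The arithmetic coder itself is not shown.

```python
from typing import List, Tuple

# (u0, v0, u1, v1) in block units; u1 and v1 are exclusive (assumption).
BBox = Tuple[int, int, int, int]


def candidate_patches(block: Tuple[int, int], bboxes: List[BBox]) -> List[int]:
    """Build L, the ordered list of patch indexes whose bounding box contains the block.

    Index 0 stands for empty space and is placed first in every list (assumption).
    Real patches are numbered from 1 in the order their bounding boxes are coded.
    """
    bu, bv = block
    candidates = [0]
    for patch_index, (u0, v0, u1, v1) in enumerate(bboxes, start=1):
        if u0 <= bu < u1 and v0 <= bv < v1:
            candidates.append(patch_index)
    return candidates


def position_to_encode(
    block: Tuple[int, int], owner_index: int, bboxes: List[BBox]
) -> int:
    """Return J, the position of the owning patch index I within L."""
    return candidate_patches(block, bboxes).index(owner_index)


# Hypothetical layout with three overlapping bounding boxes.
bboxes = [(0, 0, 4, 4), (2, 2, 6, 6), (5, 0, 8, 3)]
print(candidate_patches((3, 3), bboxes))      # candidates for a block covered twice
print(position_to_encode((3, 3), 2, bboxes))  # position of patch 2 in that list
```

Because J is bounded by the length of a usually short candidate list, it takes a smaller range of values than the raw patch index I, which is the intuition behind the compression gain mentioned in the text.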
Occupancy patches may be considered to have block-aligned edges according to the auxiliary information described in the previous section. An occupancy patch hence comprises occupancy information for corresponding texture and geometry patches.
[0207] The occupancy map compression leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks are encoded as follows.
[0208] The occupancy map could be encoded with a precision of B0xB0 blocks. B0 is a user-defined parameter. In order to achieve lossless encoding, B0 should be set to 1. In practice, B0=2 or B0=4 gives visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map. The generated binary image covers only a single colour plane. However, given the prevalence of 4:2:0 codecs, it may be desirable to extend the image with “neutral” or fixed value chroma planes (e.g. adding chroma planes with all sample values equal to 0 or 128, assuming the use of an 8-bit codec).
[0209] The obtained video frame is compressed by using a video codec with lossless coding tool support (e.g., AVC, HEVC RExt, HEVC-SCC).
[0210] The occupancy map is simplified by detecting empty and non-empty blocks of resolution TxT in the occupancy map, and the patch index is encoded only for the non-empty blocks, as follows:
• A list of candidate patches is created for each TxT block by considering all the patches that contain that block.
• The list of candidates is sorted in the reverse order of the patches. For each block:
  o If the list of candidates has one index, then nothing is encoded.
  o Otherwise, the index of the patch in this list is arithmetically encoded.
[0211] The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers.
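The effect of the occupancy precision B0 from paragraph [0208] can be illustrated with a small sketch that reduces a binary occupancy map to one value per B0xB0 block and expands it again. The rule that a block counts as occupied when any of its pixels is occupied is an assumption, since the text does not state how a block value is derived; the function names and the sample map are likewise hypothetical.

```python
from typing import List

Grid = List[List[int]]


def downsample_occupancy(occupancy: Grid, b0: int) -> Grid:
    """Reduce a binary occupancy map to one value per b0 x b0 block.

    A block is marked occupied if any of its pixels is occupied (assumption).
    """
    height, width = len(occupancy), len(occupancy[0])
    coarse: Grid = []
    for v in range(0, height, b0):
        row = []
        for u in range(0, width, b0):
            block = [
                occupancy[y][x]
                for y in range(v, min(v + b0, height))
                for x in range(u, min(u + b0, width))
            ]
            row.append(1 if any(block) else 0)
        coarse.append(row)
    return coarse


def upsample_occupancy(coarse: Grid, b0: int, height: int, width: int) -> Grid:
    """Expand the coarse map back to pixel resolution by replicating each value."""
    return [[coarse[y // b0][x // b0] for x in range(width)] for y in range(height)]


occupancy = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
coarse = downsample_occupancy(occupancy, b0=2)
restored = upsample_occupancy(coarse, 2, 4, 4)
print(coarse)
print(restored)
```

With B0=2 the single occupied pixel in the top-left block spreads to the whole block after expansion, whereas with B0=1 the round trip returns the original map, consistent with the statement that B0 should be set to 1 for lossless encoding.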
The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P could be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:
δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 – u0 + u
r(u, v) = r0 – v0 + v
where g(u, v) is the luma component of the geometry image.
[0212] The smoothing procedure 216 aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors.
[0213] In the texture reconstruction process, the texture values are directly read from the texture images.
[0214] A multiplexer 220 may receive a compressed geometry video and a compressed texture video from the video compression 214, entropy compression 222, and optionally compressed auxiliary patch information from auxiliary patch-info compression 218. The multiplexer 220 uses the received data to produce a compressed bitstream.
[0215] Figure 2b illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 250 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 252. In addition, the de-multiplexer 250 transmits a compressed occupancy map to occupancy map decompression 254. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 256. Decompressed geometry video from the video decompression 252 is delivered to geometry reconstruction 258, as are the decompressed occupancy map and decompressed auxiliary patch information.
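The relations for δ(u, v), s(u, v) and r(u, v) in paragraph [0211], combined with the projection plane index convention of paragraph [0204], can be sketched as a per-pixel reconstruction function. The Patch class, its field names, the function name reconstruct_point and the numeric example are hypothetical; the mapping from (δ, s, r) back to (x, y, z) is simply the inverse of the per-index assignments listed in paragraph [0204].

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Patch:
    plane_index: int  # 0, 1 or 2, as in the projection plane list
    u0: int           # first corner of the 2D bounding box
    v0: int
    delta0: int       # depth of the patch location
    s0: int           # tangential shift
    r0: int           # bi-tangential shift


def reconstruct_point(patch: Patch, u: int, v: int, g: int) -> Tuple[int, int, int]:
    """Return (x, y, z) for pixel (u, v) whose geometry luma sample is g."""
    delta = patch.delta0 + g
    s = patch.s0 - patch.u0 + u
    r = patch.r0 - patch.v0 + v
    if patch.plane_index == 0:  # delta = x, s = z, r = y
        return (delta, r, s)
    if patch.plane_index == 1:  # delta = y, s = z, r = x
        return (r, delta, s)
    if patch.plane_index == 2:  # delta = z, s = x, r = y
        return (s, r, delta)
    raise ValueError("plane_index must be 0, 1 or 2")


# Hypothetical patch projected along the x axis (index 0).
patch = Patch(plane_index=0, u0=16, v0=32, delta0=100, s0=5, r0=7)
print(reconstruct_point(patch, u=18, v=35, g=3))
```

A full reconstruction would apply this function to every pixel that the occupancy map marks as non-empty, once per layer.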
The point cloud geometry reconstruction 258 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images. [0216] The reconstructed geometry image may be provided for smoothing 260, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 262, which also receives a decompressed texture video from video decompression 252. The texture values for the texture reconstruction are directly read from the texture images. The texture reconstruction 262 outputs a reconstructed point cloud for color smoothing 264, which further provides the reconstructed point cloud. [0217] A V-PCC bitstream, containing a coded point cloud sequence (CPCS), is composed of VPCC units carrying V-PCC parameter set (VPS) data, an atlas information bitstream, a 2D video encoded occupancy map bitstream, a 2D video encoded geometry bitstream, and zero or more 2D video encoded attribute bitstreams. A V-PCC bitstream can be stored in an ISOBMFF container according to ISO/IEC 23090-10. Two modes are supported: single-track container and multi-track container. [0218] A single-track container is utilized in the case of simple ISOBMFF encapsulation of a V-PCC encoded bitstream. In this case, a V-PCC bitstream is directly stored as a single track without further processing. A single-track container should use a sample entry type of 'vpe1' or 'vpeg'. [0219] Under the 'vpe1' sample entry, all atlas parameter sets (as defined in ISO/IEC 23090-5) are stored in the setupUnit of the sample entry.
Under the 'vpeg' sample entry, the atlas parameter sets may be present in the setupUnit array of the sample entry, or in the elementary stream. [0220] A multi-track container maps V-PCC units of a V-PCC elementary stream to individual tracks within the container file based on their types. There are two types of tracks in a multi-track container: V-PCC track and V-PCC component track. The V-PCC track is a track carrying the volumetric visual information in the V-PCC bitstream, which includes the atlas sub-bitstream and the atlas sequence parameter sets. V-PCC component tracks are restricted video scheme tracks which carry 2D video encoded data for the occupancy map, geometry, and attribute sub-bitstreams of the V-PCC bitstream. In a multi-track container, the V-PCC track should use a sample entry type of 'vpc1' or 'vpcg'. [0221] Under the 'vpc1' sample entry, all atlas parameter sets (as defined in ISO/IEC 23090-5) shall be in the setupUnit array of the sample entry. Under the 'vpcg' sample entry, the atlas parameter sets may be present in this array, or in the stream. [0222] Consequently, V-PCC provides a procedure for compressing a time-varying volumetric scene/object by projecting 3D surfaces onto a number of pre-defined 2D planes, which may then be compressed using regular 2D video compression algorithms. The projection is presented using different patches, where each set of patches may represent a specific object or specific parts of a scene. In particular, V-PCC provides for efficient delivery of a compressed 3D point cloud object which can be viewed with six degrees of freedom (6DoF). [0223] Some examples and embodiments have been described above with reference to specific syntax and semantics. It needs to be understood that the examples and embodiments similarly apply to any similar syntax structures. For example, examples or embodiments apply similarly when some syntax elements are absent, or the order of syntax elements differs.
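The type-based routing of V-PCC units to tracks in a multi-track container, described in paragraph [0220], can be sketched as follows. This is a simplified illustration and not an ISOBMFF writer. Of the unit type labels used, only VPCC_AD appears in this text; VPCC_VPS, VPCC_OVD, VPCC_GVD and VPCC_AVD are assumed names for the parameter set, occupancy, geometry and attribute unit types. Routing the parameter set data to the V-PCC track, and using one track for all attributes, are also assumptions made to keep the sketch short.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed unit type labels mapped to hypothetical component track names.
COMPONENT_TRACKS = {
    "VPCC_OVD": "occupancy",
    "VPCC_GVD": "geometry",
    "VPCC_AVD": "attribute",
}


def assign_tracks(units: Iterable[Tuple[str, bytes]]) -> Dict[str, List[bytes]]:
    """Group V-PCC unit payloads by the track that would carry them."""
    tracks: Dict[str, List[bytes]] = defaultdict(list)
    for unit_type, payload in units:
        if unit_type in ("VPCC_VPS", "VPCC_AD"):
            # Atlas data goes to the V-PCC track; placing the parameter
            # set data there as well is an assumption of this sketch.
            tracks["v-pcc"].append(payload)
        elif unit_type in COMPONENT_TRACKS:
            tracks[COMPONENT_TRACKS[unit_type]].append(payload)
        else:
            raise ValueError(f"unknown V-PCC unit type: {unit_type}")
    return dict(tracks)


# Hypothetical elementary stream with placeholder payloads.
units = [
    ("VPCC_VPS", b"vps"),
    ("VPCC_AD", b"atlas0"),
    ("VPCC_OVD", b"occ0"),
    ("VPCC_GVD", b"geo0"),
    ("VPCC_AVD", b"attr0"),
    ("VPCC_AD", b"atlas1"),
]
tracks = assign_tracks(units)
print(sorted(tracks))
```

In a single-track container no such routing is needed, since the whole bitstream is stored in one track without further processing.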
[0224] Two or more of the embodiments as described above may be combined, and they may be introduced as one or more indicators in any suitable syntax structure.
[0225] The embodiments relating to the encoding aspects may be implemented in an apparatus comprising means for: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate atlases to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the bitstream, separate atlas settings for the two or more different encoded components.
[0226] The embodiments relating to the encoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a presentation comprising volumetric video content; obtain two or more components of the volumetric video content; pack the two or more components of the volumetric video content into separate atlases to obtain two or more different encoded components of the volumetric video content; and signal, in or along the bitstream, separate atlas settings for the two or more different encoded components.
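As a non-limiting illustration, the encoding steps of paragraphs [0225] and [0226] may be sketched as follows. The AtlasSettings fields, the single-row packing and the function names are assumptions made for this sketch; they are not the syntax of any standard, and a practical encoder would use a real patch packer and a 2D video codec.

```python
from dataclasses import dataclass


@dataclass
class AtlasSettings:
    """Per-component settings to be signaled (hypothetical fields)."""
    component: str
    width: int
    height: int
    packing_method: str


def pack_row(patches):
    """Toy packing: place (width, height) patches side by side in one row."""
    placements, x = [], 0
    for w, h in patches:
        placements.append((x, 0, w, h))  # (x, y, width, height) in the atlas
        x += w
    height = max((h for _, h in patches), default=0)
    return placements, x, height


def encode_components(components):
    """Pack each component into its own atlas and collect separate settings."""
    if len(components) < 2:
        raise ValueError("expected two or more components")
    atlases, settings = {}, []
    for name, patches in components.items():
        placements, width, height = pack_row(patches)
        atlases[name] = placements
        settings.append(AtlasSettings(name, width, height, "row"))
    return atlases, settings
```

Because each component has its own atlas and its own settings record, a different packing method could be selected per component by replacing pack_row for that component.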
[0227] The embodiments relating to the decoding aspects may be implemented in an apparatus comprising means for: receiving a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the bitstream, separate atlas settings for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; depacking the two or more components of the volumetric video content from separate atlases.
[0228] The embodiments relating to the decoding aspects may likewise be implemented in an apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receive, from or along the bitstream, separate atlas settings for the two or more different encoded components; decode, from the bitstream, the two or more components of a volumetric video content; depack the two or more components of the volumetric video content from separate atlases.
[0229] Such apparatuses may comprise e.g. the functional units disclosed in any of the Figures 1a, 2a and 2b for implementing the embodiments.
[0230] Fig. 5 shows a flow chart for signaling overlay content according to an embodiment. In 500 a presentation comprising volumetric video content is obtained. In 502 two or more components of the volumetric video content are obtained. The two or more components of the volumetric video content are packed 504 into separate atlases video encoded bitstreams to obtain two or more different encoded components of the volumetric video content.
In or along the volumetric video bitstream, separate atlas settings and patch information for the two or more different encoded components are signaled 506.
[0231] Fig. 8 shows a schematic diagram of an example multimedia communication system within which various embodiments may be implemented.
[0232] Figure 8 is a graphical representation of an example multimedia communication system within which various embodiments may be implemented. A data source 1510 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats. An encoder 1520 may include or be connected with a pre-processing stage, such as data format conversion and/or filtering of the source signal. The encoder 1520 encodes the source signal into a coded media bitstream. It should be noted that a bitstream to be decoded may be received directly or indirectly from a remote device located within virtually any type of network. Additionally, the bitstream may be received from local hardware or software. The encoder 1520 may be capable of encoding more than one media type, such as audio and video, or more than one encoder 1520 may be required to code different media types of the source signal. The encoder 1520 may also get synthetically produced input, such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media. In the following, only processing of one coded media bitstream of one media type is considered to simplify the description. It should be noted, however, that typically real-time broadcast services comprise several streams (typically at least one audio, video and text sub-titling stream). It should also be noted that the system may include many encoders, but in the figure only one encoder 1520 is represented to simplify the description without loss of generality.
It should be further understood that, although text and examples contained herein may specifically describe an encoding process, one skilled in the art would understand that the same concepts and principles also apply to the corresponding decoding process and vice versa.
[0233] The coded media bitstream may be transferred to a storage 1530. The storage 1530 may comprise any type of mass memory to store the coded media bitstream. The format of the coded media bitstream in the storage 1530 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file, or the coded media bitstream may be encapsulated into a Segment format suitable for DASH (or a similar streaming system) and stored as a sequence of Segments. If one or more media bitstreams are encapsulated in a container file, a file generator (not shown in the figure) may be used to store the one or more media bitstreams in the file and create file format metadata, which may also be stored in the file. The encoder 1520 or the storage 1530 may comprise the file generator, or the file generator is operationally attached to either the encoder 1520 or the storage 1530. Some systems operate “live”, i.e. omit storage and transfer the coded media bitstream from the encoder 1520 directly to the sender 1540. The coded media bitstream may then be transferred to the sender 1540, also referred to as the server, on a need basis. The format used in the transmission may be an elementary self-contained bitstream format, a packet stream format, a Segment format suitable for DASH (or a similar streaming system), or one or more coded media bitstreams may be encapsulated into a container file. The encoder 1520, the storage 1530, and the server 1540 may reside in the same physical device or they may be included in separate devices.
The encoder 1520 and server 1540 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently, but rather buffered for small periods of time in the content encoder 1520 and/or in the server 1540 to smooth out variations in processing delay, transfer delay, and coded media bitrate.
[0234] The server 1540 sends the coded media bitstream using a communication protocol stack. The stack may include but is not limited to one or more of Real-Time Transport Protocol (RTP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), Transmission Control Protocol (TCP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, the server 1540 encapsulates the coded media bitstream into packets. For example, when RTP is used, the server 1540 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media type has a dedicated RTP payload format. It should be again noted that a system may contain more than one server 1540, but for the sake of simplicity, the following description only considers one server 1540.
[0235] If the media content is encapsulated in a container file for the storage 1530 or for inputting the data to the sender 1540, the sender 1540 may comprise or be operationally attached to a "sending file parser" (not shown in the figure). In particular, if the container file is not transmitted as such but at least one of the contained coded media bitstreams is encapsulated for transport over a communication protocol, a sending file parser locates appropriate parts of the coded media bitstream to be conveyed over the communication protocol. The sending file parser may also help in creating the correct format for the communication protocol, such as packet headers and payloads.
The multimedia container file may contain encapsulation instructions, such as hint tracks in the ISOBMFF, for encapsulation of at least one of the contained media bitstreams on the communication protocol.
[0236] The server 1540 may or may not be connected to a gateway 1550 through a communication network, which may e.g. be a combination of a CDN, the Internet and/or one or more access networks. The gateway may also or alternatively be referred to as a middle-box. For DASH, the gateway may be an edge server (of a CDN) or a web proxy. It is noted that the system may generally comprise any number of gateways or the like, but for the sake of simplicity, the following description only considers one gateway 1550. The gateway 1550 may perform different types of functions, such as translation of a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking of data streams, and manipulation of the data stream according to the downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded stream according to prevailing downlink network conditions. The gateway 1550 may be a server entity in various embodiments.
[0237] The system includes one or more receivers 1560, typically capable of receiving, demodulating, and decapsulating the transmitted signal into a coded media bitstream. The coded media bitstream may be transferred to a recording storage 1570. The recording storage 1570 may comprise any type of mass memory to store the coded media bitstream. The recording storage 1570 may alternatively or additionally comprise computation memory, such as random access memory. The format of the coded media bitstream in the recording storage 1570 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file.
If there are multiple coded media bitstreams, such as an audio stream and a video stream, associated with each other, a container file is typically used and the receiver 1560 comprises or is attached to a container file generator producing a container file from input streams. Some systems operate “live,” i.e. omit the recording storage 1570 and transfer the coded media bitstream from the receiver 1560 directly to the decoder 1580. In some systems, only the most recent part of the recorded stream, e.g., the most recent 10-minute excerpt of the recorded stream, is maintained in the recording storage 1570, while any earlier recorded data is discarded from the recording storage 1570.
[0238] The coded media bitstream may be transferred from the recording storage 1570 to the decoder 1580. If there are many coded media bitstreams, such as an audio stream and a video stream, associated with each other and encapsulated into a container file or a single media bitstream is encapsulated in a container file e.g. for easier access, a file parser (not shown in the figure) is used to decapsulate each coded media bitstream from the container file. The recording storage 1570 or a decoder 1580 may comprise the file parser, or the file parser is attached to either recording storage 1570 or the decoder 1580. It should also be noted that the system may include many decoders, but here only one decoder 1580 is discussed to simplify the description without loss of generality.
[0239] The coded media bitstream may be processed further by a decoder 1580, whose output is one or more uncompressed media streams. Finally, a renderer 1590 may reproduce the uncompressed media streams with a loudspeaker or a display, for example. The receiver 1560, recording storage 1570, decoder 1580, and renderer 1590 may reside in the same physical device or they may be included in separate devices.
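As a non-limiting illustration, a recording storage that maintains only the most recent part of the recorded stream, as described above for the recording storage 1570, may be sketched as follows. The class name, the use of timestamps in seconds and the chunk granularity are assumptions made for this sketch.

```python
from collections import deque


class RollingRecordingStorage:
    """Keeps only chunks whose timestamp lies within the most recent window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._chunks = deque()  # (timestamp_seconds, data), oldest first

    def append(self, timestamp, data):
        """Store a chunk and discard everything older than the window."""
        self._chunks.append((timestamp, data))
        cutoff = timestamp - self.window_seconds
        while self._chunks and self._chunks[0][0] < cutoff:
            self._chunks.popleft()

    def contents(self):
        """Return the retained chunk payloads, oldest first."""
        return [data for _, data in self._chunks]
```

With a window of 600 seconds this corresponds to the 10-minute example: a chunk is discarded as soon as a newer chunk arrives whose timestamp is more than 600 seconds later.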
[0240] A sender 1540 and/or a gateway 1550 may be configured to perform switching between different representations e.g. for switching between different viewports of 360-degree video content, view switching, bitrate adaptation and/or fast start-up, and/or a sender 1540 and/or a gateway 1550 may be configured to select the transmitted representation(s). Switching between different representations may take place for multiple reasons, such as to respond to requests of the receiver 1560 or prevailing conditions, such as throughput, of the network over which the bitstream is conveyed. In other words, the receiver 1560 may initiate switching between representations. A request from the receiver can be, e.g., a request for a Segment or a Subsegment from a different representation than earlier, a request for a change of transmitted scalability layers and/or sub-layers, or a change of a rendering device having different capabilities compared to the previous one. A request for a Segment may be an HTTP GET request. A request for a Subsegment may be an HTTP GET request with a byte range. Additionally or alternatively, bitrate adjustment or bitrate adaptation may be used for example for providing so-called fast start-up in streaming services, where the bitrate of the transmitted stream is lower than the channel bitrate after starting or random-accessing the streaming in order to start playback immediately and to achieve a buffer occupancy level that tolerates occasional packet delays and/or retransmissions. Bitrate adaptation may include multiple representation or layer up-switching and representation or layer down-switching operations taking place in various orders.
[0241] A decoder 1580 may be configured to perform switching between different representations e.g.
for switching between different viewports of 360-degree video content, viewpoint switching, bitrate adaptation and/or fast start-up, and/or a decoder 1580 may be configured to select the transmitted representation(s). Switching between different representations may take place for multiple reasons, such as to achieve faster decoding operation or to adapt the transmitted bitstream, e.g. in terms of bitrate, to prevailing conditions, such as throughput, of the network over which the bitstream is conveyed. Thus, the decoder may comprise means for requesting at least one decoder reset picture of the second representation for carrying out bitrate adaptation between the first representation and a third representation. Faster decoding operation might be needed for example if the device including the decoder 1580 is multi-tasking and uses computing resources for other purposes than decoding the video bitstream. In another example, faster decoding operation might be needed when content is played back at a faster pace than the normal playback speed, e.g. twice or three times faster than conventional real-time playback rate.
[0242] In the above, some embodiments have been described with reference to encoding. It needs to be understood that said encoding may comprise one or more of the following: encoding source image data into a bitstream, encapsulating the encoded bitstream in a container file and/or in packet(s) or stream(s) of a communication protocol, and announcing or describing the bitstream in a content description, such as the Media Presentation Description (MPD) of ISO/IEC 23009-1 (known as MPEG-DASH) or the IETF Session Description Protocol (SDP). Similarly, some embodiments have been described with reference to decoding.
It needs to be understood that said decoding may comprise one or more of the following: decoding image data from a bitstream, decapsulating the bitstream from a container file and/or from packet(s) or stream(s) of a communication protocol, and parsing a content description of the bitstream.
[0243] In the above, where the example embodiments have been described with reference to an encoder or an encoding method, it needs to be understood that the resulting bitstream and the decoder or the decoding method may have corresponding elements in them. Likewise, where the example embodiments have been described with reference to a decoder, it needs to be understood that the encoder may have structure and/or computer program for generating the bitstream to be decoded by the decoder.
[0244] In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0245] Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
[0246] Programs, such as those provided by Synopsys, Inc.
of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
[0247] The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended examples. Nevertheless, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

Claims

1. An apparatus comprising means for: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
2. The apparatus according to claim 1 further comprising means for: forming a supplementary enhancement information message for the settings and patch information signaling; and inserting the supplementary enhancement information message in or along the volumetric video bitstream to signal that specific network abstraction layer units should only be applied on a specific component of the volumetric video stream.
3. The apparatus according to claim 1 further comprising means for: signaling of a valid video encoded component of volumetric video inside a network abstraction layer unit in the volumetric video bitstream; and applying information contained in the network abstraction layer unit on a specific video encoded component of volumetric video.
4. The apparatus according to claim 1 further comprising means for: signaling of video encoded component in V-PCC unit header of V-PCC unit containing an atlas bitstream; and applying information contained in a network abstraction layer unit payload to a specific video encoded component of volumetric video.
5. The apparatus according to claim 1 further comprising means for: forming an extension information for the signaling; and inserting the extension information in the V-PCC parameter set and provide, in or along the V-PCC bitstream to signal that specific network abstraction layer units of atlas bitstream should only be applied on a specific component of the volumetric video stream.
6. The apparatus according to claim 5 further comprising means for: authoring a track where samples of this track comprise an extension information aligned with specific tracks comprising two or more video encoded bitstreams.
7. The apparatus according to any of the claims 1 to 6 further comprising means for: applying different patch packing methods for the two or more different components.
8. The apparatus according to any of the claims 1 to 7 further comprising means for: authoring a plurality of sets of tracks comprising two or more video encoded bitstreams and separate settings and patch information for the two or more different encoded components.
9. The apparatus according to claim 8 further comprising means for: signalling association between tracks in DASH representations via @associationId and @associationType attributes.
10. A method comprising: obtaining a presentation comprising volumetric video content; obtaining two or more components of the volumetric video content; packing the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signaling, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
11. The method according to claim 10 further comprising: forming a supplementary enhancement information message for the settings and patch information signaling; and inserting the supplementary enhancement information message in or along the volumetric video bitstream to signal that specific network abstraction layer units should only be applied on a specific component of the volumetric video stream.
12. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: obtain a presentation comprising volumetric video content; obtain two or more components of the volumetric video content; pack the two or more components of the volumetric video content into separate video encoded bitstreams to obtain two or more different encoded components of the volumetric video content; and signal, in or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components.
13. An apparatus comprising means for: receiving a volumetric video bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; unpacking patches from the two or more components of the volumetric video content by using the separate settings and patch information.
14. The apparatus according to claim 13 further comprising means for: decoding a supplementary enhancement information message to obtain the settings and patch information signaling; and retrieving the supplementary enhancement information message from or along the bitstream to determine whether specific network abstraction layer units should only be applied on a specific component of the volumetric video stream.
15. The apparatus according to claim 13 further comprising means for: obtaining a valid video encoded component of volumetric video from a network abstraction layer unit in the volumetric video bitstream; and applying information contained in the network abstraction layer unit on a specific video decoded component of volumetric video.
16. The apparatus according to claim 13 further comprising means for: receiving the signaling of video encoded component from a V-PCC unit header of a V-PCC unit containing an atlas bitstream; and applying information contained in a network abstraction layer unit payload to a specific video decoded component of volumetric video.
17. The apparatus according to claim 13 further comprising means for: receiving an extension information from the V-PCC parameter set; and obtaining, from or along the V-PCC bitstream a signal indicating that specific network abstraction layer units of atlas bitstream should only be applied on a specific component of the volumetric video stream.
18. The apparatus according to claim 17 further comprising means for: receiving a track where samples of this track comprise an extension information aligned with specific tracks comprising two or more video encoded bitstreams.
19. The apparatus according to any of the claims 13 to 18 further comprising means for: receiving a plurality of sets of tracks comprising two or more video encoded bitstreams and separate settings and patch information for the two or more different encoded components.
20. The apparatus according to claim 19 further comprising means for: receiving association between tracks in DASH representations via @associationId and @associationType attributes.
21. An apparatus comprising at least one processor and at least one memory, said at least one memory stored with computer program code thereon, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: receive a volumetric video bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receive, from or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components; decode, from the bitstream, the two or more components of a volumetric video content; unpack patches from the two or more components of the volumetric video content by using the separate settings and patch information.
22. A method comprising: receiving a volumetric video bitstream in a decoder, said bitstream comprising an encoded presentation of two or more components of a volumetric video content; receiving, from or along the volumetric video bitstream, separate settings and patch information for the two or more different encoded components; decoding, from the bitstream, the two or more components of a volumetric video content; unpacking patches from the two or more components of the volumetric video content by using the separate settings and patch information.
23. The method according to claim 22 further comprising: decoding a supplementary enhancement information message to obtain the settings and patch information signaling; and retrieving the supplementary enhancement information message from or along the bitstream to determine whether specific network abstraction layer units should only be applied on a specific component of the volumetric video stream.
PCT/FI2021/050110 2020-03-04 2021-02-17 An apparatus, a method and a computer program for volumetric video WO2021176133A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205226 2020-03-04

Publications (1)

Publication Number Publication Date
WO2021176133A1 (en) 2021-09-10

Family

ID=77612517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2021/050110 WO2021176133A1 (en) 2020-03-04 2021-02-17 An apparatus, a method and a computer program for volumetric video

Country Status (1)

Country Link
WO (1) WO2021176133A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3474562A1 (en) * 2017-10-20 2019-04-24 Thomson Licensing Method, apparatus and stream for volumetric video format
WO2019158821A1 (en) * 2018-02-19 2019-08-22 Nokia Technologies Oy An apparatus, a method and a computer program for volumetric video
EP3547703A1 (en) * 2018-03-30 2019-10-02 Thomson Licensing Method, apparatus and stream for volumetric video format
EP3709273A1 (en) * 2019-03-14 2020-09-16 Nokia Technologies Oy Signalling of metadata for volumetric video
US20210067757A1 (en) * 2019-08-29 2021-03-04 Electronics And Telecommunications Research Institute Method for processing immersive video and method for producing immersive video
WO2021063887A1 (en) * 2019-10-02 2021-04-08 Interdigital Vc Holdings France, Sas A method and apparatus for encoding, transmitting and decoding volumetric video

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"ISO/IEC JTC1/SC29/WG11. V-PCC Future Enhancements, MPEG document N18888. DIS stage draft of ISO/IEC 23090-5:2019(E) Information technology - Coded Representation of Immersive Media - Part 5: Video-based Point Cloud Compression", MPEG DOCUMENT MANAGEMENT SYSTEM, 12 January 2020 (2020-01-12), pages 1 - 209, XP030225588, Retrieved from the Internet <URL:http://wg11.sc29.org> [retrieved on 20201030] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230239508A1 (en) * 2020-04-03 2023-07-27 Intel Corporation Methods and apparatus to identify a video decoding error
CN113938667A (en) * 2021-10-25 2022-01-14 深圳普罗米修斯视觉技术有限公司 Video data transmission method and device based on video stream data and storage medium
CN113938667B (en) * 2021-10-25 2023-07-25 珠海普罗米修斯视觉技术有限公司 Video data transmission method, device and storage medium based on video stream data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21764630; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 EP: PCT application non-entry in European phase (Ref document number: 21764630; Country of ref document: EP; Kind code of ref document: A1)