WO2020070379A1

WO2020070379A1 - Method and apparatus for storage and signaling of compressed point clouds

Info

Publication number: WO2020070379A1
Application number: PCT/FI2019/050678
Authority: WO
Inventors: Emre Aksu; Igor Curcio; Miska Hannuksela; Vinod Kumar MALAMAL VADAKITAL; Lauri ILOLA; Sebastian Schwarz
Original assignee: Nokia Technologies Oy
Priority date: 2018-10-03
Filing date: 2019-09-20
Publication date: 2020-04-09

Abstract

A method, apparatus and computer program product are provided tosignal and store compressed point clouds in video encoding. The method, apparatus and computer program product may be utilized in conjunction with a variety of video formats. Relative to encoding of compressed point clouds, the method, apparatus and computer program product access a point cloud compression coded bitstream and cause storage of the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream. Relative to the decoding of compressed point clouds, the method,apparatus and computer program product receive a point cloud compression coded bitstream and decode the point cloud compression coded bitstream.

Description

METHOD AND APPARATUS FOR STORAGE AND SIGNALING OF COMPRESSED

POINT CLOUDS

TECHNICAL FIELD

[0001] An example embodiment relates generally to volumetric video encoding and decoding.

BACKGROUND

[0002] Volumetric video data represents a three-dimensional scene or object and can be used as input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in three dimensional (3D) space) and respective attributes (e.g. color, opacity, reflectance, etc.), plus temporal changes of the geometry and attributes at given time instances (like frames in two dimensional (2D) video). Volumetric video is either generated from 3D models, e.g., computer-generated imagery (CGI), or captured from real-world scenes using a variety of capture solutions, e.g., multiple cameras, a laser scan, a combination of video and dedicated depth sensors, and more. One example representation format for such volumetric data is point clouds. Temporal information about the scene can be included in the form of individual capture instances, e.g., “frames” in two dimensional (2D) video, or other mechanisms, e.g. the position of an object as a function of time.

[0003] Point clouds are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. However, standard point clouds suffer from poor temporal compression performance when a reconstructed 3D scene contains tens or even hundreds of millions of points. Identifying correspondences for motion- compensation in 3D-space is an ill-defined problem because geometry and other respective attributes may change. For example, temporal successive frames do not necessarily have the same number of point clouds. As a result, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, e.g., multiview and depth, have much better compression efficiency, but rarely cover the full scene. These 2D based approaches provide only limited six degree of freedom (6DOF) capabilities. Therefore, an approach that has good compression efficiency and high degree of freedom is desirable.

[0004] One approach that has both good compression efficiency and a high degree of freedom is to project volumetric models represented in compressed point clouds onto 2D planes. This approach allows for using 2D video coding tools with highly efficient temporal compression while maintaining a degree of freedom better than a purely 2D based approach. A compressed point cloud includes an encoded video bitstream corresponding to the texture of the 3D object, an encoded video bitstream corresponding to the geometry of the 3D object, and an encoded metadata bitstream corresponding to all other auxiliary information which is necessary during the composition of a 3D object. However, there is no well-defined, efficient and standardized storage and signalling mechanism for these streams.

BRIEF SUMMARY

[0005] A method, apparatus and computer program product are provided in accordance with an example embodiment to signal and store compressed point clouds in video encoding. The method, apparatus and computer program product may be utilized in conjunction with a variety of video formats.

[0006] In one example embodiment, a method is provided that includes accessing a point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream. The method further includes causing storage of the point cloud compression coded bitstream.

[0007] In some implementations of such a method, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[000S] In another example embodiment, an apparatus is provided that includes at least one processor and at least one memory including computer program code for one or more programs with the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to access a point cloud compression coded bitstream and cause storage of the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0009] In some implementations of such an apparatus, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[0010] In another example embodiment, an apparatus is provided that includes means for accessing a point cloud compression coded bitstream and means for causing storage of the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0011] In some implementations of such an apparatus, the means for causing storage of the point cloud compression coded bitstream includes means for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; means for detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; means for multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and means for causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, the means for causing storage of the point cloud compression coded bitstream includes means for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; means for packing the texture information bitstream and the geometry information bitstream into one video media track; means for storing the auxiliary metadata information in a timed metadata track; and means for causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[0012] In another example embodiment, a computer program product is provided that includes at least one non-transitory computer-readable storage medium having computer executable program code instructions stored therein with the computer executable program code instructions comprising program code instructions configured, upon execution, to access a point cloud compression coded bitstream and cause storage of the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0013] In some implementations of such a computer program product, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track. [0014] In another example embodiment, a method is provided that includes receiving a point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream. The method further includes decoding the point cloud compression coded bitstream.

[0015] In some implementations of such a method, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[0016] In another example embodiment, an apparatus is provided that includes at least one processor and at least one memory including computer program code for one or more programs with the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to receive a point cloud compression coded bitstream and decode the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0017] In some implementations of such an apparatus, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[0018] In another example embodiment, an apparatus is provided that includes means for receiving a point cloud compression coded bitstream and means for decoding the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0019] In some implementations of such an apparatus, the means for causing storage of the point cloud compression coded bitstream includes means for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; means for detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; means for multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and means for causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, the means for causing storage of the point cloud compression coded bitstream includes means for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; means for packing the texture information bitstream and the geometry information bitstream into one video media track; means for storing the auxiliary metadata information in a timed metadata track; and means for causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track. [0020] In another example embodiment, a computer program product is provided that includes at least one non-transitory computer-readable storage medium having computer executable program code instructions stored therein with the computer executable program code instructions comprising program code instructions configured, upon execution, to receive a point cloud compression coded bitstream and decode the point cloud compression coded bitstream. The point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream.

[0021] In some implementations of such a computer program product, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure; multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and causing construction and storage of a video media track. The video media track comprises the multiplexed one or more coded access units. In some embodiments, causing storage of the point cloud compression coded bitstream includes accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; storing the texture information bitstream in a texture video media track; storing the geometry information bitstream in a geometry video media track; storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] Having thus described certain example embodiments of the present disclosure in general terms, reference will hereinafter be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:

[0023] Figure 1 is a block diagram of an apparatus that may be specifically configured in accordance with an example embodiment of the present disclosure;

[0024] Figure 2A and 2B are graphical representations illustrating a set of operations performed, such as by the apparatus of Figure 1, in accordance with an example embodiment of the present disclosure; [0025] Figure 3 illustrates an example of terms in the present disclosure;

[0026] Figure 4 is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1, in accordance with an example embodiment of the present disclosure;

[0027] Figure 5 is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1 , in accordance with an example embodiment of the present disclosure;

[0028] Figure 6 is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1 , in accordance with an example embodiment of the present disclosure;

[0029] Figure 7 is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1 , in accordance with an example embodiment of the present disclosure;

[0030] Figure 8A is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1, in accordance with an example embodiment of the present disclosure;

[0031] Figure 8B provides a graphical representation of an example track used to signal a multiplexed PCC bitstream in accordance with an example embodiment of the present disclosure;

[0032] Figure 9A is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1 , in accordance with an example embodiment of the present disclosure;

[0033] Figure 9B provides a graphical representation of an example interleaving samples in accordance with an example embodiment of the present disclosure; and

[0034] Figure 10 is a flowchart illustrating a set of operations performed, such as by the apparatus of Figure 1, in accordance with an example embodiment of the present disclosure.

DETAILED DESCRIPTION

[0035] Some embodiments will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout. As used herein, the terms“data,”“content,”“information,” and similar terms may be used interchangeably to refer to data capable of being transmitted, received and/or stored in accordance with embodiments of the present invention. Thus, use of any such terms should not be taken to limit the spirit and scope of embodiments of the present invention.

[0036] Additionally, as used herein, the term‘circuitry’ refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry); (b) combinations of circuits and computer program product(s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation even if the software or firmware is not physically present. This definition of‘circuitry’ applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term‘circuitry’ also includes an implementation comprising one or more processors and/or portion(s) thereof and accompanying software and/or firmware. As another example, the term‘circuitry’ as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.

[0037] As defined herein, a“computer-readable storage medium,” which refers to a non- transitory physical storage medium (e.g., volatile or non-volatile memory device), can be differentiated from a“computer-readable transmission medium,” which refers to an electromagnetic signal.

[0038] A method, apparatus and computer program product are provided in accordance with an example embodiment to signal and store compressed point clouds in video encoding. The method, apparatus and computer program product may be utilized in conjunction with a variety of video formats including High Efficiency Video Coding standard (HEVC or H.265/HEVC), Advanced Video Coding standard (AVC or H.264/AVC), the upcoming Versatile Video Coding standard (VVC or H.266/VVC), and/or with a variety of video and multimedia file formats including International Standards Organization (ISO) base media file format (ISO/IEC 14496-12, which may be abbreviated as ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file formats for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and 3^rd Generation Partnership Project (3GPP file format) (3GPP Technical Specification 26.244, also known as the 3GP format). ISOBMFF is the base for derivation of all the above mentioned file formats. An example embodiment is described in conjunction with the HEVC, however, the present disclosure is not limited to HEVC, but rather the description is given for one possible basis on top of which an example embodiment of the present disclosure may be partly or fully realized.

[0039] Some aspects of the disclosure relate to container file formats, such as

International Standards Organization (ISO) base media file format (ISO/IEC 14496-12, which may be abbreviated as ISOBMFF), Moving Picture Experts Group (MPEG)-4 file format (ISO/IEC 14496-14, also known as the MP4 format), file formats for NAL (Network Abstraction Layer) unit structured video (ISO/IEC 14496-15) and 3^rd Generation Partnership Project (3GPP file format) (3GPP Technical Specification 26.244, also known as the 3GP format). Example embodiments are described in conjunction with the ISOBMFF or its derivatives, however, the present disclosure is not limited to ISOBMFF, but rather the description is given for one possible basis on top of which an example embodiment of the present disclosure may be partly or fully realized.

[0040] An elementary unit for the output of an H.264/advanced video coding (AVC) or HEYC encoder and the input of an H.264/AVC or HEVC decoder, respectively, is a NAF unit. For transport over packet-oriented networks or storage into structured files, NAL units may be encapsulated into packets or similar structures. In ISO base media file format, NAF units of an access unit form a sample, the size of which is provided within file format metadata.

[0041] A bytestream format has been specified in H.264/AVC and HEVC for

transmission or storage environments that do not provide framing structures. The bytestream format separates NAL units from each other by attaching a start code in front of each NAL unit. To avoid false detection of NAL unit boundaries, encoders run a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload if a start code would have occurred otherwise. In order to enable straightforward gateway operation between packet- and stream-oriented systems, start code emulation prevention may always be performed regardless of whether the bytestream format is in use or not. A NAL unit may be defined as a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload (RBSP) interspersed as necessary with emulation prevention bytes. A RBSP may be defined as a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0.

[0042] NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL NAL units are typically coded slice NAL units.

[0043] A non-VCL NAL unit may be, for example, one of the following types: a sequence parameter set, a picture parameter set, a supplemental enhancement information (SEI) NAL unit, an access unit delimiter, an end of sequence NAL unit, an end of bitstream NAL unit, or a filler data NAL unit. Parameter sets may be needed for the reconstruction of decoded pictures, whereas many of the other non-VCL NAL units are not necessary for the

reconstruction of decoded sample values. [0044] An SEI NAL unit may contain one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC and HEVC, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. E1.264/AVC and HEVC contain the syntax and semantics for the specified SEI messages but no process for handling the messages in the recipient is defined. Consequently, encoders are required to follow the H.264/AVC standard or the HEVC standard when they create SEI messages, and decoders conforming to the H.264/AVC standard or the HEVC standard, respectively, are not required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in H.264/AVC and HEVC is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.

[0045] In HEVC, there are two types of SEI NAL units, namely the suffix SEI NAL unit and the prefix SEI NAL unit, having a different nal_unit_type value from each other. The SEI message(s) contained in a suffix SEI NAL unit are associated with the VCL NAL unit preceding, in decoding order, the suffix SEI NAL unit. The SEI message(s) contained in a prefix SEI NAL unit are associated with the VCL NAL unit following, in decoding order, the prefix SEI NAL unit.

[0046] Video coding specifications may contain a set of constraints for associating data units (e.g., NAL units in H.264/AVC or HEVC) into access units. These constraints may be used to conclude access unit boundaries from a sequence of NAL units. For example, the following is specified in the HEVC standard:

- An access unit consists of one coded picture with nuh layer id equal to 0, zero or more video coding layer (VCL) NAL units with nuh_layer_id greater than 0 and zero or more non- VCL NAL units.

Let firstBIPicNalUnit be the first VCL NAL unit of a coded picture with nuh layer id equal to 0. The first of any of the following NAL units preceding firstBIPicNalUnit and succeeding the last VCL NAL unit preceding firstBIPicNalUnit, if any, specifies the start of a new access unit:

o access unit delimiter NAL unit with nuh_layer_id equal to 0 (when present), o video parameter set (VPS) NAL unit with nuh_layer_id equal to 0 (when present),

o sequence parameter set (SPS) NAL unit with nuh_layer_id equal to 0 (when present),

o picture parameter set (PPS) NAL unit with nuh_layer_id equal to 0 (when present), o Prefix supplemental enhancement information (SEI) NAL unit with nuh_layer_id equal to 0 (when present),

o NAL units with nal unit type in the range of RSV_NVCL4l ...

RSV_NVCL44 (which represent reserved NAL units) with nuh_layer_id equal to 0 (when present),

o NAL units with nal_unit_type in the range of UNSPEC48... UNSPEC55 (which represent unspecified NAL units) with nuh_layer_id equal to 0 (when present).

The first NAL unit preceding firstBlPicNalUnit and succeeding the last VCL NAL unit preceding firstBlPicNalUnit, if any, can only be one of the above-listed NAL units.

When there is none of the above NAL units preceding firstBlPicNalUnit and succeeding the last VCL NAL preceding firstBlPicNalUnit, if any, firstBlPicNalUnit starts a new access unit.

[0047] Frame packing may be defined to comprise arranging more than one input picture, which may be referred to as (input) constituent frames, into an output picture. In general, frame packing is not limited to any particular type of constituent frames or the constituent frames need not have a particular relation with each other. In many cases, frame packing is used for arranging constituent frames of a stereoscopic video clip into a single picture sequence. The arranging may include placing the input pictures in spatially non-overlapping areas within the output picture. For example, in a side-by-side arrangement, two input pictures are placed within an output picture horizontally adjacently to each other. The arranging may also include partitioning of one or more input pictures into two or more constituent frame partitions and placing the constituent frame partitions in spatially non-overlapping areas within the output picture. The output picture or a sequence of frame-packed output pictures may be encoded into a bitstream e.g., by a video encoder. The bitstream may be decoded e.g., by a video decoder. The decoder or a post-processing operation after decoding may extract the decoded constituent frames from the decoded picture(s), e.g., for displaying.

[0048] A basic building block in the ISO base media file format is called a box. Each box has a header and a payload. The box header indicates the type of the box and the size of the box in terms of bytes. A box may enclose other boxes, and the ISO file format specifies which box types are allowed within a box of a certain type. Furthermore, the presence of some boxes may be mandatory in each file, while the presence of other boxes may be optional. Additionally, for some box types, it may be allowable to have more than one box present in a file. Thus, the ISO base media file format may be considered to specify a hierarchical structure of boxes. [0049] According to the ISO base media file format, a file includes media data and metadata that are encapsulated into boxes. Each box is identified by a four character code (4CC) and starts with a header which informs about the type and size of the box.

[0050] Many files formatted according to the ISO base media file format start with a file type box, also referred to as FileTypeBox or the ftyp box. The ftyp box contains information of the brands labeling the file. The ftyp box includes one major brand indication and a list of compatible brands. The major brand identifies the most suitable file format specification to be used for parsing the file. The compatible brands indicate to which file format specifications and/or conformance points the file conforms ft is possible that a file is conformant to multiple specifications. All brands indicating compatibility to these specifications should be listed, so that a reader only understanding a subset of the compatible brands can get an indication that the file can be parsed. Compatible brands also give a permission for a file parser of a particular file format specification to process a file containing the same particular file format brand in the ftyp box. A file player may check if the ftyp box of a file comprises brands it supports, and may parse and play the file only if any file format specification supported by the file player is listed among the compatible brands.

[0051] In files conforming to the ISO base media file format, the media data may be provided in one or more instances of MediaDataBox (‘mdat‘) and the MovieBox (‘moov’) may be used to enclose the metadata for timed media. In some cases, for a file to be operable, both of the‘mdat’ and‘moov’ boxes may be required to be present. The‘moov’ box may include one or more tracks, and each track may reside in one corresponding TrackBox (‘trak’). Each track is associated with a handler, identified by a four-character code, specifying the track type. Video, audio, and image sequence tracks can be collectively called media tracks, and they contain an elementary media stream. Other track types comprise hint tracks and timed metadata tracks.

[0052] Tracks comprise samples, such as audio or video frames. For video tracks, a media sample may correspond to a coded picture or an access unit. A media track refers to samples (which may also be referred to as media samples) formatted according to a media

compression format (and its encapsulation to the ISO base media file format). A hint track refers to hint samples, containing cookbook instructions for constructing packets for transmission over an indicated communication protocol. A timed metadata track may refer to samples describing referred media and/or hint samples.

[0053] The 'trak' box includes in its hierarchy of boxes the SampleTableBox (also known as the sample table or the sample table box). The SampleTableBox contains the

SampleDescriptionBox, which gives detailed information about the coding type used, and any initialization information needed for that coding. The SampleDescriptionBox contains an entry-count and as many sample entries as the entry-count indicates. The format of sample entries is track-type specific but derived from generic classes (e.g., VisualSampleEntry, AudioSampleEntry). The type of sample entry form used for derivation the track-type specific sample entry format is determined by the media handler of the track.

[0054] A TrackTypeBox may be contained in a TrackBox. The payload of TrackTypeBox has the same syntax as the payload of FileTypeBox. The content of an instance of

TrackTypeBox shall be such that it would apply as the content of FileTypeBox, if all other tracks of the file were removed and only the track containing this box remained in the file.

[0055] Movie fragments may be used, for example, when recording content to ISO files, for example, in order to avoid losing data if a recording application crashes, runs out of memory space, or some other incident occurs. Without movie fragments, data loss may occur because the file format may require that all metadata, for example, a movie box, be written in one contiguous area of the file. Furthermore, when recording a file, there may not be sufficient amount of memory space to buffer a movie box for the size of the storage available, and re-computing the contents of a movie box when the movie is closed may be too slow. Moreover, movie fragments may enable simultaneous recording and playback of a file using a regular ISO file parser. Furthermore, a smaller duration of initial buffering may be required for progressive downloading, e.g., simultaneous reception and playback of a file when movie fragments are used and the initial movie box is smaller compared to a file with the same media content but structured without movie fragments.

[0056] The movie fragment feature may enable splitting the metadata that otherwise might reside in the movie box into multiple pieces. Each piece may correspond to a certain period of time of a track. In other words, the movie fragment feature may enable interleaving file metadata and media data. Consequently, the size of the movie box may be limited and the use cases mentioned above be realized.

[0057] In some examples, the media samples for the movie fragments may reside in an mdat box. For the metadata of the movie fragments, however, a moof box may be provided. The moof box may include the information for a certain duration of playback time that would previously have been in the moov box. The moov box may still represent a valid movie on its own, but in addition, it may include an mvex box indicating that movie fragments will follow in the same file. The movie fragments may extend the presentation that is associated to the moov box in time.

[0058] Within the movie fragment there may be a set of track fragments, including anywhere from zero to a plurality of track fragments per track. The track fragments may in turn include anywhere from zero to a plurality of track runs, each of which document a contiguous run of samples for that track (and hence are similar to chunks). Within these structures, many fields are optional and can be defaulted. The metadata that may be included in the moof box may be limited to a subset of the metadata that may be included in a moov box and may be coded differently in some cases. Details regarding the boxes that can be included in a moof box may be found in the ISOBMFF specification.

[0059] A self-contained movie fragment may be defined to consist of a moof box and an mdat box that are consecutive in the file order with the mdat box containing the samples of the movie fragment (for which the moof box provides the metadata) and does not contain samples of any other movie fragment, that is, any other moof box. A media segment may comprise one or more self-contained movie fragments. A media segment may be used for delivery, such as streaming, e.g., in MPEG-Dynamic Adaptive Streaming over Hypertext Transfer Protocol (HTTP) (MPEG-DASH).

[0060] The track reference mechanism can be used to associate tracks with each other. The TrackReferenceBox includes box(es), each of which provides a reference from the containing track to a set of other tracks. These references are labelled through the box type (e.g., the four-character code of the box) of the contained box(es). The ISO Base Media File Format contains three mechanisms for timed metadata that can be associated with particular samples: sample groups, timed metadata tracks, and sample auxiliary information. A derived specification may provide similar functionality with one or more of these three mechanisms.

[0061] TrackGroupBox, which is contained in TrackBox, enables indication of groups of tracks where each group shares a particular characteristic or the tracks within a group have a particular relationship. The box contains zero or more boxes, and the particular characteristic or the relationship is indicated by the box type of the contained boxes. The contained boxes include an identifier, which can be used to conclude the tracks belonging to the same track group. The tracks that contain the same type of a contained box within the TrackGroupBox and have the same identifier value within these contained boxes belong to the same track group. The syntax of the contained boxes may be defined through TrackGroupTypeBox is follows:

aligned(8) class TrackGroupTypeBox(unsigned int(32) track_group_type) extends FullBox(track_group_type, version = 0, flags = 0)

{

unsigned int(32) track_group_id;

// the remaining data may be specified

//for a particular track_group_type

} [0062] The ISO Base Media File Format contains three mechanisms for timed metadata that can be associated with particular samples: sample groups, timed metadata tracks, and sample auxiliary information. A derived specification may provide similar functionality with one or more of these three mechanisms.

[0063] A sample grouping in the ISO base media file format and its derivatives, such as the AYC file format and the scalable video coding (SVC) file format, may be defined as an assignment of each sample in a track to be a member of one sample group, based on a grouping criterion. A sample group in a sample grouping is not limited to being contiguous samples and may contain non-adjacent samples. As there may be more than one sample grouping for the samples in a track, each sample grouping may have a type field to indicate the type of grouping. Sample groupings may be represented by two linked data structures: (1) a SampleToGroupBox (sbgp box) represents the assignment of samples to sample groups; and (2) a SampleGroupDescriptionBox (sgpd box) contains a sample group entry for each sample group describing the properties of the group. There may be multiple instances of the

SampleToGroupBox and SampleGroupDescriptionBox based on different grouping criteria. These may be distinguished by a type field used to indicate the type of grouping.

SampleToGroupBox may comprise a grouping_type_parameter field that can be used e.g., to indicate a sub-type of the grouping.

[0064] Per-sample auxiliary information may be stored anywhere in the same file as the sample data itself. For self-contained media files, this is typically in a MediaDataBox or a box from a derived specification. It is stored either (a) in multiple chunks, with the number of samples per chunk, as well as the number of chunks, matching the chunking of the primary sample data or (b) in a single chunk for all the samples in a movie sample table (or a movie fragment). The Sample Auxiliary Information for all samples contained within a single chunk (or track run) is stored contiguously (similarly to sample data).

[0065] Sample Auxiliary Information, when present, is stored in the same file as the samples to which it relates as they share the same data reference ('dref ) structure. However, this data may be located anywhere within this file, using auxiliary information offsets ('saio') to indicate the location of the data.

[0066] The restricted video ('resv') sample entry and mechanism has been specified for the ISOBMFF in order to handle situations where the file author requires certain actions on the player or Tenderer after decoding of a visual track. Players not recognizing or not capable of processing the required actions are stopped from decoding or rendering the restricted video tracks. The 'resv' sample entry mechanism applies to any type of video codec. A

RestrictedSchemelnfoBox is present in the sample entry of 'resv' tracks and comprises an OriginalFormatBox, SchemeTypeBox, and SchemelnformationBox. The original sample entry type that would have been unless the 'resv' sample entry type were used is contained in the OriginalFormatBox. The SchemeTypeBox provides an indication which type of processing is required in the player to process the video. The SchemelnformationBox comprises further information of the required processing. The scheme type may impose requirements on the contents of the SchemelnformationBox. For example, the stereo video scheme indicated in the SchemeTypeBox indicates that when decoded frames either contain a representation of two spatially packed constituent frames that form a stereo pair (frame packing) or only one view of a stereo pair (left and right views in different tracks).

StereoVideoBox may be contained in a SchemelnformationBox to provide further information, e.g., on which type of frame packing arrangement has been used (e.g., side-by- side or top-bottom).

[0067] Several types of stream access points (SAPs) have been specified. SAP Type 1 corresponds to what is known in some coding schemes as a“Closed group of pictures (GOP) random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps) and in addition the first picture in decoding order is also the first picture in presentation order. SAP Type 2 corresponds to what is known in some coding schemes as a“Closed GOP random access point” (in which all pictures, in decoding order, can be correctly decoded, resulting in a continuous time sequence of correctly decoded pictures with no gaps), for which the first picture in decoding order may not be the first picture in presentation order. SAP Type 3 corresponds to what is known in some coding schemes as an“Open GOP random access point”, in which there may be some pictures in decoding order that cannot be correctly decoded and have presentation times less than intra-coded picture associated with the SAP.

[0068] A stream access point (SAP) sample group as specified in ISOBMFF identifies samples as being of the indicated SAP type.

[0069] A sync sample may be defined as a sample corresponding to SAP type 1 or 2. A sync sample can be regarded as a media sample that starts a new independent sequence of samples; if decoding starts at the sync sample, it and succeeding samples in decoding order can all be correctly decoded, and the resulting set of decoded samples forms the correct presentation of the media starting at the decoded sample that has the earliest composition time. Sync samples can be indicated with the SyncSampleBox (for those samples whose metadata is present in a TrackBox) or within sample flags indicated or inferred for track fragment runs. [0070] Files conforming to the ISOBMFF may contain any non-timed objects, referred to as items, meta items, or metadata items, in a meta box (fourCC:‘meta’), which may also be called MetaBox. While the name of the meta box refers to metadata, items can generally contain metadata or media data. The meta box may reside at the top level of the file, within a movie box (fourCC:‘moov’), and within a track box (fourCC:‘trak’), but at most one meta box may occur at each of the file level, movie level, or track level. The meta box may be required to contain a‘hdlr box indicating the structure or format of the‘meta’ box contents. The meta box may list and characterize any number of items that can be referred and each one of them can be associated with a file name and can be uniquely identified with the file by item identifier (item id) which is an integer value. The metadata items may be for example stored in the 'idaf box of the meta box or in an 'mdat' box or reside in a separate file. If the metadata is located external to the file then its location may be declared by the DatalnformationBox (fourCC:‘dinf ). In the specific case that the metadata is formatted using extensible Markup Language (XML) syntax and is required to be stored directly in the MetaBox, the metadata may be encapsulated into either the XMLBox (fourCC:‘xml‘) or the BinaryXMLBox (fourcc:‘bxmF). An item may be stored as a contiguous byte range, or it may be stored in several extents, each being a contiguous byte range. In other words, items may be stored fragmented into extents, e.g., to enable interleaving. An extent is a contiguous subset of the bytes of the resource, and the resource can be formed by concatenating the extents.

[0071] High Efficiency Image File Format (HEIF) is a standard developed by the Moving Picture Experts Group (MPEG) for storage of images and image sequences. Among other things, the standard facilitates file encapsulation of data coded according to the High

Efficiency Video Coding (HEVC) standard. HEIF includes features building on top of the used ISO Base Media File Format (ISOBMFF).

[0072] The ISOBMFF structures and features are used to a large extent in the design of HEIF. The basic design for HEIF comprises that still images are stored as items and image sequences are stored as tracks.

[0073] In the context of HEIF, the following boxes may be contained within the root-level

'meta' box and may be used as described hereinafter. In HEIF, the handler value of the Handler box of the 'meta' box is 'picf . The resource (whether within the same file, or in an external file identified by a uniform resource identifier) containing the coded media data is resolved through the Data Information (’dinf) box, whereas the Item Location (’iloc’) box stores the position and sizes of every item within the referenced file. The Item Reference ('iref ) box documents relationships between items using typed referencing. If there is an item among a collection of items that is in some way to be considered the most important compared to others then this item is signaled by the Primary Item ('pitm') box. Apart from the boxes mentioned here, the 'meta' box is also flexible to include other boxes that may be necessary to describe items.

[0074] Any number of image items can be included in the same file. Given a collection of images stored by using the 'meta' box approach, certain relationships may be qualified between images. Examples of such relationships include indicating a cover image for a collection, providing thumbnail images for some or all of the images in the collection, and associating some or all of the images in a collection with an auxiliary image such as an alpha plane. A cover image among the collection of images is indicated using the 'pitm' box. A thumbnail image or an auxiliary image is linked to the primary image item using an item reference of type’thmb¹ or 'auxf, respectively.

[0075] The ltemPropertiesBox enables the association of any item with an ordered set of item properties. Item properties are small data records. The ltemPropertiesBox consists of two parts: an ItemPropertyContainerBox that contains an implicitly indexed list of item properties, and one or more ItemPropertyAssociationBox(es) that associate items with item properties.

An item property is formatted as a box.

[0076] A descriptive item property may be defined as an item property that describes rather than transforms the associated item. A transformative item property may be defined as an item property that transforms the reconstructed representation of the image item content.

[0077] An entity may be defined as a collective term of a track or an item. An entity group is a grouping of items, which may also group tracks. An entity group can be used instead of item references, when the grouped entities do not have clear dependency or a directional reference relation. The entities in an entity group share a particular characteristic or have a particular relationship, as indicated by the grouping type.

[0078] An entity group is a grouping of items, which may also group tracks. The entities in an entity group share a particular characteristic or have a particular relationship, as indicated by the grouping type.

[0079] Entity groups are indicated in GroupsListBox. Entity groups specified in a GroupsListBox of a file-level MetaBox refer to tracks or file-level items. Entity groups specified in a GroupsListBox of a movie-level MetaBox refer to movie-level items. Entity groups specified in a GroupsListBox of a track-level MetaBox refer to track-level items of that track.

[0080] A GroupsListBox contains EntityToGroupBoxes, each specifying one entity group. The syntax of EntityToGroupBox may be specified as follows: aligned(8) class EntityToGroupBox(grouping_type, version, flags)

extends FullBox(grouping_type, version, flags) {

unsigned int(32) group_id;

unsigned int(32) num_entities_in_group;

for(i=0; i<num_entities_in_group; i++)

unsigned int(32) entity_id;

// the remaining data may be specified

// for a particular grouping type

}

wherein entity_id is resolved to an item, when an item with item_ID equal to entity id is present in the hierarchy level (file, movie or track) that contains the GroupsListBox, or to a track, when a track with track_ID equal to entity_id is present and the GroupsListBox is contained in the file level.

[0081] Region- wise packing information may be encoded as metadata in or along the video bitstream, e.g., in a RegionWisePackingBox of OMAF and/or in a region-wise packing SEI message of HEYC or H.264/AVC. For example, the packing information may comprise a region-wise mapping from a pre-defined or indicated source format to the packed frame format, e.g., from a projected picture to a packed picture, as described earlier.

[0082] Region-wise packing may for example be rectangular region-wise packing. An example rectangular region-wise packing metadata is described next in the context of associating regions of an omnidirectional projected picture (e.g., of equirectangular or cubemap projection) with regions of a packed picture (e.g., being subject to encoding or decoding): For each region, the metadata defines a rectangle in a projected picture, the respective rectangle in the packed picture, and an optional transformation of rotation by 90, 180, or 270 degrees and/or horizontal and/or vertical mirroring. Rectangles may for example be indicated by the locations of the top-left comer and the bottom-right comer. The mapping may comprise resampling. As the sizes of the respective rectangles can differ in the projected and packed pictures, the mechanism infers region-wise resampling.

[0083] A Multipurpose Internet Mail Extension (MIME) is an extension to an email protocol which makes it possible to transmit and receive different kinds of data files on the Internet, for example video and audio, images, software, etc. An internet media type is an identifier used on the Internet to indicate the type of data that a file contains. Such internet media types may also be called content types. Several MIME type/subtype combinations exist that can contain different media formats. Content type information may be included by a transmitting entity in a MIME header at the beginning of a media transmission. A receiving entity thus may need to examine the details of such media content to determine if the specific elements can be rendered given an available set of codecs. Especially when the end system has limited resources, or the connection to the end system has limited bandwidth, it may be helpful to know from the content type alone if the content can be rendered.

[0084] Internet Engineering Tasking Force (IETF) Request For Comments (RFC) 6381 specifies two parameters, namely, 'codecs' and 'profiles', that are used with various MIME types or type/subtype combinations to allow for unambiguous specification of the codecs employed by the media formats contained within, or the profile(s) of the overall container format.

[0085] By labelling, in the codecs MIME parameter, content with the specific codecs indicated to render the contained media, receiving systems can determine if the codecs are supported by the end system, and if not, can take appropriate action (such as rejecting the content, sending notification of the situation, transcoding the content to a supported type, fetching and installing the required codecs, further inspection to determine if it will be sufficient to support a subset of the indicated codecs, etc.).

[0086] For file formats derived from the ISOBMFF, the codecs parameter specified in RFC 6381 is derived from the sample entry type(s) of the track(s) of the file described by the MIME type.

[0087] Similarly, the profiles parameter specified in RFC 6381 can provide an overall indication, to the receiver, of the specifications with which the content complies. This is an indication of the compatibility of the container format and its contents to some specification. The receiver may be able to work out the extent to which it can handle and render the content by examining to see which of the declared profiles it supports, and what they mean. For file formats derived from the ISOBMFF, the profiles MIME parameter contains a list of four- character codes that are indicated as compatible brands in the FileTypeBox of the file described by the MIME type.

[0088] Regardless of the file format of the video bitstream, the apparatus of an example embodiment may be provided by any of a wide variety of computing devices including, for example, a video encoder, a video decoder, a computer workstation, a server or the like, or by any of various mobile computing devices, such as a mobile terminal, e.g., a smartphone, a tablet computer, a video game player, or the like.

[0089] Regardless of the computing device that embodies the apparatus, the apparatus 10 of an example embodiment includes, is associated with or is otherwise in communication with processing circuitry 12, a memory 14, a communication interface 16 and optionally, a user interface 18 as shown in Figure 1.

[0090] The processing circuitry 12 may be in communication with the memory device 14 via a bus for passing information among components of the apparatus 10. The memory device may be non-transitory and may include, for example, one or more volatile and/or non volatile memories. In other words, for example, the memory device may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processing circuitry). The memory device may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present disclosure. For example, the memory device could be configured to buffer input data for processing by the processing circuitry. Additionally or alternatively, the memory device could be configured to store instructions for execution by the processing circuitry.

[0091] The apparatus 10 may, in some embodiments, be embodied in various computing devices as described above. However, in some embodiments, the apparatus may be embodied as a chip or chip set. In other words, the apparatus may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard). The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus may therefore, in some cases, be configured to implement an embodiment of the present disclosure on a single chip or as a single“system on a chip.” As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.

[0092] The processing circuitry 12 may be embodied in a number of different ways. For example, the processing circuitry may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing element with or without an accompanying DSP, or various other circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit), an FPGA (field programmable gate array), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processing circuitry may include one or more processing cores configured to perform independently. A multi-core processing circuitry may enable multiprocessing within a single physical package. Additionally or alternatively, the processing circuitry may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.

[0093] In an example embodiment, the processing circuitry 12 may be configured to execute instructions stored in the memory device 14 or otherwise accessible to the processing circuitry. Alternatively or additionally, the processing circuitry may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processing circuitry may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present disclosure while configured accordingly. Thus, for example, when the processing circuitry is embodied as an ASIC, FPGA or the like, the processing circuitry may be specifically configured hardware for conducting the operations described herein.

Alternatively, as another example, when the processing circuitry is embodied as an executor of instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processing circuitry may be a processor of a specific device (e.g., an image or video processing system) configured to employ an embodiment of the present invention by further configuration of the processing circuitry by instructions for performing the algorithms and/or operations described herein. The processing circuitry may include, among other things, a clock, an arithmetic logic unit (ALU) and logic gates configured to support operation of the processing circuitry.

[0094] The communication interface 16 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data, including video bitstreams. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the

communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB) or other mechanisms.

[0095] In some embodiments, such as in instances in which the apparatus 10 is configured to encode the video bitstream, the apparatus 10 may optionally include a user interface 18 that may, in turn, be in communication with the processing circuitry 12 to provide output to a user, such as by outputting an encoded video bitstream and, in some embodiments, to receive an indication of a user input. As such, the user interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. Alternatively or additionally, the processing circuitry may comprise user interface circuitry configured to control at least some functions of one or more user interface elements such as a display and, in some embodiments, a speaker, ringer, microphone and/or the like. The processing circuitry and/or user interface circuitry comprising the processing circuitry may be configured to control one or more functions of one or more user interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processing circuitry (e.g., memory device 14, and/or the like).

[0096] When describing HEVC and in certain example embodiments, common notation for arithmetic operators, logical operators, relation operators, bit-wise operators, assignment operators, and range notation e.g., as specified in HEVC may be used. Furthermore, common mathematical functions e.g., as specified in HEVC may be used and a common order or precedence and execution order (from left to right or from right to left) of operators e.g., as specified in HEVC may be used.

[0097] When describing example embodiments that utilize HEVC, only the relevant parts of syntax structures are included in the description of the syntax and semantics. However, other embodiments could be realized similarly with a different order and names of syntax elements.

[0098] When describing example embodiments, the term file is sometimes used as a synonym of syntax structure or an instance of a syntax structure. In other contexts, the term file may be used to mean a computer file, that is a resource forming a standalone unit in storage.

[0099] When describing HEVC and in certain example embodiments, a syntax structure may be specified as described below. A group of statements enclosed in curly brackets is a compound statement and is treated functionally as a single statement. A "while" structure specifies a test of whether a condition is true, and if true, specifies evaluation of a statement (or compound statement) repeatedly until the condition is no longer true. A "do ... while" structure specifies evaluation of a statement once, followed by a test of whether a condition is true, and if true, specifies repeated evaluation of the statement until the condition is no longer true. An "if ... else" structure specifies a test of whether a condition is true, and if the condition is true, specifies evaluation of a primary statement, otherwise, specifies evaluation of an alternative statement. The "else" part of the structure and the associated alternative statement is omitted if no alternative statement evaluation is needed. A "for" structure specifies evaluation of an initial statement, followed by a test of a condition, and if the condition is true, specifies repeated evaluation of a primary statement followed by a subsequent statement until the condition is no longer true. [00100] Volumetric video data represents a three-dimensional scene or object and can be used as input for AR, VR and MR applications. Such data describes geometry (shape, size and position in 3D-space) and respective attributes (e.g. color, opacity, reflectance, etc.), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI, or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, e.g.,“frames” in 2D video, or other mechanisms, e.g., position of an object as a function of time.

[00101] Because volumetric video represents a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.

[00102] Increasing computational resources and advances in 3D data acquisition devices has enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense Voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds on the other hand are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding the 3D data as set of texture and depth map as is the case in the multi- view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps, and multi-level surface maps.

[00103] In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations need to be stored or interchanged between computing entities, efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion- compensation in 3D-space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporal successive“frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, e.g., multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

[00104] Because of the limitations of the two approaches mentioned above, a third approach is developed. A 3D scene, represented as meshes, points, and/or voxels, can be projected onto one, or more, geometries. These geometries are“unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the encoding format).

[00105] Projecting volumetric models onto 2D planes allows for using 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency is increased greatly. Using geometry-projections instead of 2D-video based approaches, e.g., multiview and depth, provide a better coverage of the scene (and/or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time

compression/decompression of the projected planes. The projection and reverse projection steps are of relatively low complexity.

[00106] Figures 2A and 2B provide an overview of the compression and decompression processes, respectively, of point clouds. As illustrated in block 202, an apparatus, such as apparatus 10 of Figure 1, includes means, such as the processing circuitry 12, for patch generation after receiving input point cloud frames. The patch generation process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. One example way of patch generation is specified in MPEG document Point Cloud Coding Test Model Category 2 version 0.0 (TMC2V0). The approach is described as follows.

[00107] First, the normal at every point is estimated. An initial clustering of the point cloud is then obtained by associating each point with one of the following six oriented planes, defined by their normal:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

(0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0).

[00108] Each point is associated with the plane that has the closest normal (e.g., maximizes the dot product of the point normal and the plane normal). The initial clustering is then refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbours. The final step consists of extracting patches by applying a connected component extraction procedure.

[00109] As illustrated in block 204, the apparatus includes means, such as the processing circuitry 12, for packing. The packing process aims at mapping the extracted patches onto a 2D grid while trying to minimize the unused space and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. T is a user-defined parameter that is encoded in a bitstream and sent to a decoder. One example packing strategy is specified in TMC2vO.

[00110] The packing strategy specified in TMC2vO is a simple packing strategy that iteratively tries to insert patches into a WxH grid. W and H are user defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping- free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch then the height H of the grid is temporarily doubled and the search is conducted again. At the end of the process, H is clipped so as to fit the used grid cells.

[00111] As illustrated in blocks 206A and 206B, the apparatus includes means, such as the processing circuitry 12, for image generation. The image generation process utilizes the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch is projected onto two images, referred to as layers. More precisely, let H(u,v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u,v) with the lowest depth DO. The second layer, referred to as the far layer, captures the point of H(u,v) with the highest depth within the interval [DO, DO+D], where D is a user-defined parameter that describes the surface thickness. In some embodiments, the generated videos may have a geometry video of WxH YUV420-8bit and texture of WxH YUV420-8bit. YUV is a color encoding system where Y stand for luminance component and UV stand for two chrominance components.

[00112] As illustrated in block 208, the apparatus includes means, such as the processing circuitry 12, for image padding. The apparatus is configured to pad the geometry and texture images. The padding process aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. One example way of padding is provided in TMC2vO. The padding strategy in TMC2vO is specified as follows: • Each block of TxT (e.g., 16x16) pixels is processed independently.

• If the block is empty (that is, all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order.

• If the block is full (that is, no empty pixels), nothing is done.

• If the block has both empty and filled pixels, then the empty pixels are iteratively filled with the average value of their non-empty neighbors.

[00113] As illustrated in block 210, the apparatus further includes means, such as the processing circuitry 12, for video compression. The apparatus is configured to compress the padded texture images and padded geometry images into geometry video and texture video. The generated images/layers are stored as video frames and compressed using a video codec. One example video codec used is the HEVC Test Model 16 (HM16) video codec according to the HM16 configurations provided as parameters. As illustrated in block 212, the apparatus includes means, such as the processing circuitry 12, for auxiliary patch- info compression. The following metadata may be encoded for every patch:

• Index of the projection plane

o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)

o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)

o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0).

• 2D bounding box (uO, vO, ul, vl)

• 3D location (xO, yO, zO) of the patch represented in terms of depth 80, tangential shift sO and bi-tangential shift rO. According to the chosen projection planes, (80, sO, rO) are computed as follows:

o Index 0, 80= xO, s0=z0 and rO = yO

o Index 1 , 80= yO, s0=z0 and rO = xO

o Index 2, 80= zO, s0=x0 and rO = yO

[00114] Mapping information providing for each TxT block its associated patch index may be encoded as follows:

• For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

• The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

• Let I be index of the patch to which belongs the current TxT block and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded instead, which leads to better compression efficiency.

[00115] As illustrated in block 214, the apparatus may further include means, such as the processing circuitry 12, for compressing occupancy map. As illustrated in block 216, the apparatus further includes means, such as the processing circuitry 12, for multiplexing the compressed geometry video, compressed texture video, compressed auxiliary patch information, and compressed occupancy map into a point cloud compression (PCC) coded bitstream. A compressed point cloud is comprised of: an encoded video bitstream corresponding to the texture of the 3D object; an encoded video bitstream corresponding to the geometry of the 3D object; and an encoded metadata bitstream corresponding to all other auxiliary information which is necessary during the composition of 3D object. These three information streams are multiplexed together and later utilized together in order to decode 3D object point cloud data at a given instance in time. Figure 3 provides a graphical example of the information.

[00116] The multiplexing representation is at bitstream level and some features need to be implemented, such as: 1) time-wise random access and calculation of exact byte location for such random access; 2) compact representation of metadata when it is repeated for group of frames; 3) file storage mechanism and delivery via mechanisms such as ISOBMFF and MPEG-DASH; and 4) Alternative representations of video and auxiliary data streams for delivery channel adaptation (quality, resolution, frame-rate, bitrate, etc.). Details regarding the multiplexing, storage and signaling mechanisms of the PCC coded bitstream are later described in conjunction with reference to Figures 4 to 7.

[00117] Referring now to Figure 4, the operations performed, such as by the apparatus 10 of Figure 1, in order to store the PCC coded bitstream in accordance with an example embodiment are depicted. As shown in block 40, the apparatus includes means, such as the processing circuitry 12, for accessing the point cloud compression coded bitstream. The point cloud compression coded bitstream includes a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream. As shown in block 42, the apparatus includes means, such as the processing circuitry 12, for causing storage of the point cloud compression coded bitstream. Details regarding block 42 are later described in conjunction with Figures 5 to 9.

[00118] Referring now to Figure 5, the operations performed, such as by the apparatus 10 of Figure 1, in order to cause storage of the point cloud compression coded bitstream in accordance with an example embodiment are depicted. In this embodiment, each sub component of PCC encoded bitstream (e.g., texture, geometry and auxiliary metadata) is stored as a separate track (2 video tracks and one timed metadata track). These tracks are grouped together in order to indicate a logical relationship between them. This grouping may be done using the track grouping mechanism by defining a new group type or entity grouping mechanism of ISOBMFF at a movie or file level. Alternatively or additionally, a track referencing mechanism may be utilized by the track reference definition in ISOBMFF in order to indicate the type of relationship between the tracks which comprise a PCC coded 3D representation. Representation-level global metadata may be stored at the track level, movie level or file level using either relevant track containers such as sample entries, sample auxiliary information or metadata containers. Each group of frames (GoF) as defined in the PCC specification working draft may be considered as a random access granularity structure. Hence, each PCC metadata track’s first sample may be a sync sample indicating a random access location.

[00119] As shown in block 50, the apparatus includes means, such as the processing circuitry 12, for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 52, the apparatus includes means, such as the processing circuitry 12, for storing the texture information bitstream in a texture video media track. As shown in block 54, the apparatus includes means, such as the processing circuitry 12, for storing the geometry information bitstream in a geometry video media track. As shown in block 56, the apparatus includes means, such as the processing circuitry 12, for storing the auxiliary metadata information bitstream in a timed metadata track. As shown in block 58, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for associating the texture video media track, the geometry video media track, and the timed metadata track as one group and for storing the texture video media track, the geometry video media track, and the timed metadata track in one group. The timed metadata track may be referred as a PCC metadata track. The term PCC track may refer to the texture video media track, the geometry video media track, or the PCC metadata track.

[00120] In an embodiment related to ISOBMFF, a brand may be defined to indicate that the file contains PCC encoded bitstreams. A possible four-character code for the brand could be, but is not limited to‘pccf . The brand may be included in a FileTypeBox and/or may be used in signalling, such as within the value for the optional profiles MIME parameter. The brand may be present in a major, minor or compatible brands list. In an embodiment, the brand may be present in the track level, e.g., in a TrackTypeBox, if a track level brand storage mechanism is present. In an embodiment, the brand may be present in the group level, e.g., in the EntityToGroupBox that groups the texture, geometry, and PCC metadata tracks, if a group level brand storage mechanism is present. A new media handler type for the PCC metadata track may be defined as“pcch” and stored in the handler_type field of the HandlerBox (“hdlr”) in the MediaBox of the PCC metadata track. The structure as defined in ISOBMFF could be as follows

aligned(8) class HandlerBox extends FullBox(‘hdlr’, version = 0, 0) {

unsigned int(32) pre_defined = 0;

unsigned int(32) handler type;

const unsigned int(32)[3] reserved = 0;

string name;

} wherein handler type defines the media handler type for PCC metadata track. This field may contain the 4-character code“pcch”. In the syntax structure above as well as in other example syntax structures, syntax elements may be assigned to have a pre-defined value that the file author is required to obey. For example, in the syntax structure above, a file writer is required to set the value of the syntax element pre_defined to 0. Similarly, variables or input parameters for functions, such as version in the syntax above, may have predefined values.

For example, the value of the version parameter is required to be equal to 0 in the syntax above. However, such assignments to predefined values may not be necessary for other embodiments, and certain embodiments can be similarly realized with other predefined values or without the syntax elements, variables, or input parameters assigned to predefined values.

In one embodiment, PCC global bitstream information, such as but not limited to global scaling, translation and rotation, could be stored in a separate box/full box. Such a box may be called a PCCInfoBox with a 4-character code“pcci”. An example is provided as follows: aligned(8) class PCCInfoBox extends FullBox(‘pcci’, version = 0, 0) {

unsigned int(32)[3] global_scaling;

unsigned int(32)[3] globaljxanslation;

unsigned int(32)[4] global_rotation;

unsigned int(8) number_of_layers;

}

[00121] global_scaling and global_translation are defined as independent values matching each global axis x, y, and z. global_rotation is represented as a quaternion (w,x,y,z). The benefit of using quaternions instead of Euler angles is that quaternions are more univocal as they do not depend on the order of Euler rotations. Also, quaternions are not vulnerable to “gimbal locking”. With global_scaling, global translation and global rotation, a

transformation matrix can be generated to arbitrarily transform point cloud data in space. The benefit of providing these values separately instead of a holistic transformation matrix is developer convenience and the possibility to ignore certain components of transformation.

The number_of_layers indicates the total number of layers.

[00122] A syntax element PCCIInfoBox may be present in the Handler Box of the PCC metadata track or the Media Header Box of the PCC metadata track. These boxes may be extended as a new box or extended with a new full box version or extended with a flag definition. In order to include the PCCIInfoBox, example implementations of this new data structure could be as follows: aligned(8) class HandlerBox extends FullBox(‘hdlr’, version = 0, 0) {

unsigned int(32) pre defined = 0;

unsigned int(32) handler_type;

const unsigned int(32)[3] reserved = 0;

string name;

if (handler_type == "pcch")

PCCInfoBox();

}

aligned(8) class HandlerBox extends FullBox(‘hdlr’, version, 0) {

unsigned int(32) pre defined = 0;

unsigned int(32) handler_type;

const unsigned int(32)[3] reserved = 0;

string name;

if (version == 1)

PCCInfoBox();

aligned(8) class HandlerBox extends FullBox(‘hdlr , version = 0, flags) {

unsigned int(32) pre defined = 0;

unsigned int(32) handler_type;

const unsigned int(32)[3] reserved = 0;

string name;

if (flags & 1)

PCCInfoBox();

}

[00123] In the first syntax structure above, the presence of PCCInfoBox is conditioned by the handler type being "pcch". In the second syntax structure above, the presence of

PCCInfoBox is conditioned by the version field of the box header being equal to 1. In the third syntax structure above, the presence of PCCInfoBox is conditioned by the flags field of the box header having a particular bit equal to 1.

[00124] In the syntax structure above and in other example syntax structures, "&" may be defined as a bit-wise "and" operation. When operating on (signed) integer arguments, the bit wise "and" operation operates on a two's complement representation of the integer value. When operating on a binary argument that contains fewer bits than another argument, the shorter argument is extended by adding more significant bits equal to 0.

[00125] In another embodiment, the PCCInfoBox contents may be stored in the PCC metadata sample entry. A new sample entry type of“pcse” may be defined. An example could be as follows:

class PCCMetadataSampleEntryO extends MetaDataSampleEntry(‘pcse’) {

PCCInfoBox();

Box[] other boxes; // optional

} [00126] In another embodiment, the PCC bitstream global information may be stored in the file/movie/track level MetaBox, utilizing the ItemlnfoBox. The item which contains the PCC global bitstream information may utilize its own MIME type such as but not limited to‘pcci’ which is stored in the item_type field of the ItemInfoEntry() box. Additionally, a new entity grouping may be defined in EntityToGroupBox() contained in a MetaBox. A new Entity grouping may be defined such as‘pcgr’. PCC metadata tracks, video and texture media tracks and related metadata items may be grouped together inside the defined‘pcgr’ grouping. An example can be represented as follows: Texture video track id = 1; Geometry video track id = 2; PCC metadata track id = 3; and PCC bitstream global information stored as a metadata item with item_id =101.

[00127] A new Entity Grouping may be defined with the following syntax: aligned(8) class EntityToGroupBox(grouping_type, version, flags)

extends FullBox(grouping_type, version, flags) {

unsigned int(32) group_id;

unsigned int(32) num_entities_in_group;

for(i=0; i<num_entities_in_group; i++)

unsigned int(32) entity_id;

// the remaining data may be specified for a particular grouping type

}

[00128] grouping_type identifies the new entity grouping type. A new grouping_type four- character code such as but not limited to‘pcgr’ may be defined for the entity grouping group id represents an identifier for the group in the form of a numerical value.

num_entities_in_group represents the number of entities in group, which is set to 4 in this example (texture track, geometry video track, PCC metadata track, and PCC bitstream global information stored as a metadata item) listed entity_ids is a list of the entity ids for the entities which in this example is [1, 2, 3, 101]

[00129] In one embodiment, the order of these entities may be predefined such as (texture, geometry, pcc metadata, pcc global information} . The order may be any defined order. In case there is no order definition, the handler types of the tracks and item types may be used to identify the types of the grouped entities.

[00130] In another embodiment, an EntitytoGroupBox box may be extended to store the PCClnfoBox. The syntax may be as follows: aligned(8) class EntityToGroupBox(grouping_type, version, flags)

extends FullBox(grouping_type, version, flags) {

unsigned int(32) group_id;

unsigned int(32) num_entities_in_group;

for(i=0; i<num_entities_in_group; i++)

unsigned int(32) entity_id;

// the remaining data may be specified for a particular grouping type

if (grouping_type ==‘pcgr’)

PCCInfoBox()

}

[00131] In such a case, there may be no need to define an item type of‘pcci’ in order to store the PCC global information.

[00132] In order to indicate the grouping relationship of PCC tracks, a track grouping may be defined. Such a track grouping information may be stored in each related media and timed metadata track or a subset of such tracks. Track Grouping may be defined using track grouping mechanism as follows:

aligned(8) class TrackGroupBox extends Box('trgr') {

}

aligned(8) class TrackGroupTypeBox(unsigned int(32) track group typc) extends FullBox(track_group_type, version = 0, flags = 0)

{

unsigned int(32) track_group_id;

// the remaining data may be specified for a particular track group typc

}

[00133] A new track_group_type such as but not limited to‘pctg’ may be defined for the PCC track grouping. In another embodiment, PCC global information may be embedded inside the TrackGroupTypeBox as follows:

{

unsigned int(32) track_group_id;

// the remaining data may be specified for a particular track_group_type

if (track_group_type ⁼⁼‘pctg’)

PCCInfoBoxQ

[00134] Media and time metadata tracks which have the same {track_group_type, track group id} pair could belong to the same PCC 3D representation.

[00135] In some embodiments, in addition to or instead of the track grouping mechanism and/or the entity grouping mechanism described above, as shown in block 59, a track referencing mechanism may be utilized to indicate the PCC track relationships. As shown in block 59, the apparatus includes means, such as the processing circuitry 12, for associating the auxiliary timed metadata track with the track group defined before by using track referencing. TrackReferenceBox with the following ISOBMFF-defined syntax may be utilized as follows: aligned(8) class TrackReferenceBox extends Box('tref ) {

TrackReferenceTypeBox [];

}

aligned(8) class TrackReferenceTypeBox (unsigned int(32) reference type) extends Box(reference type) {

unsigned int(32) track_IDs[];

}

[00136] The below-used 4-character codes below are listed as examples. Any set of the listed track references or alike may be present in a file storing PCC tracks.

A PCC texture track may reference the PCC metadata track with a reference type of ‘pcct’.

A PCC geometry track may reference the PCC metadata track or a PCC texture track with a reference_type of‘pccg’ .

A PCC metadata track may reference the PCC texture and geometry tracks with a reference_type of‘pccm’. In such a case, both the texture and the geometry track ids could be listed in the TrackReferenceBox with type‘pccm’ of the PCC metadata track.

[00137] In some alternative embodiments, a texture or geometry track may reference multiple metadata tracks if such tracks are represented by the same PCC metadata track but differ in certain characteristics such as bitrate, resolution, video quality, etc. Track alternatives may also be derived using the ISOBMFF-provided alternative track mechanisms. For example, the same altemate group value of TrackHeaderBox could be used for tracks that represent the same data with different characteristics.

[00138] In some alternative embodiments, a PCC metadata track may have two different track references pointing to the texture and geometry video tracks, such as a

TrackReferenceTypeBox with‘pcct’ reference_type to point to the texture video track and a TrackReferenceTypeBox with‘pccg’ reference type to point to the geometry video track.

[00139] In another embodiment, video and geometry tracks may be packed to a single video frame. In such a case, the packed video frames which are stored in a video track may have a TrackReferenceTypeBox with‘peep” reference type to point to the PCC metadata track or vice-versa (e.g., a PCC metadata track containing a TrackReferemceTypeBox with ‘peep’ reference_type and pointing to the PCC packed video track.)

[00140] In another embodiment, the texture and geometry tracks may be grouped using track grouping type of (but not limited to)“petg” with a particular track_group_id. The auxiliary metadata of PCC may be stored as a timed metadata track with a dedicated handler type such as (but not limited to)“pccm” and reference to the track group id by utilizing the “cdtg” track reference type. This may indicate the PCC auxiliary timed metadata describes the texture and geometry tracks collectively. [00141] In some embodiments, a PCC metadata track may contain PCC auxiliary information samples in the following methods. In one embodiment, the first sample of each

PCC Group of Frames (GOF) may contain the GOF header information and the auxiliary metadata bitstream corresponding to the texture and geometry samples of the same composition time. In another embodiment, such samples may be marked as“sync” samples, e.g., by utilizing the sync sample box (a.k.a. SyncSampleBox). In such a case, random access could be possible at each GOF boundary. All other PCC metadata samples may contain the corresponding PCC auxiliary metadata of a texture of geometry sample in their corresponding texture and geometry tracks. In another embodiment, each PCC metadata sample may have a corresponding sample auxiliary information in the sample table box which may store the GOF header bitstream. This auxiliary sample information may be defined only for the sync samples of PCC metadata track or every sample of the PCC metadata track. In another embodiment,

GOF Header information may be stored in the PCC metadata track sample entries. Each GOF may then refer to the sample entry index which contains the related GOF Header bitstream information. A possible data structure could be as follows:

class PCCMetaDataSampleEntry() extends MctaDataSamplcEntryCpcsc’) {

PCCInfoBox(); //may be present optionally

GOFHeaderBox() ;

Box[] other boxes; // optional

}

[00142] GOFHeaderBox may have the following syntax:

aligned(8) class GOFHeaderBox extends FullBox(‘gofh’, version = 0, 0) {

unsigned int(8) groupofFramesSize;

unsigned int(16) frame Width;

unsigned int( 16) frameHeight;

unsigned int(8) occupancyResolution;

unsigned int(8) radiusToSmoothing;

unsigned int(8) neighborCountSmoothing;

unsigned int(8) radius2BoundaryDetection;

unsigned int(8) thresholdSmoothing;

}

[00143] groupOfFramesSize indicates the number of frames in the current group of frames. frameWidth indicates the geometry or texture picture width. frameHeight indicates the geometry/Texture picture height. occupancyResolution indicates the occupancy map resolution. radiusToSmoothing indicates the radius to detect neighbours for smoothing.

neighborCountSmoothing indicates the maximum number of neighbours used for smoothing. radius2BoundaryDetection indicates the radius for boundary point detection.

thresholdSmoothing indicates the smoothing threshold.

[00144] In some embodiments, a separate timed metadata track may be defined to store the GOF header samples. In such a scenario, such a metadata track may have a separate handler type (e.g.,‘pcgh’) and a track reference to the PCC metadata track (e.g., of reference type ‘pcgr’). If such a track exists and track grouping is used to group texture, geometry and metadata tracks, then such a metadata track may also be listed in the track grouping.

[00145] In some embodiments, metadata samples that belong to a GOF may be grouped together in order to form sample groups and their related GOF header information may be stored in the SampleGroupDescriptionEntry in the SampleGroupDescriptionBox. One example syntax is provided below:

class PCCMetadataSampleGroupEntryO extends SampleGroupDescriptionEntry('pmsg') { GOFFIeaderBox()

}

[00146] In some embodiments, the 'pmsg' sample group or alike is present in the track containing the dynamic PCC metadata. In some embodiments, the 'pmsg' sample group or alike is present in any of the video tracks (e.g. the texture video track or the geometry video track).

[00147] In some cases, e.g. when encoding and streaming live feeds, it may be necessary to include new GOF headers in the bitstream after creating the initial MovieBox. In such cases, new GOF headers can be introduced in the SampleGroupDescriptionBox included in the MovieFragmentBox. It is remarked that introducing GOF headers in the

SampleGroupDescriptionBox included in the MovieFragmentBox may be used in but is not limited to encoding and streaming of live feeds.

[00148] A PCCMetadata sample may contain GOF Header information, GOF Auxiliary information as well as GOF occupancy map information. In some embodiments, the GOF Header information, GOF Auxiliary information as well as GOF occupancy map information is defined as their corresponding parts in the PCC Category 2 coding Working Draft. The GOF Header information was provided above. Example GOF auxiliary information includes the following:

[00149]

[00150] Referring now to Figure 6, the operations performed, such as by the apparatus 10 of Figure 1, in order to cause storage of the point cloud compression coded bitstream in accordance with an example embodiment are depicted. In this embodiment, to accommodate the situation where only one single video decoder instance is available at an instance of time, there is only one single video bitstream fed to the video decoder. [00151] In some embodiments, a spatial frame packing approach may be utilized to put texture and geometry video frames into a single video frame. A frame packing arrangement SEI message or another SEI message with similar content as the frame packing arrangement SEI message may be utilized to signal such packing in a video bitstream level. A restricted video scheme may be utilized to signal such packing at the file format level.

[00152] An alternative may be to perform a region-wise packing and create a“packed PCC frame”. Such an approach may enable different geometric transformations such as rotation, scaling, etc. to be applied on the texture and geometry frames. A new type of“region- wise packing” restricted scheme type may be defined and signaled in the SchemelnformationBox in ISOBMFF.

[00153] Another alternative may be utilizing a temporal interleaving mechanism for texture and geometry frames and making sure that the prediction dependencies of the frames are only from the same type of frames (e.g., either texture or geometry). In another embodiment, such a restriction may be relaxed. Such an encoding configuration may be signaled in the

SchemelnformationBox of the media track with a new type definition. The sample composition times for such interleaved frames would be the composition time of the second frame of the {texture, geometry} pair. Moreover, an indicator field may be defined to signal the type of the first video frame.

[00154] In another embodiment, a group of video frames of the same type may be consecutive to each other. In such a case, the sample composition times may be signaled on the consecutive frames of the same type and then applied on the same number of frames adjacent to the group of frames in decoding order (e.g., sample timings of N texture frames are applied to the next N geometry frames, which are then followed by N texture frames and so on). A new temporal packing type indicator may be defined to signal such an interleaving scheme.

[00155] As shown in block 60, the apparatus includes means, such as the processing circuitry 12, for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 62, the apparatus includes means, such as the processing circuitry 12, for packing the texture information bitstream and the geometry information bitstream into one video media track. As shown in block 64, the apparatus includes means, such as the processing circuitry 12, for storing the auxiliary metadata information in a timed metadata track. As shown in block 66, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for associating the video media track and the timed metadata track as one group and for storing the video media track and the time metadata track as one group. [00156] With regard to packing the texture information bitstream and the geometry information bitstream into one video media track, in some embodiments, texture and geometry video frames in the bitstream may be spatially packed to a single video frame. A new restricted video scheme may be defined for PCC encoded packed texture and geometry video frames. Such a scheme may be identified by a scheme type 4-character code of‘ppvs’ inside the RestrictedSchemelnfoBox. A new uniform resource identifier (URI) may be defined as the scheme uri and may be conditionally present depending on box flags. The packed PCC video frame information may be then contained in the SchemelnformationBox of the related video track. Such a box may be called a PCCPackedVideoBox with a 4-char code of‘ppvi’ . An example of such a box may be as follows:

aligned(8) class SchemeTypeBox extends FullBox('schm', 0, flags) {

// 4CC identifying the scheme which may be called‘ppvs’

unsigned int(32) scheme type;

unsigned int(32) scheme version; // scheme version

if (flags & 0x000001) {

utfKstring scheme_uri; // browser uri

}

[00157] The following box may reside inside the Scheme Information Box (‘schi’) and contain PCC packed video specific information:

aligned(8) class PCCPackedVideoBox extends FullBox(‘ppvi’, version = 0, 0)

{

PCCSpatialPackingBox()

Box[] any box; // optional

}

[00158] A box called PCCSpatialPackingBox() may define how texture and geometry patches could be spatially packed in a video frame, an example is provided below:

aligned(8) class PCCSpatialPackingBox extends FullBox('rwpk', 0, 0) {

PCCSpatialPackingStruct();

}

[00159] The PCCSpatialPackingStruct() may contain the data structure needed to identify each packed region’s type, dimensions and spatial location. Such a box may have the following structure:

aligned(8) class PCCSpatialPackingStruct() {

unsigned int(8) num_regions;

unsigned int(32) proj_texture_picture_width;

unsigned int(32) proj_texture_picture_height;

unsigned int(32) proj_geometry_picture_width;

unsigned int(32) proj_geometry_picture_height;

unsigned int(l6) packed_picture_width;

unsigned int(l6) packed_picture_height;

for (i = 0; i < num_regions; i++) {

bit(2) reserved = 0;

unsigned int(l) region type; // 0=texture; l=geometry unsigned int(l) guard_band_flag[i];

unsigned int(4) packing_type[i] ;

if (packing_type[i] == 0) {

RectRegionPacking(i) ;

if (guard_band_flag[i])

GuardBand(i);

}

[00160] num_regions defines how many packed regions are defined in this data structure. proj_texture_picture_width defines the width of the input video texture frame.

proj_texture_picture_height defines the height of the input video texture frame

proj gco mctry pi cturc width defines the width of the input geometry texture frame prqj gcomctry picturc hcight defines the height of the input geometry texture frame.

packed_picture_width defines the width of the output packed video frame.

packed_picture_height defines the height of the output packed video frame.

[00161] region_type indicates the type of the packed region region type = 0 may indicate the source of this region to be a texture video frame. region_type = 1 may indicate that the source of this region is a geometry video frame.

[00162] guard_band_flag indicates whether the reason has guard bands or not.

packing type indicates the type of region-wise packing. RectRegionPacking specifies the region- wise packing between packed region and the projected region.

[00163] In another embodiment, a new projection type such as“orthonormal projection” may be added to the ProjectionFormatBox as defined in the MPEG OMAF specification, which resides inside the Projected Omnidirectional Video (‘podv’) scheme’s

ProjectedOmnivideoBox(‘povd’). In such a case, a PCCPackedVideoBox may reside inside the ProjectedOmnivideoBox and its presence may be signaled by the projection type which will indicate an orthonormal projection.

[00164] In some embodiments, texture and geometry video frames are temporally interleaved to form a single video bitstream. In such a case, a video prediction scheme may be defined so that texture frames are only dependent on the other texture frames and geometry frames are also only dependent on the other geometry frames. The frame rate may be doubled. The composition time of the latter frame may be utilized to indicate the composition time of the temporally interleaved frames. In such a case, a new box called PCCVideoBox may be defined to indicate the temporal packing information. Such a box may reside inside the SchemelnformationBox, when the schemeType may be defined as‘pcci’ in order to indicate temporally interleaved PCC texture and geometry frames. An example is provided below: aligned(8) class PCCInterleavedVideoBox extends FullBox(‘pcci\ version = 0, 0) { unsigned int(8) packing_indication_type

}

[00165] packing indication type indicate the type of temporal or spatial packing. Example packing types and values associated are provided below:

0: spatial packing indicator. This value indicates that a spatial packing scheme is applied and the above-mentioned spatial packing mechanism is utilized on the texture and geometry PCC video frames.

1 : temporally interleaved, texture frame comes first

2: temporally interleaved, geometry frame comes first

[00166] In both temporally interleaved and spatially packed use cases, the track referencing mechanism between the auxiliary metadata track and the texture and geometry track described in conjunction with Figure 5 may apply ln such a case, a single track reference type; which may be called‘pctg’ may be used from/to a PCC auxiliary metadata track to/from a packed/interleaved PCC texture and geometry video track.

[00167] Referring now to Figure 7, the operations performed, such as by the apparatus 10 of Figure 1, in order to cause storage of the point cloud compression coded bitstream in accordance with an example embodiment are depicted. This embodiment is applied to still scenes by only encoding a single frame (e.g. a single time instance) of texture, geometry and auxiliary metadata ln an embodiment using HEVC, single video frames can be encoded as image items as defined in HEIF standard and be stored in the MetaBox of the file.

[00168] As shown in block 70, the apparatus includes means, such as the processing circuitry 12, for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 72, the apparatus includes means, such as the processing circuitry 12, for extracting one or more texture information frames, one or more geometry information frames, and one or more auxiliary metadata frames from the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 74, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for storing the one or more texture information frames, the one or more geometry information frames, and the one or more auxiliary metadata frames as one or more image items.

[00169] In some embodiments using HEVC, in order to support such a mechanism, some data structures as described below may be defined.

[00170] A descriptive item property called PCCPackedFrameProperty may be defined.

This property may contain PCCSpatialPackingBox() as defined above. Such a property is then linked to the image item which contains the packed texture and geometry image item. One example syntax is provided below:

aligned(8) class PCCPackedFrameProperty

extends ItemFullProperty(‘ppfp, 0, 0) {

PCCSpatialPackingBox ()

[00171] A descriptive item property called PCCAuxiliaryProperty(‘pcap’) may be defined to include the PCC global parameters and GOF header information. One example syntax is provided below:

aligned(8) class PCCAuxiliaryProperty

extends ItemFullProperty(‘pcap’, 0, 0) {

PCCInfoBox();

GOFHeaderBox();

}

[00172] A new item type which is called PCC metadata item may be defined to contain the PCC auxiliary metadata information, which is referred to as a PCC auxiliary metadata sample in above sections. This item may be defined as type‘pcca’.

[00173] If no packing is defined or if the original PCC encoded geometry and texture images are also stored in the same media file, the following item properties may be defined:

o A descriptive item property called PCCTextureItemProperty(‘pcti’) may be defined to indicate the PCC encoded texture image item. This property may then be linked to the relevant image item.

o A descriptive item property called PCCGeometryItemProperty(‘pcgi^,) may be defined to indicate the PCC encoded geometry image item. This property may then be linked to the relevant image item.

o The above mentioned descriptive item property called PCCAuxiliaryProperty(‘pcap’) may be used to store the global header and GOF header information.

o The above-mentioned PCC metadata item and the two image items with respective item texture and geometry item properties may be grouped together using an Entity Grouping mechanism to indicate that they create a PCC encoded 3D representation. Such an entity grouping may have a type of‘pcce’ in the EntitytoGroupBox() where all the relevant item ids of texture, geometry images and PCC auxiliary metadata item may be listed to indicate a group.

[00174] In another embodiment, PCCInfoBox, GoFHeaderBox and optionally the contents of the PCC auxiliary metadata information may be listed in the EntityToGroupBox with grouping type=‘pcce’. An example of such a data structure is as follows: aligned(8) class EntityToGroupBox(grouping_type=‘pcce‘, version, flags) extends FullBox(grouping_type, version, flags) {

unsigned int(32) group_id;

unsigned int(32) num entities in group;

for(i=0; i<num_entities_in_group; i++)

unsigned int(32) entity id;

PCCInfoBox()//optional

GoFHeaderBox()//optional

PCCMetadataSample() //optional

}

[00175] In another embodiment, PCCInfoBox or alike is included in the

EntityToGroupBox with grouping type equal to 'pcce' (or some other four-character code indicating PCC) as above. GofHeaderBox or alike may be included in the file by any means described in other embodiments, such as including it in a sample group description entry. Similarly, PCC auxiliary metadata information may be included in the file by any means described in other embodiments.

[00176] In another embodiment, all or parts of PCC metadata may be stored in and signalled as SEI messages and carried as part of the video bitstream. Such information may contain the PCC auxiliary sample fields mapped to different or single SEI message and transmitted in conjunction with the texture and geometry elementary bitstreams. In such a case, PCC auxiliary metadata SEI messages may reside in one of the texture or geometry elementary streams and not repeated in both. In this embodiment, there may be no need to store the PCC auxiliary information or metadata information as separate properties or metadata image items.

[00177] In another embodiment, new NAL units may be defined to contain the PCC auxiliary metadata information. Such NAL units may then also be stored in the ISOBMFF compliant format as part of the video samples. Again, in such a case, such NAL units may be stored as part of either texture or geometry video tracks or as part of the packed or interleaved texture and geometry track. In this embodiment, there may be no need to store the PCC auxiliary information or metadata information as separate properties or metadata image items.

[00178] In another embodiment, NAL units containing PCC auxiliary metadata information and video NAL units may be combined into a single PCC stream and stored in a PCC track where each PCC sample is comprised of one or more texture, geometry and auxiliary information NAL units. In such a case, a single media track or image item may be present in the PCC media file.

[00179] Referring now to Figure 8A, the operations performed, such as by the apparatus 10 of Figure 1, in order to cause storage of the point cloud compression coded bitstream in accordance with an example embodiment are depicted. This embodiment is applied to an instance where the geometry, texture, occupancy-map and dynamic time-based metadata follow a compatible coding and timing structure. For instance, a subset of the geometry, texture, occupancy-map and dynamic time-based metadata may follow the same coding structure. Additionally, the geometry, texture, occupancy-map and dynamic time-based metadata may follow the same ti ing structure. In such an instance, a single track architecture may be utilized. A multiplexer may be able to multiplex the bitstreams such that coded access-units of geometry, texture, occupancy-map, and dynamic metadata required for reconstructing a point-cloud at one time-instant are adjacent to each other. The order of the collection of access-units may vary between implementations. This multiplexed bitstream is then encapsulated into an mdat box. A single track can consider the consecutive access-units (geometry, texture, occupancy-map, and dynamic metadata) of the multiplexed bitstream as a single PCC media sample. Indication of the sizes of the individual access-units can be signaled using the sub-sample information box that is tailored specifically for a PCC media sample.

[00180] As shown in block 80, the apparatus includes means, such as the processing circuitry 12, for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 82, the apparatus includes means, such as the processing circuitry 12, for detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure. As shown in block 84, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped. As shown in block 86, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for causing construction and storage of a video media track, wherein the video media track comprises the multiplexed one or more coded access units. Figure 8B provides a graphical representation of an example track used to signal a multiplexed PCC bitstream.

[00181] In such a single track design, the video and timed metadata access-units may be interleaved. A coded geometry, texture, and timed-metadata of one frame follows another geometry, texture and timed-metadata of the another frame in sequence. A sample may be a coded geometry, texture, occupancy map and timed-metadata that is required to be decoded for presentation at one time-instant. Indication of sizes of each of the coded sub-parts of an individual sample can be signalled in either the sub-sample information box or stored as sample auxiliary information and referenced by the sample auxiliary information sizes box. [00182] In some embodiments using HEVC, in order to support such a mechanism, some data structures as described below may be defined.

class PCCConfigurationBox extends Box('pccC') {

HEV CDecoderConfigurationRecord() HEV CT extureConfig;

HEV CDecoderConfigurationRecord() HEVCGeometryConfig;

HEV CDecoderConfigurationRecord() HEV COccupancyConfig;

PCCInfoBox(); //may be present optionally

GOFHeaderBox(); //may be present optionally

}

[00183] HEVCDecoderConfigurationRecord() may be as defined in the ISO/IEC 14496- 15: Information technology— Coding of audio-visual objects— Part 15: Advanced Video Coding (AVC) file format, AMENDMENT 2: Carriage of high efficiency video coding (HEVC). In another embodiment, different codec configurations could be used together. For example, AVC and HEVC codecs could be utilized for different components and then listed together in the PCCConfigurationBox.

[00184] An example HEVCDecoderConfigurationRecord() is provided below:

aligned(8) class HEVCDecoderConfigurationRecord {

unsigned int(8) configurationVersion = 1;

unsigned int(2) general_profile_space;

unsigned int(l) general_tier_flag;

unsigned int(5) general_profile_idc;

unsigned int(32) gcneral profilc compatibility flags;

unsigned int(48) general_constraint_indicator_flags;

unsigned int(8) general_level_idc;

bit(4) reserved = M i l l’b;

unsigned int(l2) min_spatial_segmentation_idc;

bit(6) reserved =‘111111’b;

unsigned int(2) parallelismType;

bit(6) reserved =‘111111’b;

unsigned int(2) chromaFormat;

bit(5) reserved =‘11111’b;

unsigned int(3) bitDepthLumaMinus8;

bit(5) reserved =‘11111’b;

unsigned int(3) bitDepthChromaMinus8;

bit( 16) avgFrameRate;

bit(2) constantFrameRate;

bit(3) numTemporalLayers;

bit(l) temporal! dNested;

unsigned int(2) lengthSizeMinusOne;

unsigned int(8) numOfArrays;

for (j=0; j < numOfArrays; j++) {

bit(l) array_completeness;

unsigned int(l) reserved = 0;

unsigned int(6) NAL_unit_type;

unsigned int(l6) numNalus;

for (i=0; i< numNalus; i++) {

unsigned int(16) nalUnitLength; bit(8*nal Unit Length) nalUnit;

}

[00185] The HEVCDecoderConfigurationRecord() includes the size of the length field used in each sample to indicate the length of its contained NAL units as well as the parameter sets, if stored in the sample entry. The HEVCDecoderConfigurationRecord() is externally framed (its size must be supplied by the structure which contains it).

[00186] The values for general_profile_space, general_tier_flag, general_profile_idc, general_profile_compatibility_flags, general_constraint_indicator_flags, general_level_idc, and min_spatial_segmentation_idc may be valid for all parameter sets that are activated when the stream described by this record is decoded (referred to as "all the parameter sets" in the following sentences in this paragraph). Specifically, the following restrictions apply:

[00187] The value of general_profile_space in all the parameter sets may be identical. The tier indication general tier flag indicates a tier equal to or greater than the highest tier indicated in all the parameter sets. The profile indication general_profile_idc may indicate a profile to which the stream associated with this configuration record conforms.

[00188] Explicit indication may be provided in the HEVC Decoder Configuration Record about the chroma format and bit depth as well as other important format information used by the HEVC video elementary stream. Each type of such information is identical in all parameter sets, if present, in a single HEVC configuration record. If two sequences differ in any type of such information, two different HEVC configuration records are used. If the two sequences differ in color space indications in their VUI information, then two different configuration records are also required. There may be a set of arrays to carry initialization NAL units. The NAL unit types are restricted to indicate SPS, PPS, VPS, and SEI NAL units only.

[00189] In some embodiments, different codec configurations could be used together. For example, AVC and HEVC codecs could be utilized for different components and then listed together in the PCCConfigurationBox. In an embodiment, PCCConfigurationBox may contain a subset of the above-mentioned decoder. A PCC sample may then be defined by the PCCSampleEntry as defined below:

class PCCSampleEntryO extends VisualSampleEntry ('pecs') {

PCCConfigurationBoxconfig;

MPEG4BitRateBox (); // optional

MPEG4ExtensionDescriptorsBox (); // optional

Box extra boxes []; // optional

} [00190] The MPEG4BitRateBox signals bitrate information. The

MPEG4ExtensionDescriptorsBox includes extension descriptors.

[00191] In some embodiments, when a particular track PCC media component is stored in a separate track, the track grouping and referencing mechanisms previously described in conjunction with Figure 5 may apply.

[00192] Referring now to Figure 9 A, the operations performed, such as by the apparatus 10 of Figure 1, in order to cause storage of the point cloud compression coded bitstream in accordance with an example embodiment are depicted. This embodiment is applied to an instance for a fragmented ISOBMFF file. Each PCC media component is carried in a separate media track. PCC media components may be interleaved in any particular sample count and order. The smallest granularity could be interleaving of 1 PCC media component after another. Figure 9B provides a graphical representation of an example interleaving samples where texture (T), geometry (G), occupancy (O) and auxiliary (A) metadata samples from different tracks are interleaved inside the mdat box.

[00193] As shown in block 90, the apparatus includes means, such as the processing circuitry 12, for accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream. As shown in block 92, the apparatus includes means, such as the processing circuitry 12, for storing the texture information bitstream in a texture video media track. As shown in block 94, the apparatus includes means, such as the processing circuitry 12, for storing the geometry information bitstream in a geometry video media track. As shown in block 96, the apparatus includes means, such as the processing circuitry 12, for storing the auxiliary metadata information bitstream in a timed metadata track. As shown in block 98, the apparatus includes means, such as the processing circuitry 12, the memory 14 or the like, for causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

[00194] An example moof box is provided below:

InterleavedMediaIndicatorBox(‘imin’) // this is a new box present at moof or moov level traf {

trun(flag=0x001000) traf {

trun(flag=0x001000)

}

traf {

trun(flag=OxOO 1000) }

traf {

trun(flag=0x001000)

}

[00195] The newly defined InterleavedMediaIndicatorBox() may enable signalling of the following properties:

A media interleaving pattern in the mdat box which follows the same order as the ‘traf boxes in the moof box. The indicated traf boxes belong to the listed track IDs and may be listed consecutive to each other.

The trun boxes may be associated with example tr_flags=0x00l000. This indicates that the sample_size field inside the trun boxes indicates the sample sizes of represented samples of this track based on the interleaving pattern indicated by InterleavedMediaIndicatorBox(), rather than uninterrupted consecutive run of samples.

[00196] An example InterleavedMediaIndicatorBox() is provided below:

aligned(8) class InterleavedMedialndicatorBox extends FullBoxf imirf , version = 0, 0) { unsigned int(l) use default interleaving;

unsigned int(7) number_of_consecutive_samples;

if (use_default_interleaving == 0) {

unsigned int(8) track id count;

{

unsigned int(32) track ID;

} [track id count]

}

[00197] InterleavedMedialndicatorBox may be present in the“moov” or“moof’. If it is present in the“moov”, it defines a default interleaving strategy. use_default_interleaving, when equals to 1 indicates that the media interleaving pattern which is defined in“moov” box may be used number of consecutive samples indicates how many samples of the same track are consecutively present. If use_default_interleaving is 1, then this field takes the value 0. Otherwise a value greater than 0 may be present. track_id_count indicates the number of track IDs listed in the array. track ID is an integer providing the track identifier for which random access information is provided.

[00198] The‘trun’ box may be as defined in the ISOBMFF standard:

aligned(8) class TrackRunBox

extends FullBox('trun', version, tr flags) {

unsigned int(32) sample count;

// the following are optional fields

signed int(32) data offset; unsigned int(32) first_sample_flags;

// all fields in the following array are optional

// as indicated by bits set in the tr_flags

{

unsigned int(32) sample_duration;

unsigned int(32) sample_size;

unsigned int(32) sample_flags

if (version == 0)

{ unsigned int(32) sample composition time offset; }

else

{ signed int(32) sample composition time offset; }

} [ sample_count ]

}

[00199] The sample composition time offset of Track Run Box is presentation time of accesss unit in media processing unit. This value is offset from the start of MPU. Start presentation time of media processing unit is specified by Composition Information.

[00200] In some embodiments, the interleaving strategy may be defined using other means such as pre-defined enumerations or patterns which are signalled out of the band or in the file in a pre-defined box. In some embodiments, in case a single track is used to carry the PCC media data, SubSamplelnformationBox may be used to access each of the interleaved PCC components inside the PCC media sample. A new SampleGroupDescriptionBox may be defined to indicate the order and configuration of each of the PCC components.

[00201] In the above, where the example embodiments have been described with reference to an encoder, it needs to be understood that the resulting bitstream and the decoder may have corresponding elements configured to perform corresponding functions. For example, as illustrated in Figure 10, at block 1000, a decoder may comprise means, such as the processing circuitry 12, for receiving a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream. As illustrated in block 1002, the decoder may further comprise means, such as the processing circuitry 12, for decoding the point cloud compression coded bitstream.

[00202] In the above, where the example embodiments have been described with reference to syntax and semantics, certain embodiments likewise cover an encoder that outputs a bitstream portion according to the syntax and semantics. Likewise, certain embodiments correspondingly cover a decoder that decodes a bitstream portion according to the syntax and semantics.

[00203] An example embodiment of the invention described above describes the codec in terms of separate encoder and decoder apparatus in order to assist the understanding of the processes involved. However, it would be appreciated that the apparatus, structures and operations may be implemented as a single encoder-decoder apparatus/structure/operation. Furthermore, it is possible that the coder and decoder may share some or all common elements.

[00204] Although the above examples describe certain embodiments performed by a codec within an apparatus, it would be appreciated that other embodiments may be implemented as part of any video codec. Thus, for example, certain embodiments may be implemented in a video codec which may implement video coding over fixed or wired communication paths.

[00205] As described above, Figures 4-10 includes flowcharts of an apparatus 10, method, and computer program product according to certain example embodiments. It will be understood that each block of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described above may be embodied by computer program instructions. In this regard, the computer program instructions which embody the procedures described above may be stored by a memory device 14 of an apparatus employing an embodiment of the present invention and executed by processing circuitry 12 of the apparatus. As will be appreciated, any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowchart blocks. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowchart blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart blocks.

[00206] A computer program product is therefore defined in those instances in which the computer program instructions, such as computer-readable program code portions, are stored by at least one non-transitory computer-readable storage medium with the computer program instructions, such as the computer-readable program code portions, being configured, upon execution, to perform the functions described above, such as in conjunction with the flowcharts of Figures 4-10. In other embodiments, the computer program instructions, such as the computer-readable program code portions, need not be stored or otherwise embodied by a non-transitory computer-readable storage medium, but may, instead, be embodied by a transitory medium with the computer program instructions, such as the computer-readable program code portions, still being configured, upon execution, to perform the functions described above.

[00207] Accordingly, blocks of the flowcharts support combinations of means for performing the specified functions and combinations of operations for performing the specified functions for performing the specified functions. It will also be understood that one or more blocks of the flowcharts, and combinations of blocks in the flowcharts, may be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.

[00208] In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included, such as represented by the blocks outlined in dashed lines in Figures 2-4.

Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.

[00209] Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

THAT WHICH IS CLAIMED:

1. A method comprising:

accessing a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream; and

causing storage of the point cloud compression coded bitstream.

2. A method according to Claim 1 wherein causing storage of the point cloud compression coded bitstream comprises:

accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream;

detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure;

multiplexing one or more coded access units of the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream such that geometry, texture, occupancy-map, and dynamic metadata required for reconstructing one frame are grouped; and

causing construction and storage of a video media track, wherein the video media track comprises the multiplexed one or more coded access units.

3. A method according to Claim 1 wherein causing storage of the point cloud compression coded bitstream comprises:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;

storing the auxiliary metadata information bitstream in a timed metadata track; and causing construction and storage of an interleaved indicator metadata file indicating an interleaving pattern for one or more components of the texture video media track, the geometry video media track, and the timed metadata track.

4. An apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:

access a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream; and

cause storage of the point cloud compression coded bitstream.

5. An apparatus according to Claim 4 wherein causing storage of the point cloud compression coded bitstream comprises:

6. An apparatus according to Claim 4 wherein causing storage of the point cloud compression coded bitstream comprises:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;

7. A computer program product comprises at least one non-transitory computer- readable storage medium having computer executable program code instructions stored therein, the computer executable program code instructions comprising program code instructions configured, upon execution, to: access a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream; and

cause storage of the point cloud compression coded bitstream.

8. A computer program product according to Claim 7 wherein causing storage of the point cloud compression coded bitstream comprises:

9. A computer program product according to Claim 7 wherein causing storage of the point cloud compression coded bitstream comprises:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;

10. A method comprising :

receiving a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream; and

decoding the point cloud compression coded bitstream.

11. A method according to Claim 10 wherein the point cloud compression coded bitstream is structured by at least:

12. A method according to Claim 10 wherein the point cloud compression coded bitstream is structured by at least:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;

13. An apparatus comprising at least one processor and at least one memory including computer program code for one or more programs, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:

receive a point cloud compression coded bitstream, wherein the point cloud compression coded bitstream comprises a texture information bitstream, a geometry information bitstream, and an auxiliary metadata bitstream; and

decode the point cloud compression coded bitstream.

14. An apparatus according to Claim 13 wherein the point cloud compression coded bitstream is structured by at least: accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream;

15. An apparatus according to Claim 13 wherein the point cloud compression coded bitstream is structured by at least:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;

16. A computer program product comprises at least one non-transitory computer- readable storage medium having computer executable program code instructions stored therein, the computer executable program code instructions comprising program code instructions configured, upon execution, to:

decode the point cloud compression coded bitstream.

17. A computer program product according to Claim 16 wherein the point cloud compression coded bitstream is structured by at least:

accessing the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream; detecting that the texture information bitstream, the geometry information bitstream, and the auxiliary metadata bitstream follow a compatible coding and timing structure;

18. A computer program product according to Claim 16 wherein the point cloud compression coded bitstream is structured by at least:

storing the texture information bitstream in a texture video media track;

storing the geometry information bitstream in a geometry video media track;