CN117581551A - Method, apparatus and computer program for dynamically encapsulating media content data - Google Patents


Info

Publication number
CN117581551A
Authority
CN
China
Prior art keywords
box
track
sample
media data
media
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280046511.XA
Other languages
Chinese (zh)
Inventor
Frédéric Maze
Franck Denoual
Jean Le Feuvre
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB2113874.8A external-priority patent/GB2608469A/en
Application filed by Canon Inc filed Critical Canon Inc
Priority claimed from PCT/EP2022/067359 external-priority patent/WO2023274877A1/en
Publication of CN117581551A publication Critical patent/CN117581551A/en


Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

At least one embodiment relates to a method for encapsulating media data, the method comprising: encapsulating a portion of the media data, or a set of information items related to the portion of media data, as entities in a media file, the entities being grouped into a set of entities associated with a first indication representing a parameter, wherein the media file comprises a second indication signaling to a client that the set of entities is to be parsed only if the client has knowledge of the first indication.

Description

Method, apparatus and computer program for dynamically encapsulating media content data
Technical Field
The present invention relates to a method, an apparatus, and a computer program for improving the encapsulation and parsing of media data, enabling improved handling of different configurations and organizations of encapsulated media data.
Background
The ISO Base Media File Format (ISOBMFF, ISO/IEC 14496-12, defined by the International Organization for Standardization) is a well-known, flexible, and extensible file format that describes encoded timed or non-timed media data or bitstreams for local storage or for transmission via a network or another bitstream delivery mechanism. This file format has several extensions, e.g., Part 15 (ISO/IEC 14496-15), which describes encapsulation tools for various NAL (Network Abstraction Layer) unit based video coding formats. Examples of such coding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding), L-HEVC (Layered HEVC), and VVC (Versatile Video Coding). Another example of a file format extension is ISO/IEC 23008-12, which describes encapsulation tools for still images or sequences of still images, such as HEVC still images. Yet another example is ISO/IEC 23090-2, which defines the Omnidirectional Media Format (OMAF). Further examples are ISO/IEC 23090-10 and ISO/IEC 23090-18, which define the carriage of Visual Volumetric Video-based Coding (V3C) media data and Geometry-based Point Cloud Compression (G-PCC) media data.
This file format is object-oriented. It is composed of building blocks called boxes (i.e., data structures, each identified by a four-character code, also noted FourCC or 4CC). A full box is a data structure similar to a box, further comprising version and flags value attributes. In the following, the term box may designate either a full box or a box. These boxes or full boxes are organized sequentially or hierarchically. They define parameters describing the encoded timed or non-timed media data or bitstream, its structure, and the associated timing, if any. In the following, encapsulated media data designates encapsulated data comprising both metadata and media data (the latter designating the encapsulated bitstream). All data in an encapsulated media file (media data and metadata describing the media data) is contained in boxes; there is no other data within the file. File-level boxes are boxes that are not contained in other boxes.
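The box layout described above can be made concrete with a short sketch (illustrative Python, not part of the patent; the helper names are invented for illustration): a box starts with a 32-bit big-endian size and a four-character code, and a full box adds a version byte and 24 bits of flags.

```python
import struct

def read_box_header(buf, offset):
    """Read an ISOBMFF box header: a 32-bit big-endian size then a 4CC type.

    Returns (size, fourcc, header_length). A size of 1 means the actual
    size follows as a 64-bit "largesize" field; a size of 0 means the box
    extends to the end of the file.
    """
    size, = struct.unpack_from(">I", buf, offset)
    fourcc = buf[offset + 4:offset + 8].decode("ascii")
    if size == 1:
        size, = struct.unpack_from(">Q", buf, offset + 8)
        return size, fourcc, 16
    return size, fourcc, 8

def read_full_box_extra(buf, offset):
    """A full box adds an 8-bit version and 24-bit flags after the header."""
    version = buf[offset]
    flags = int.from_bytes(buf[offset + 1:offset + 4], "big")
    return version, flags

# A hand-built 16-byte "ftyp" box: size, type, major_brand, minor_version.
ftyp = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0)
size, fourcc, header_len = read_box_header(ftyp, 0)  # (16, "ftyp", 8)
```

This is only the generic container syntax; the semantics of each box type are defined by ISO/IEC 14496-12 and its extensions.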
According to this file format, the overall presentation (or session) over time is called a movie. The movie is described in a top-level MovieBox (identified by the four-character code "moov") of the media or presentation file. The MovieBox represents an initialization information container comprising a set of various boxes describing the presentation. It may be logically subdivided into tracks, represented by TrackBoxes (identified by the four-character code "trak"). Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media data pertaining to the presentation (e.g., a sequence of video frames or a sequence of sub-parts of video frames). Within each track, each timed unit of media data is called a sample; such a timed unit may be a video frame, a sub-part of a video frame, an audio sample, or a set of timed metadata. Samples are implicitly numbered in increasing decoding order. Each TrackBox contains a hierarchy of boxes describing the samples of the corresponding track. In this hierarchy of boxes, the SampleTableBox (identified by the four-character code "stbl") contains all the time and data indexing entries for the media samples of the track. In particular, the "stbl" box contains a SampleDescriptionBox (identified by the four-character code "stsd") comprising a set of sample entries, each sample entry giving the required information about the coding configuration of the media data in a sample (including the identification of the coding type of the encoding format and various coding parameters characterizing the encoding format), as well as any initialization information needed for decoding a sample. The actual sample data are stored in boxes called MediaDataBoxes (identified by the four-character code "mdat") or IdentifiedMediaDataBoxes (identified by the four-character code "imda", similar to the MediaDataBox but containing an additional identifier).
The MediaDataBoxes and IdentifiedMediaDataBoxes are located at the same level as the MovieBox.
The movie may also be fragmented, i.e. organized in time as a MovieBox containing information for the whole presentation, followed by a list of movie fragments, i.e. a list of pairs comprising a MovieFragmentBox (identified by the four-character code "moof") and a MediaDataBox ("mdat"), or a list of pairs comprising a MovieFragmentBox ("moof") and an IdentifiedMediaDataBox ("imda").
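The fragmented layout just described can be sketched as a flat sequence of file-level boxes (illustrative Python, not part of the patent; the helper names are invented for illustration): a "moov" followed by alternating "moof"/"mdat" pairs.

```python
import struct

def box(fourcc, payload=b""):
    """Build a minimal box: 32-bit size, 4CC type, payload."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload

def top_level_boxes(data):
    """Yield (fourcc, payload) for each file-level box, in file order."""
    off = 0
    while off < len(data):
        size, fourcc = struct.unpack_from(">I4s", data, off)
        yield fourcc.decode("ascii"), data[off + 8:off + size]
        off += size

# A fragmented presentation: one 'moov' followed by ('moof', 'mdat') pairs.
stream = (box(b"ftyp") + box(b"moov") +
          box(b"moof") + box(b"mdat", b"samples 1..N") +
          box(b"moof") + box(b"mdat", b"samples N+1..N+M"))
order = [t for t, _ in top_level_boxes(stream)]
# order == ["ftyp", "moov", "moof", "mdat", "moof", "mdat"]
```

Each "moof" carries the metadata describing the samples stored in the "mdat" that follows it, mirroring the pairs shown in FIG. 1.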
FIG. 1 illustrates an example of encapsulated media data temporally organized as a fragmented presentation within one or more media files, according to the ISO Base Media File Format.
The media data encapsulated in one or more media files 100 begins with a FileTypeBox ("ftyp") box (not represented), which provides a set of brands identifying the precise specifications to which the encapsulated media data conforms; the reader uses these brands to determine whether it is able to process the encapsulated media data. The "ftyp" box is followed by a MovieBox ("moov") box, labeled 105. The MovieBox provides the initialization information that the reader needs to start processing the encapsulated media data. In particular, it provides a description of the content of the presentation, the number of tracks, and information about their respective timelines and characteristics. For the sake of illustration, the MovieBox may indicate that the presentation comprises one track with a track identifier track_ID equal to 1.
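The brand-based compatibility check mentioned above can be sketched as follows (illustrative Python, not part of the patent; the helper names are invented for illustration): a reader parses the FileTypeBox payload and proceeds only if it supports at least one of the listed brands.

```python
import struct

def parse_ftyp(payload):
    """Parse a FileTypeBox payload: major_brand, minor_version, compatible_brands."""
    major = payload[0:4].decode("ascii")
    minor, = struct.unpack_from(">I", payload, 4)
    compatible = [payload[i:i + 4].decode("ascii")
                  for i in range(8, len(payload), 4)]
    return major, minor, compatible

def reader_can_process(compatible_brands, supported_brands):
    """A reader may proceed if it supports at least one listed brand."""
    return any(b in supported_brands for b in compatible_brands)

# Payload of an 'ftyp' box: major 'isom', minor 0, two compatible brands.
payload = b"isom" + struct.pack(">I", 0) + b"isom" + b"iso6"
major, minor, brands = parse_ftyp(payload)
```

A reader supporting the "iso6" brand would accept this file, while a reader supporting only an unrelated brand would reject it.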
As illustrated, MovieBox 105 is followed by one or more movie fragments (also referred to as media segments), each comprising metadata stored in a MovieFragmentBox ("moof") box and media data stored in a MediaDataBox ("mdat") box. For the sake of illustration, the one or more media files 100 comprise a first movie fragment containing and describing samples 1 to N of the track identified with track_ID equal to 1. This first movie fragment is composed of "moof" box 110 and "mdat" box 115. Still for the sake of illustration, the one or more media files 100 comprise a second movie fragment containing and describing samples N+1 to N+M of the track identified with track_ID equal to 1. This second movie fragment is composed of "moof" box 120 and "mdat" box 125.
When the encapsulated media data is fragmented into multiple files, the FileTypeBox and MovieBox boxes are contained in an initial media file (also referred to as the initialization segment), in which the tracks contain no samples. The subsequent media files (also referred to as media segments) contain one or more movie fragments.
Among other information, "moov" box 105 may contain a MovieExtendsBox ("mvex") box 130. When present, the information contained in this box warns the reader that there may be subsequent movie fragments and that these movie fragments must be found and scanned in the given order to obtain all the samples of a track. To that end, the information contained in this box should be combined with the other information of the MovieBox. MovieExtendsBox 130 may contain an optional MovieExtendsHeaderBox ("mehd") box and one TrackExtendsBox ("trex") box per track defined in MovieBox 105. When present, the MovieExtendsHeaderBox provides the overall duration of the fragmented movie. Each TrackExtendsBox defines default parameter values used by the associated track within the movie fragments.
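The role of the "trex" defaults can be sketched as a simple precedence chain (illustrative Python, not part of the patent; a simplification of the ISO/IEC 14496-12 defaulting rules): an explicit per-sample value in "trun" wins, then a fragment-wide default in "tfhd", then the track-wide default declared in "trex".

```python
def resolve_sample_duration(trex_default, tfhd_default=None, trun_value=None):
    """Resolve one sample's duration inside a movie fragment.

    Precedence: explicit 'trun' value > 'tfhd' fragment default >
    'trex' track-wide default.
    """
    if trun_value is not None:
        return trun_value
    if tfhd_default is not None:
        return tfhd_default
    return trex_default
```

The same precedence applies to other per-sample parameters such as size and flags, which is why the "trex" box must exist for every track that appears in movie fragments.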
As illustrated, "moov" box 105 also comprises one or more TrackBox ("trak") boxes 135 describing the different tracks of the presentation. TrackBox 135 contains, in its box hierarchy, a SampleTableBox ("stbl") box that in turn contains the descriptive and timing information of the media samples of the track. In particular, this box contains a SampleDescriptionBox ("stsd") box with one or more SampleEntry boxes giving descriptive information about the coding format of the samples (the coding format being identified by a 4CC, as indicated with the "xxxx" characters) and the initialization information needed for configuring a decoder according to this coding format.
For example, a SampleEntry box whose four-character type is set to "vvc1" or "vvi1" signals that the associated samples contain media data encoded according to the Versatile Video Coding (VVC) format, while a SampleEntry box whose four-character type is set to "hvc1" or "hev1" signals that the associated samples contain media data encoded according to the High Efficiency Video Coding (HEVC) format. A SampleEntry box may contain other boxes containing information that applies to all samples associated with this SampleEntry box.
A sample is associated with a SampleEntry box via the sample_description_index parameter, which is located in the SampleToChunkBox ("stsc") box within the SampleTableBox ("stbl") box when the media file is a non-fragmented media file, or in the TrackFragmentHeaderBox ("tfhd") box within the TrackFragmentBox ("traf") box of the MovieFragmentBox ("moof") box, or in the TrackExtendsBox ("trex") box within the MovieExtendsBox, when the media file is fragmented.
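For the non-fragmented case, the "stsc" lookup can be sketched as follows (illustrative Python, not part of the patent; the helper names are invented for illustration): each entry applies from its first_chunk up to, but excluding, the next entry's first_chunk, so the last matching entry gives the sample_description_index.

```python
# Each 'stsc' entry: (first_chunk, samples_per_chunk, sample_description_index).
def sample_entry_index_for_chunk(stsc_entries, chunk_number):
    """Return the sample_description_index applying to a given chunk."""
    applicable = None
    for first_chunk, _, desc_index in stsc_entries:
        if first_chunk <= chunk_number:
            applicable = desc_index
        else:
            break
    return applicable

stsc = [(1, 10, 1),  # chunks 1..4 use sample entry 1
        (5, 10, 2)]  # chunks 5 onward use sample entry 2 (e.g. a new config)
```

A change of sample entry mid-track (e.g., a codec reconfiguration) thus shows up as a new "stsc" run pointing at a different index into the "stsd" box.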
According to the ISO Base Media File Format, all the tracks and all the sample entries of a presentation are defined in "moov" box 105 and cannot be declared later on during the presentation.
It is observed that a movie fragment may contain samples for one or more of the tracks declared in the "moov" box, but not necessarily for all of them. MovieFragmentBox 110 or 120 contains a TrackFragmentBox ("traf") box comprising a TrackFragmentHeaderBox ("tfhd") box (not represented) that provides an identifier (e.g., track_ID = 1) identifying the track whose samples are contained in "mdat" box 115 or 125 of the movie fragment. Among other information, the "traf" box contains one or more TrackRunBox ("trun") boxes documenting contiguous runs of samples of that track within the movie fragment.
An ISOBMFF file or segment may contain multiple sets of encoded timed media data (also denoted bitstreams or streams), or sub-parts of sets of encoded timed media data (also denoted sub-bitstreams or sub-streams), forming multiple tracks. When a sub-part corresponds to one or several successive spatial portions of a video source taken over time (e.g., at least one rectangular region, also known as a "tile" or a "sub-picture", taken over time), the corresponding multiple tracks may be called tile tracks or sub-picture tracks.
It is also noted that ISOBMFF and its extensions comprise several grouping mechanisms to group together tracks, static items, or samples, and to associate a group description with a group. A group typically shares common semantics and/or characteristics. For instance, MovieBox 105 and/or MovieFragmentBoxes 110 and 120 may contain sample groups that associate properties with groups of samples of a track. A sample group, characterized by a grouping type, may be defined by two linked boxes: a SampleToGroupBox ("sbgp") box representing the assignment of samples to sample groups, and a SampleGroupDescriptionBox ("sgpd") box containing, for each sample group, a sample group entry describing the properties of the group.
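The "sbgp"/"sgpd" pairing can be sketched as a run-length mapping (illustrative Python, not part of the patent; the helper names are invented for illustration): each "sbgp" run assigns a count of consecutive samples to an index into the matching "sgpd" box, with index 0 meaning "not in any group of this grouping type".

```python
# 'sbgp' run-length entries: (sample_count, group_description_index).
def group_description_for_sample(sbgp_entries, sample_number):
    """Return the 'sgpd' entry index applying to a 1-based sample number."""
    cursor = 0
    for sample_count, group_index in sbgp_entries:
        cursor += sample_count
        if sample_number <= cursor:
            return group_index
    return 0  # beyond the mapped range: no group

sbgp = [(3, 1),  # samples 1..3 -> group entry 1
        (2, 0),  # samples 4..5 -> no group
        (4, 2)]  # samples 6..9 -> group entry 2
```

The properties themselves (e.g., random-access or corruption information) live in the "sgpd" entries; the "sbgp" box only carries the assignment.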
While these ISOBMFF mechanisms have proven to be efficient, there are some limitations regarding dynamic session support for fragmented ISOBMFF files.
An example comes from the core definitions of a presentation, which can only be made in the initial MovieBox and cannot be updated later on during the presentation. Another example relates to signaling aimed at improving dynamic session support.
Therefore, some features of the encapsulation mechanisms, which should be understood by a reader in order to parse and decode the encapsulated media data, need to be signaled to help the reader select the data to process.
Disclosure of Invention
The present invention is designed to solve one or more of the aforementioned problems.
According to a first aspect of the invention, there is provided a method for encapsulating media data, the encapsulated media data comprising a metadata part associated with the media data and media segments, the method being carried out by a server and comprising:
obtaining a first portion of media data, the obtained first portion being organized into a first set of one or more tracks of media data,
encapsulating metadata describing the one or more tracks of the first set of one or more tracks in the metadata part, and encapsulating the first portion of media data in one or more media segments,
obtaining a second portion of media data, the obtained second portion being organized into a second set of one or more tracks of media data, and
if at least one track of the second set of one or more tracks is different from the one or more tracks of the first set, encapsulating the second portion of media data and metadata describing the at least one track of the second set of one or more tracks in one media segment.
The method of the invention thus makes it possible to encapsulate media data dynamically, without requiring a full description of the media data before the encapsulation of the media data starts.
According to some embodiments, the method further comprises:
obtaining a third portion of media data, the obtained third portion being organized into a third set of one or more tracks of media data, and
if the at least one track of the second set of one or more tracks belongs to the third set of one or more tracks of media data, encapsulating the third portion of media data and metadata describing the at least one track of the second set of one or more tracks in one media segment.
According to some embodiments, encapsulating the third portion of media data and metadata describing the at least one track of the second set of one or more tracks in one media segment comprises copying the metadata describing the at least one track of the second set of one or more tracks from the media segment encapsulating the second portion to the media segment encapsulating the third portion.
According to some embodiments, the metadata part comprises an indication signaling that a media segment may comprise a track different from the one or more tracks of the first set of tracks.
According to some embodiments, the metadata of the media segment encapsulating the second portion of media data comprise an indication signaling that this media segment comprises a track different from the one or more tracks of the first set of tracks.
According to some embodiments, the first portion of media data being encoded according to at least one first encoding configuration defined in a set of one or more first sample entries, the method further comprises:
encapsulating metadata describing the one or more first sample entries in the metadata part,
obtaining a fourth portion of media data, the fourth portion of media data being encoded according to at least one second encoding configuration, and
if the at least one second encoding configuration is different from the encoding configurations defined in the set of one or more first sample entries, encapsulating the fourth portion of media data and metadata describing at least one second sample entry defining the at least one second encoding configuration in one media segment.
According to some embodiments, the fourth portion corresponds to the second portion.
According to some embodiments, the metadata part comprises an indication signaling that a media segment may comprise a sample entry different from the sample entries of the set of one or more first sample entries, or the metadata of the media segment encapsulating the fourth portion of media data comprise an indication signaling that these metadata comprise a sample entry different from the sample entries of the set of one or more first sample entries.
According to some embodiments, at least one track of the second set of one or more tracks comprises a reference to at least one other track, the at least one other track being described in the metadata part or in the metadata of a media segment.
According to a second aspect of the invention, there is provided a method for parsing encapsulated media data, the encapsulated media data comprising a metadata part associated with the media data and media segments, the method being carried out by a client and comprising:
obtaining, from the metadata part, metadata describing one or more tracks of a first set of one or more tracks,
obtaining a media segment, referred to as a first media segment,
parsing the first media segment to obtain metadata, and
if the metadata obtained from the first media segment describe at least one track different from the one or more tracks of the first set, parsing the first media segment to obtain a portion of media data organized into a second set of one or more tracks of media data comprising the at least one track.
The method of the invention thus makes it possible to parse encapsulated media data that has been encapsulated dynamically, i.e. without a full description of the media data having been available before the encapsulation started.
According to some embodiments, the method further comprises:
obtaining a second media segment,
parsing the second media segment to obtain metadata, and
if the metadata of the second media segment comprise the same description of at least one track different from the one or more tracks of the first set as the metadata of the first media segment, parsing the second media segment to obtain a portion of media data, the media data obtained from the first portion of media data and from the second portion of media data belonging to the same at least one track.
According to some embodiments, the method further comprises obtaining an indication from the metadata part, the indication signaling that a media segment may comprise a track different from the one or more tracks of the first set of tracks.
According to some embodiments, the method further comprises obtaining an indication from the metadata of an obtained media segment, the indication signaling that the media segment comprising the indication comprises a track different from the one or more tracks of the first set of tracks.
According to some embodiments, the method further comprises:
a third media segment is obtained and,
parsing the third media segment to obtain metadata
If the metadata of the third media segment includes metadata describing at least one sample entry defining at least one encoding configuration, parsing at least a portion of the media data of the third media segment according to the at least one encoding configuration to obtain media data.
According to some embodiments, the third portion corresponds to the first portion.
According to some embodiments, the method further comprises obtaining, from the metadata part, an indication signaling that a media segment may describe a sample entry different from the sample entries described in the metadata part, or obtaining, from the metadata of an obtained media segment, an indication signaling that the media segment comprising the indication describes a sample entry different from the sample entries described in the metadata part.
According to a third aspect of the present invention, there is provided a method for encapsulating media data, the method being performed by a server and comprising:
identifying a portion of media data, or a set of information items related to a portion of media data, as a function of a parameter independent of the encapsulation, and
encapsulating the portion of media data or the set of information items as entities in a media file, the entities being grouped into a set of entities associated with a first indication representing the parameter,
wherein the media file comprises a second indication signaling to a client that the set of entities is to be parsed only if the client has knowledge of the first indication.
Thus, the method of the invention makes it possible to signal some features of the encapsulation mechanisms that should be understood by a reader in order to parse and decode the encapsulated media data, helping the reader select the data to process.
According to a fourth aspect of the present invention, there is provided a method for parsing encapsulated media data, the method being performed by a client and comprising:
determining that the encapsulated media data comprise a second indication signaling that a set of entities is to be parsed only if the client has knowledge of a first indication associated with the set of entities to be parsed,
obtaining a reference to a set of entities to be parsed,
obtaining the first indication associated with the set of entities for which the reference has been obtained, and
if the client has no knowledge of the obtained first indication associated with the set of entities for which the reference has been obtained, ignoring this set of entities when parsing the encapsulated media data.
Thus, the method of the invention makes it possible for a reader to understand some features of the encapsulation mechanisms used to generate the encapsulated media data, to parse and decode the encapsulated media data, and to select the data to process.
According to some embodiments, the second indication also signals that the media data is to be rendered only when the client has knowledge about the first indication.
According to some embodiments, the entities are samples or supplemental information describing the media data. The sets of entities may be, for example, sample groups, entity groups, tracks, sample entries, and the like.
According to some embodiments, the entity is a sample and the set of entities is a set of corrupted samples.
According to other aspects of the invention, there is provided a processing device comprising a processing unit configured to carry out each step of the methods described above. These other aspects of the present disclosure have optional features and advantages similar to those of the first, second, third, and fourth aspects described above.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", a "module", or a "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer-readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device, or a solid-state memory device, and the like. A transient carrier medium may include a signal such as an electrical, electronic, optical, acoustic, magnetic, or electromagnetic signal, e.g., a microwave or RF signal.
Drawings
Embodiments of the present invention will now be described, by way of example only, with reference to the following drawings in which:
FIG. 1 illustrates an example of the structure of fragmented media data encapsulated according to the ISO Base Media File Format;
FIG. 2 illustrates an example of a system in which some embodiments of the invention may be implemented;
FIG. 3 illustrates an example of a fragmented presentation encapsulated in one or more media files, wherein new tracks are defined in movie fragments, according to some embodiments of the invention;
FIG. 4 illustrates an example of a fragmented presentation encapsulated in one or more media files, wherein sample entries and/or tracks are defined in movie fragments, according to some embodiments of the invention;
FIG. 5 is a block diagram illustrating an example of steps performed by a server or writer to encapsulate encoded media data according to some embodiments of the invention;
FIG. 6 is a block diagram illustrating an example of steps carried out by a client or reader to process encapsulated media data, according to some embodiments of the invention;
FIG. 7 is a block diagram illustrating an example of steps performed by a client or reader to obtain data according to some embodiments of the invention; and
fig. 8 schematically illustrates a processing device configured to implement at least one embodiment of the invention.
Detailed Description
According to some embodiments of the invention, tracks and/or sample entries may be signaled dynamically within movie fragments, without being described beforehand within the initialization segment.
Fig. 2 illustrates an example of a system in which some embodiments of the invention may be implemented.
As shown, the server or writer labeled 200 is connected to the communication network 230 via a network interface (not shown), and the communication network 230 is also connected to the client or reader 250 via a network interface (not shown) so that the server or writer 200 and the client or reader 250 can exchange media files labeled 225 via the communication network 230.
According to other embodiments, the server or writer 200 may exchange the media file 225 with the client or reader 250 via a storage device (e.g., a storage device labeled 240). Such a storage device may be, for example, a memory module (e.g., random Access Memory (RAM)), a hard disk, a solid state drive, or any removable digital medium (e.g., a disk or memory card).
According to the illustrated example, the server or writer 200 is intended to process media data (e.g., media data labeled 205), such as video data, audio data, and/or descriptive metadata, for streaming or storage purposes. To this end, the server or writer 200 obtains or receives media content (e.g., timing sequences of one or more images, timing sequences of audio samples, or timing sequences of descriptive metadata) comprising initial media data or bitstream 205, and encodes the obtained media data into encoded media data, labeled 215, using an encoder module, labeled 210 (e.g., a video encoder or an audio encoder). The server or writer 200 then encapsulates the encoded media data into one or more media files, labeled 225, containing the encapsulated media data using an encapsulation module, labeled 220. According to the illustrated example, the server or writer 200 includes at least one encapsulation module 220 to encapsulate the encoded media data. The encoder module 210 may be implemented within the server or writer 200 to encode the received media data or may be separate from the server or writer 200. The encoder module 210 is optional because the server or writer 200 may encapsulate media data previously encoded in a different device or may encapsulate raw media data.
The encapsulation module 220 may generate a media file or multiple media files. The media file or files correspond to encapsulated media data and/or successive segments of encapsulated media data containing alternate versions of the media data.
Still according to the illustrated example, a client or reader 250 is used to process the encapsulated media data in order to display or output the media data to a user.
As shown, a client or reader 250 obtains or receives one or more media files, such as media file 225, via a communication network 230 or from a storage device 240. Upon obtaining or receiving the media file, the client or reader 250 parses and decapsulates the media file using a decapsulation module labeled 260 to retrieve the encoded media data labeled 265. The client or reader 250 then decodes the encoded media data 265, using a decoder module, labeled 270, to obtain media data, labeled 275, representing audio and/or video content (signals) that may be processed by the client or reader 250 (e.g., drawn or displayed to a user by a dedicated module, not shown). Note that the decoder module 270 may be implemented within the client or reader 250 to decode the encoded media data, or may be separate from the client or reader 250. The decoder module 270 is optional because the client or reader 250 may receive media files corresponding to the encapsulated original media data.
Note here that the media file or files (e.g., media file 225) may be communicated to the decapsulation module 260 of the client or reader 250 in different ways. For example, the media file may be generated in advance by the encapsulation module 220 of the server or writer 200 and stored as data in a remote storage device in communication network 230 (e.g., on a server or a cloud storage) or in a local storage device such as storage device 240, until a user requests the media file from the remote or local storage device. Upon such a request, the data are read, communicated, or streamed from the storage device to the decapsulation module 260.
The server or writer 200 may also include a content providing device for providing or streaming content information to a user that points to media files stored in a storage device (e.g., the content information may be described via a manifest file (e.g., a Media Presentation Description (MPD) compliant with the ISO/IEC MPEG-DASH standard, or an HTTP Live Streaming (HLS) manifest) that includes, for example, a title of the content and other descriptive metadata and storage location data for identifying, selecting, and requesting the media files). The content providing device may also be adapted to receive and process a user's request for a media file to be transferred or streamed from the storage device to the client or reader 250. Alternatively, the server or writer 200 may use the encapsulation module 220 to generate a media file or media files and communicate or stream the media files directly to the client or reader 250 and/or the decapsulation module 260 when the user requests content.
The user may access audio/video media data (signals) through a user interface of a user terminal including a client or reader 250 or a user terminal having a device in communication with the client or reader 250. Such a user terminal may be a computer, a mobile phone, a tablet computer or any other type of device capable of providing/displaying media data to a user.
To illustrate, a media file or media files, such as media file 225, represent encoded media data (e.g., one or more timed sequences of encoded audio or video data) encapsulated into boxes according to the ISO Base Media File Format (ISOBMFF, ISO/IEC 14496-12 and ISO/IEC 14496-15 standards). The media file or files may correspond to a single media file (prefixed by a FileTypeBox "ftyp" box) or to an initialization segment file (prefixed by a FileTypeBox "ftyp" box) followed by one or more media segment files (possibly prefixed by a SegmentTypeBox "styp" box). According to ISOBMFF, a media file (and a segment file, if present) may include two kinds of boxes: "media data" boxes ("mdat" or "imda") containing the encoded media data, and "metadata" boxes (the "moov", "moof", or "meta" box hierarchy) containing metadata defining the placement and timing of the encoded media data.
An encoder or decoder module (labeled 210 and 270, respectively, in fig. 2) encodes and decodes image or video content using an image or video standard. For example, image or video encoding/decoding (codec) standards include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 (ISO/IEC MPEG-2 Visual), ITU-T H.263 (ISO/IEC MPEG-4 Visual), ITU-T H.264 (ISO/IEC MPEG-4 AVC) (including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions), ITU-T H.265 (HEVC) (including its scalable (SHVC) and multiview (MV-HEVC) extensions), or ITU-T H.266 (ISO/IEC MPEG-I Versatile Video Coding (VVC)). The techniques and systems described herein may also be applicable to other coding standards, whether already available or not yet available or developed.
FIG. 3 illustrates an example of a fragmented presentation encapsulated in one or more media files, where new tracks are defined in a movie fragment according to some embodiments of the invention.
According to a particular embodiment, the ISOBMFF is extended to allow new tracks to be defined in a movie fragment (also referred to as a media segment) that were not previously defined in an initialization segment (e.g., in the MovieBox ("moov") box 300). Such a new track may be referred to as a "dynamic track", i.e., a track that appears during the presentation and does not exist at the beginning of the presentation (e.g., is not declared in the MovieBox box). A dynamic track may have the duration of one movie fragment or may span several movie fragments. In contrast to dynamic tracks, the tracks declared in the MovieBox box may be referred to as "static tracks".
MovieBox box 300 provides a description of a presentation initially made up of at least one track defined by a TrackBox ("trak") box 305 having a track identifier, e.g., track identifier track_id equal to 1, the TrackBox box 305 indicating a sample sequence having a coding format described by the sample entry shown with 4cc "xxxx".
According to the illustrated example, the MovieBox box 300 is followed by two movie fragments (or media segments). The first movie fragment consists of a MovieFragmentBox ("moof") box labeled 310 containing metadata and a MediaDataBox ("mdat") box 320 containing media data (or encoded media data). More precisely, as shown, this first movie fragment contains N samples (samples 1 through N) stored in the "mdat" box 320, which belong to a track fragment described within a TrackFragmentBox ("traf") box, having a track identifier track_ID equal to 1, in the MovieFragmentBox ("moof") box labeled 310. Samples 1 through N are associated with the sample entry shown with 4CC "xxxx" via a parameter denoted sample_description_index in the TrackFragmentHeaderBox ("tfhd") box or, by default, via a parameter denoted default_sample_description_index in the TrackExtendsBox ("trex") box.
As shown, in the second movie fragment, a new or dynamic track (having a track identifier track_ID equal to 2) is declared in its MovieFragmentBox ("moof") box labeled 330. This new or dynamic track is not defined in the MovieBox box 300 (i.e., in the initialization segment). According to this example, the second movie fragment contains samples for two tracks in the MediaDataBox ("mdat") box 340: samples N+1 to N+M, belonging to the track having a track identifier track_ID equal to 1, which are associated with the sample entry "xxxx" via a parameter denoted sample_description_index in the TrackFragmentHeaderBox ("tfhd") box with track_ID equal to 1 or, by default, via a parameter denoted default_sample_description_index in the TrackExtendsBox ("trex") box with track_ID equal to 1; and samples N+M+1 to N+M+Z, belonging to the new track having a track identifier track_ID equal to 2, which are associated with the sample entry "yyyy" via a parameter denoted sample_description_index in the TrackFragmentHeaderBox ("tfhd") box with track_ID equal to 2.
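The association just described can be sketched as a small resolution routine. The dictionary-based representation of the parsed boxes is an assumption for illustration; the tf_flags bit value for sample-description-index-present (0x000002) is the one defined by ISO/IEC 14496-12:

```python
def resolve_sample_description_index(tfhd: dict, trex_by_track_id: dict) -> int:
    """Return the sample description index governing a track fragment.

    tf_flags bit 0x000002 (sample-description-index-present) means the index
    is carried in the 'tfhd' box itself; otherwise the
    default_sample_description_index of the matching TrackExtendsBox ('trex')
    applies. A dynamic track has no 'trex' box, so for such a track the index
    must be present in its 'tfhd' box.
    """
    if tfhd["tf_flags"] & 0x000002:
        return tfhd["sample_description_index"]
    trex = trex_by_track_id.get(tfhd["track_ID"])
    if trex is None:
        raise ValueError("track_ID %d: no 'trex' default and no "
                         "sample_description_index in 'tfhd'" % tfhd["track_ID"])
    return trex["default_sample_description_index"]
```

In the example of fig. 3, samples of the track with track_ID 1 may fall back to the "trex" default, whereas the dynamic track with track_ID 2 must carry the index in its "tfhd" box.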
The new or dynamic track in the second movie fragment is declared by defining a TrackBox ("trak") box (here "trak" box 345) in the "moof" box 330.
The same mechanism can be used to define a new or dynamic track in any movie fragment of the presentation.
According to some embodiments, the definition of the "trak" box is modified to authorize its declaration in a MovieFragmentBox ("moof") box as follows:
Box Type:'trak'
Container:MovieBox or MovieFragmentBox
Mandatory:Yes
Quantity:One or more
When a TrackBox ("trak") box is present in a MovieFragmentBox ("moof") box, it is preferably located before any TrackFragmentBox ("traf") boxes, for example, before the "traf" boxes 350 and 355. A track defined in this way is valid only for the duration of the movie fragment. It should use a track identifier (track_ID) different from the track identifier of any track defined in the MovieBox box (i.e., in the initialization segment), but may have the same track identifier as a track defined in a previous movie fragment to identify the continuation of the same track; in this case, the corresponding TrackBox boxes are identical (bit-for-bit). To span multiple fragments, a new or dynamic track must thus be declared again from one fragment to the next, to ensure random access to the individual track fragments. The continuity of one or more new or dynamic tracks may involve some of the movie fragments corresponding to a period of time, but not necessarily all movie fragments covering that period. In a variant, to determine the continuity of a new or dynamic track over two (not necessarily consecutive) movie fragments, a parser or reader may simply compare the TrackBox payloads: if they are equal (bit-for-bit), the parser or reader can safely consider them the same track, and some decoder reinitialization can be avoided. If they are not equal (even if the same track identifier is used), the parser or reader should consider the track to be different from the previous one. Preferably, in case there is no continuity between tracks in different movie fragments, it may be recommended to assign different track identifiers to the new or dynamic tracks.
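The continuity rule above reduces to a simple comparison. This is a minimal sketch under the assumption that each track declaration has been parsed into a (track_ID, raw TrackBox payload) pair:

```python
def continues_same_track(prev: tuple[int, bytes], cur: tuple[int, bytes]) -> bool:
    """Decide whether a 'trak' box declared in a later movie fragment is a
    continuation of a dynamic track seen earlier: the track identifiers must
    match and the TrackBox payloads must be bit-for-bit identical."""
    return prev[0] == cur[0] and prev[1] == cur[1]
```

When this returns True, the reader may treat the track as continuing and avoid decoder reinitialization; when it returns False, the reader must treat the declaration as a different track even if the track identifier is reused.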
More generally, the track identifier of a new or dynamic track should not conflict with any other track identifier in the MovieBox box (for this purpose, the writer may use the next available track identifier value). When a "unit"-like encapsulated file is used, the track identifier of the new or dynamic track should not conflict with any identifier of a track, track group, entity group, or item. Preferably, a reserved range within the 32 bits allocated to track identifiers may be used. For example, track identifiers above 0x10000000 may be considered reserved for new or dynamic tracks to avoid any collision with static tracks (i.e., tracks declared in the MovieBox ("moov") box).
Note that a "trak" box defined in a movie fragment should be empty (i.e., the "trak" box does not define any samples, and the mandatory boxes defining sample timing or offsets, such as the TimeToSampleBox ("stts"), SampleToChunkBox ("stsc"), and ChunkOffsetBox ("stco") boxes, should have no entries (entry_count = 0)), should contain at least one sample entry (shown by the sample entry with 4CC "yyyy") describing the coding format of the samples belonging to the track and contained in the movie fragment, and should have no SampleGroupDescriptionBox ("sgpd") box defined. The samples of the track are implicitly fragmented. The duration in the TrackHeaderBox ("tkhd") box of a TrackBox ("trak") box in a movie fragment should be 0.
According to some embodiments, the MovieFragmentBox ("moof") box contains one TrackFragmentBox ("traf") box for each track storing associated samples in the associated MediaDataBox ("mdat") box, that is, for the tracks defined within the initialization segment and for the tracks defined in the movie fragment under consideration, such as the new track defined in the "trak" box 345. Thus, according to the example shown, the "moof" box 330 contains two TrackFragmentBox ("traf") boxes, one for the track with a track identifier track_ID equal to 1 ("traf" box 350) and the other for the new track with a track identifier track_ID equal to 2 ("traf" box 355).
The TrackFragmentHeaderBox ("tfhd") box of the "traf" box referring to a new track not previously defined in the MovieBox ("moov") box preferably has the following flags set in its tf_flags parameter, indicating that the corresponding information is present in the TrackFragmentHeaderBox ("tfhd") box of the new track (because there is no associated TrackExtendsBox ("trex") box declared in the MovieBox ("moov") box for the new track that would provide the corresponding default values): sample-description-index-present, default-sample-duration-present, default-sample-size-present, and default-sample-flags-present.
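A reader-side sanity check for this constraint can be sketched as follows. The bit values are the tf_flags values defined by ISO/IEC 14496-12 for the "tfhd" box; the validator function itself is a hypothetical helper:

```python
# tf_flags values defined by ISO/IEC 14496-12 for the 'tfhd' box
SAMPLE_DESCRIPTION_INDEX_PRESENT = 0x000002
DEFAULT_SAMPLE_DURATION_PRESENT = 0x000008
DEFAULT_SAMPLE_SIZE_PRESENT = 0x000010
DEFAULT_SAMPLE_FLAGS_PRESENT = 0x000020

# A dynamic track has no 'trex' box supplying defaults, so all four values
# must be carried in the 'tfhd' box itself.
REQUIRED_FOR_DYNAMIC_TRACK = (SAMPLE_DESCRIPTION_INDEX_PRESENT
                              | DEFAULT_SAMPLE_DURATION_PRESENT
                              | DEFAULT_SAMPLE_SIZE_PRESENT
                              | DEFAULT_SAMPLE_FLAGS_PRESENT)

def tfhd_flags_valid_for_dynamic_track(tf_flags: int) -> bool:
    """Check that a 'tfhd' box referring to a dynamic track carries all the
    values a 'trex' box would otherwise have provided."""
    return tf_flags & REQUIRED_FOR_DYNAMIC_TRACK == REQUIRED_FOR_DYNAMIC_TRACK
```

Other tf_flags bits (e.g., default-base-is-moof) may of course be set in addition without affecting this check.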
Note that, by defining the corresponding box in the "trak" box (e.g., in the "trak" box 345), for example by using a TrackReferenceBox ("tref") box or a TrackGroupBox ("trgr") box, respectively, any new track defined in a movie fragment may benefit from all the usual track mechanisms, e.g., it may reference another track or may belong to a group of tracks. According to some embodiments, a static track (i.e., a track initially defined in the MovieBox box) cannot reference a dynamic track (i.e., a track defined in a MovieFragmentBox box) through a track reference box, but a dynamic track may reference a static track. Provided that they are declared in the same movie fragment, a dynamic track may also reference another dynamic track through a track reference box (when the referenced track does not exist, the reference is ignored).
In a variant, a new or dynamic track may be defined by using a dedicated box with a new 4CC (e.g., a TemporaryTrackBox box with the 4CC "ttrk" or a DynamicTrackBox box with the 4CC "dntk"). Such a dedicated box may be a lightweight version of the TrackBox box hierarchy. The dedicated box may contain at least a track identifier, a media handler type specifying whether the track is a video, audio, or metadata track, a description of the sample content that makes up the track, and a data reference indicating the sample location (i.e., whether the samples are located in a MediaDataBox ("mdat") box or in an IdentifiedMediaDataBox ("imda") box). For example, the dedicated box may include a TrackHeaderBox ("tkhd") box providing the track identifier, a HandlerBox ("hdlr") box providing the media handler type, a SampleDescriptionBox ("stsd") box providing the description of the sample content, and a DataReferenceBox ("dref") box indicating the sample location (e.g., in a MediaDataBox ("mdat") box or an IdentifiedMediaDataBox ("imda") box).
In another variant, a new or dynamic track may be defined by extending the TrackFragmentBox ("traf") box. A new version of the box (e.g., version = 1) or a new flag value of its TrackFragmentHeaderBox ("tfhd") box (e.g., a new flag value trak_in_moof in the tf_flags parameter) may be defined to signal to the reader that the TrackFragmentBox ("traf") box corresponds to a new track that was not previously defined in the MovieBox box. When the new version of the "tfhd" box is used, or when the new flag value trak_in_moof is set in the "tfhd" box, the "traf" box contains a MediaBox ("mdia") box providing at least the media handler type and a description of the samples that make up the track, without defining the samples (i.e., without defining the timing and offsets of the samples, which are defined within TrackRunBox ("trun") boxes), and a data reference indicating the location of the samples (e.g., in the MediaDataBox ("mdat") box or in the IdentifiedMediaDataBox ("imda") box).
In another variant, when the new version of the "tfhd" box is used, or when the new tf_flags value trak_in_moof is set in the "tfhd" box, the "traf" box may contain some of the following boxes: MediaHeaderBox ("mdhd"), HandlerBox ("hdlr"), DataReferenceBox ("dref"), and/or SampleDescriptionBox ("stsd").
Still according to the above-described embodiments and their variants, a reader may be signaled about the possibility of defining, in a movie fragment, a new or dynamic track that was not previously defined in the MovieBox ("moov") box, by defining a new DynamicTracksConfigurationBox ("dytk") box (e.g., the "dytk" box 330) in the MovieExtendsBox ("mvex") box (e.g., in the "mvex" box 360 in the "moov" box 300).
The "dytk" box may be defined as follows:
BoxType:'dytk'
Container:MovieExtendsBox
Mandatory:Yes if TrackBox is present in movie fragments
Quantity:Zero or one
aligned(8) class DynamicTracksConfigurationBox
extends FullBox('dytk', 0, 0) {}
According to some embodiments, the presence of this box indicates that tracks not previously defined in the MovieBox box may be declared in movie fragments. Conversely, the absence of this box indicates that no tracks other than the tracks previously defined within the MovieBox box may be present in any MovieFragmentBox ("moof") box, i.e., no new track is declared in any MovieFragmentBox ("moof") box.
In a variant, by defining a new flag value such as a track_in_moof flag in the MovieExtendsHeaderBox ("mehd") box of the MovieExtendsBox ("mvex") box (e.g., in the "mvex" box 360), the reader is signaled the possibility of defining, in a movie fragment, a new track (e.g., a dynamic track) that was not previously defined in the MovieBox box, for example as follows:
track_in_moof: the flag mask is 0x000001:
when set, indicates that tracks not previously defined in the MovieBox box may be defined within movie fragments (i.e., for example, a TrackBox may be declared within a MovieFragmentBox ("moof") box),
when not set, no tracks other than the tracks previously defined in the MovieBox box (i.e., static tracks) may be defined in any MovieFragmentBox ("moof") box (i.e., for example, no TrackBox should be present in any MovieFragmentBox ("moof") box).
In another variant, the flag value track_in_moof is preferably defined in the MovieHeaderBox ("mvhd") box of the MovieBox ("moov") box (not shown in fig. 3), rather than in the MovieExtendsHeaderBox ("mehd") box. This avoids requiring the optional MovieExtendsHeaderBox ("mehd") box, which is sometimes not usable in derived specifications (e.g., in the Common Media Application Format (CMAF) when the duration of movie fragments is unknown).
An advantage of the above-described embodiments and their variants is that the writer no longer needs to know the worst case (i.e., all possible tracks and their configurations) in advance to produce the initial MovieBox ("moov") box. A new or dynamic track may be introduced when it becomes available, for the duration of a movie fragment (e.g., adding one or more media streams corresponding to additional languages or subtitles, or an additional camera view).
FIG. 4 illustrates an example of a fragmented presentation encapsulated in one or more media files, wherein sample entries and/or tracks are defined in a movie fragment according to some embodiments of the invention.
According to the embodiment described with reference to fig. 4, in addition to making it possible to define new or dynamic tracks in a movie fragment as described with reference to fig. 3, the ISOBMFF is extended to allow sample entries that were not previously defined in the MovieBox ("moov") box to be defined, in a movie fragment, for tracks defined in the MovieBox ("moov") box. Such sample entries present in a track fragment may be referred to as dynamic sample descriptions or dynamic sample entries.
As shown, the MovieBox ("moov") box labeled 400 provides a description of a presentation initially consisting of at least one track defined by the TrackBox ("trak") box labeled 405, having a track identifier track_id equal to 1, and declaring a sample sequence having a coding format described by the sample entry shown with 4cc "xxxx".
According to the illustrated example, the "moov" box 400 is followed by two movie fragments. The first movie fragment consists of a MovieFragmentBox ("moof") box labeled 410 and a MediaDataBox ("mdat") box labeled 420. The first movie fragment defines a new sample entry, shown with 4CC "zzzz", for the track with a track identifier track_ID equal to 1. The first movie fragment stores samples in the "mdat" box 420 for a track fragment described by a TrackFragmentBox ("traf") box having a track identifier track_ID equal to 1 in the "moof" box 410.
The second movie fragment defines, in its MovieFragmentBox ("moof") box labeled 430, a new or dynamic track (i.e., a track having a track identifier track_ID equal to 2) that is not defined in the "moov" box 400. The second movie fragment stores samples for two tracks in its MediaDataBox ("mdat") box labeled 440: samples N+1 to N+M for the track having a track identifier track_ID equal to 1, and samples N+M+1 to N+M+Z for the new track having a track identifier track_ID equal to 2.
According to some embodiments of the present invention, a new or dynamic track, or a new (or dynamic) sample entry of a track already defined in the MovieBox box, may be defined in any movie fragment of the presentation. A new track and a new sample entry may be defined in the same movie fragment or in different movie fragments. In addition, the same new track and/or the same new sample entry may be defined in different movie fragments.
To support the definition of new sample entries in a movie fragment, the ISOBMFF is extended to allow the sample description box to be declared not only within the SampleTableBox ("stbl") box of a TrackBox ("trak") box in the MovieBox ("moov") box, but also within a TrackFragmentBox ("traf") box of a MovieFragmentBox ("moof") box, as follows:
Box Types:'stsd'
Container:SampleTableBox or TrackFragmentBox
Mandatory:Yes
Quantity:Exactly one
Thus, new sample entries with different encoding format parameters may be declared at the movie fragment level. This is useful, for example, when there is a change in the codec configuration of the encoded media data or stream (e.g., an unexpected switch from AVC HD format to HEVC UHD format) or a change in content protection information (e.g., an unexpected switch from clear format to protected format). In fact, some application profiles, such as in the DVB specification, need to support multiple codecs. This means that a codec change may occur during a multimedia presentation or program. When this occurs in live programming (and is not known at the beginning), dynamic sample entries allow signaling such a codec change. More generally, dynamic sample entries may be used to signal changes in coding parameters even if the codec remains unchanged.
Samples (e.g., samples 1 through N belonging to the track having a track identifier track_ID equal to 1) are associated with a sample entry via a sample description index value. The range of values of the sample description index may be split into ranges so that one sample entry or another may be used. To illustrate, the range of values of the sample description index may be split into two ranges, for example, to use either the sample entries defined in the "trak" box 405 or the sample entries defined in the "traf" box 415, as follows:
- values from 0x0001 to 0x10000: these values indicate the index of a sample entry (or sample description) contained in the SampleDescriptionBox ("stsd") box of the TrackBox ("trak") box corresponding to the track_ID indicated in the TrackFragmentBox ("traf") box; and
- values from 0x10001 to 0xFFFFFFFF: these values, decremented by 0x10000, indicate the index of a sample entry (or sample description) contained in the SampleDescriptionBox ("stsd") box of the TrackFragmentBox ("traf") box corresponding to the indicated track_ID.
Thus, samples 1 through N stored in the "mdat" box 420 will be associated with a sample entry with 4CC "xxxx" defined in the "trak" box 405 or a sample entry with 4CC "zzzz" defined in the "traf" box 415, depending on the associated sample description index value.
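The range split above can be sketched as a small mapping function. The function name and tuple representation are hypothetical; only the numeric ranges come from the proposal:

```python
MOOV_RANGE_MAX = 0x10000  # indices 0x0001..0x10000 address the 'moov'-level 'stsd'

def locate_sample_entry(sample_description_index: int):
    """Map a sample description index to the box holding the sample entry
    ('trak' for the MovieBox-level 'stsd', 'traf' for the fragment-level
    'stsd') and the local 1-based index within that box."""
    if not 0 < sample_description_index <= 0xFFFFFFFF:
        raise ValueError("sample description index out of range")
    if sample_description_index <= MOOV_RANGE_MAX:
        return ("trak", sample_description_index)
    return ("traf", sample_description_index - MOOV_RANGE_MAX)

print(locate_sample_entry(1))        # → ('trak', 1), e.g. sample entry "xxxx"
print(locate_sample_entry(0x10001))  # → ('traf', 1), e.g. sample entry "zzzz"
```

Writers and parsers using this convention need no other signaling to disambiguate static and dynamic sample entries for a track.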
When a SampleDescriptionBox ("stsd") box is present in a TrackFragmentBox ("traf") box, it preferably immediately follows the TrackFragmentHeaderBox ("tfhd") box. According to some embodiments, the sample entries given in a SampleDescriptionBox ("stsd") box defined in the TrackFragmentBox ("traf") box of a movie fragment are valid only for the duration of that movie fragment.
Still according to some embodiments in which dynamic sample entries may be declared in a movie fragment, the new or dynamic track in the second movie fragment (or any new track in any movie fragment) may be declared as described with reference to fig. 3.
In the embodiment shown in FIG. 4, the new track is declared by defining a TrackBox ("trak") box, labeled 445, in the MovieFragmentBox ("moof") box labeled 430.
To this end, the definition of the TrackBox ("trak") box may be modified to authorize its declaration in a MovieFragmentBox ("moof") box as follows:
Box Type:'trak'
Container:MovieBox or MovieFragmentBox
Mandatory:Yes
Quantity:One or more
When a TrackBox ("trak") box (e.g., "trak" box 445) is present in a MovieFragmentBox ("moof") box (e.g., "moof" box 430), it is preferably placed before any TrackFragmentBox ("traf") boxes (e.g., "trak" box 445 is placed before "traf" boxes 450 and 455). A track defined in this way is valid only for the duration of the corresponding movie fragment. Such a new or dynamic track should not have the same track identifier (track_ID) as any track defined in the MovieBox box, but may have the same track identifier as a track defined in one or more other movie fragments to identify the continuation of the same track, in which case the TrackBox boxes are identical (bit-for-bit).
Furthermore, to span multiple fragments, a new or dynamic track must be declared again from one fragment to the next, to ensure random access to the individual track fragments. The continuity of one or more new or dynamic tracks may involve some of the movie fragments corresponding to a period of time, but not necessarily all movie fragments covering that period. In a variant, to determine the continuity of a new or dynamic track over two (not necessarily consecutive) movie fragments, a parser or reader may simply compare the TrackBox payloads: if they are equal (bit-for-bit), the parser or reader can safely consider them the same track, and some decoder reinitialization can be avoided. If they are not equal (even if the same track identifier is used), the parser or reader should consider the track to be different from the previous one. Preferably, in case there is no continuity between tracks in different movie fragments, it may be recommended to assign different track identifiers to the new or dynamic tracks. More generally, the track identifier of a new or dynamic track should not conflict with any other track identifier in the MovieBox box (for this purpose, the writer may use the next available track identifier value). When a "unit"-like encapsulated file is used, the track identifier of the new or dynamic track should not conflict with any identifier of a track, track group, entity group, or item. Preferably, a reserved range within the 32 bits allocated to track identifiers may be used. For example, track identifiers above 0x10000000 may be considered reserved for new or dynamic tracks to avoid any conflict with static tracks (i.e., tracks declared in the MovieBox ("moov") box).
Again, note that a TrackBox ("trak") box defined in a movie fragment, such as the "trak" box 445, should be empty (i.e., the "trak" box does not define any samples, and the mandatory boxes defining sample timing or offsets, such as the TimeToSampleBox ("stts"), SampleToChunkBox ("stsc"), and ChunkOffsetBox ("stco") boxes, should have no entries (entry_count = 0)), should have an empty SampleDescriptionBox ("stsd") box (e.g., "stsd" box 460), and should have no SampleGroupDescriptionBox ("sgpd") box defined. The samples of the track are implicitly fragmented. The duration in the TrackHeaderBox ("tkhd") box of the TrackBox ("trak") box (e.g., "trak" box 445) in the movie fragment should be 0.
According to some embodiments, the MovieFragmentBox ("moof") box contains one TrackFragmentBox ("traf") box for each track storing associated samples in the associated MediaDataBox ("mdat") box, i.e., for the tracks defined within the initialization segment and for the tracks defined in the movie fragment under consideration, such as the new track defined in the "trak" box 445. Thus, according to the illustrated example, the "moof" box 430 contains two TrackFragmentBox ("traf") boxes, one for the track having a track identifier track_ID equal to 1 ("traf" box 450), and one for the new or dynamic track having a track identifier track_ID equal to 2 ("traf" box 455).
Again, the TrackFragmentHeaderBox ("tfhd") box of the "traf" box referring to the new track not previously defined in the MovieBox box preferably has the following flags set in its tf_flags parameter (which describes track fragment attributes through a list of flags): sample-description-index-present, default-sample-duration-present, default-sample-size-present, and default-sample-flags-present. These flags are indicated because, for this new or dynamic track, no associated TrackExtendsBox ("trex") box provides default values for the samples (N+M+1 through N+M+Z) of the track fragment.
When a TrackBox ("trak") box (e.g., "trak" box 445) is declared in a MovieFragmentBox ("moof") box (e.g., "moof" box 430), a SampleDescriptionBox ("stsd") box (e.g., "stsd" box 470) should be declared in the TrackFragmentBox ("traf") box (e.g., "traf" box 455) having the same track identifier as the TrackBox ("trak") box.
Again, note that, by defining the corresponding box in the TrackBox ("trak") box (e.g., in the "trak" box 445), for example by using a TrackReferenceBox ("tref") box or a TrackGroupBox ("trgr") box, respectively, any new track defined in a movie fragment may benefit from all the usual track mechanisms, e.g., it may reference another track or may belong to a group of tracks. According to some embodiments, a static track (i.e., a track originally defined in the MovieBox box) cannot reference a dynamic track (i.e., a track defined in a MovieFragmentBox box) through a track reference box, but a dynamic track may reference a static track. Provided that they are declared in the same movie fragment, a dynamic track may also reference another dynamic track through a track reference box (when the referenced track does not exist, the reference is ignored).
In a variant, a new or dynamic track may be defined by using a dedicated box with a new 4CC (e.g., a TemporaryTrackBox box with the 4CC "ttrk" or a DynamicTrackBox box with the 4CC "dntk"). Such a dedicated box may be a lightweight version of the TrackBox box hierarchy. The dedicated box may comprise at least: a track identifier; a media handler type specifying whether the track is a video, audio, or metadata track; and a data reference indicating the location of the samples (i.e., whether the samples are located in a MediaDataBox ("mdat") box or in an IdentifiedMediaDataBox ("imda") box). In this variant, a description of the samples that make up the track is not required, since that description is provided by the sample entries (e.g., sample entry "yyyy") declared in the SampleDescriptionBox ("stsd") box (e.g., "stsd" box 470) of the TrackFragmentBox ("traf") box having the same track identifier (e.g., "traf" box 455).
It will be apparent to those skilled in the art that other embodiments or variations of defining new or dynamic tracks described with reference to fig. 3 are applicable thereto.
Still according to the above-described embodiment and its variants, the reader may be signaled, at the beginning of the file, about the possibility of defining, in movie fragments, new tracks and/or new sample entries that were not previously defined in the MovieBox box. This indication may help the reader determine whether it can support the media file, or only some of the tracks in the file. This may be accomplished, for example, by defining a new DynamicTracksConfigurationBox ("dytk") box (e.g., "dytk" box 480; the name and 4CC are just examples) in the MovieExtendsBox ("mvex") box (e.g., "mvex" box 490) of the MovieBox ("moov") box (e.g., "moov" box 400).
The "dytk" box may be defined as follows:
BoxType:'dytk'
Container:MovieExtendsBox
Mandatory:Yes if TrackBox or SampleDescriptionBox are present in movie fragments
Quantity:Zero or one
aligned(8) class DynamicTracksConfigurationBox
extends FullBox('dytk', 0, flags) {}
According to some embodiments, this box may be used to signal the presence of a TrackBox ("trak") box (or a similar box, depending on the variant) and/or of a SampleDescriptionBox ("stsd") box in movie fragments. To illustrate, the following flag values may be used and defined as follows:
track_in_moof: the flag mask is 0x000001:
when set, indicates that a TrackBox ("trak") box may be declared within a movie fragment,
when not set, a MovieFragmentBox should not contain any TrackBox ("trak") box;
stsd_in_traf: the flag mask is 0x000002:
when set, indicates that a SampleDescriptionBox ("stsd") box may be declared within a TrackFragmentBox ("traf") box,
when not set, a TrackFragmentBox ("traf") box should not contain any SampleDescriptionBox ("stsd") box.
In a variant, the reader is signaled the possibility of defining a new track or dynamic track (i.e., a track not previously defined in the MovieBox box) in movie fragments by defining a new flag value (e.g., a new flag value track_in_moof) in the MovieExtendsHeaderBox ("mehd") box of the MovieExtendsBox ("mvex") box 490 as follows:
track_in_moof: the flag mask is 0x000001:
when set, indicates that a track not previously defined in the MovieBox box may be defined in a movie fragment (i.e., for example, a TrackBox may be declared within a MovieFragmentBox ("moof") box),
when not set, no tracks other than those previously defined in the MovieBox box (i.e., static tracks) can be defined in any MovieFragmentBox ("moof") box (i.e., for example, a TrackBox should not be present in any MovieFragmentBox ("moof") box).
In another variant, the flag value track_in_moof is preferably defined in the MovieHeaderBox ("mvhd") box in the MovieBox ("moov") box (not shown in fig. 3) rather than in the MovieExtendsHeaderBox ("mehd") box. This avoids having to declare the optional MovieExtendsHeaderBox ("mehd") box, which is sometimes not available in derived specifications (e.g., in the Common Media Application Format (CMAF) when the duration of the movie fragments is unknown).
Similarly, the reader may be signaled the possibility of defining a new sample entry in movie fragments (i.e., a sample entry not previously defined in the MovieBox box) by defining a new flag (e.g., a flag value stsd_in_traf) in the TrackExtendsBox ("trex") box of the MovieExtendsBox ("mvex") box 490 as follows:
stsd_in_traf: the flag mask is 0x000001:
when set, a SampleDescriptionBox ("stsd") box may be signaled within a TrackFragmentBox ("traf") box having the same track identifier as the one set in the associated TrackExtendsBox ("trex") box,
when not set, a TrackFragmentBox ("traf") box should not contain any SampleDescriptionBox ("stsd") box.
In another variant, the flag value stsd_in_traf is preferably defined in a SampleDescriptionBox ("stsd") box in the MovieBox ("moov") box, rather than in a TrackExtendsBox ("trex") box. According to this variant, when parsing the SampleDescriptionBox ("stsd") box, the reader directly knows whether the sample description can be updated during the presentation, without parsing an extra box in the MovieExtendsBox ("mvex") box. For example, when the SampleDescriptionBox ("stsd") box has the flag value stsd_in_traf not set, the parser is guaranteed that all possible sample entries for the given track are declared in the MovieBox ("moov") box part of the file, i.e., in the initialization segment. On the other hand, when the SampleDescriptionBox ("stsd") box has the flag value stsd_in_traf set, this indicates to the parser that some additional, new, or dynamic sample entries may be defined later in subsequent movie fragments for the corresponding track.
In another variant, the reader is signaled the possibility of defining a new track and/or a new sample entry in movie fragments by defining a new DynamicTracksConfigurationBox ("dytk") box in a MovieBox ("moov") box, for example by defining "dytk" box 480 in the "mvex" box 490 in the "moov" box 400.
The DynamicTracksConfigurationBox ("dytk") box may be defined as follows:
this box may be used to signal the presence of a TrackBox ("trak") box or of a SampleDescriptionBox ("stsd") box in movie fragments. For this purpose, the following flags may be defined:
track_in_moof: the flag mask is 0x000001:
when set, the indication may declare a TrackBox ("trak") box within the animation segment,
when not set, the moviefragment box should not contain any TrackBox ("trak") box;
all_packages_dynamic_stsd: the flag mask is 0x000002:
when set, the indication may declare a sampleDescriptionBox ("stsd") box within a TrackFragmentBux ("traf") box for any track defined in the animation,
when not set, the TrackFragmentBox ("traf") box that is not signaled in the "dynk" box under consideration should not contain any SampleDescriptionBox ("stsd").
In addition, the following semantics may be defined for the parameters of the DynamicTracksConfigurationBox ("dytk") box:
nb_tracks gives the number of track_IDs listed;
track_IDs indicates the track identifiers of the tracks for which a SampleDescriptionBox ("stsd") box may be declared within a TrackFragmentBox ("traf") box. If the all_tracks_dynamic_stsd flag is not set and a track is not listed in this box, then no SampleDescriptionBox ("stsd") box should be present in any TrackFragmentBox ("traf") box of that track.
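The count-plus-list layout described above could be serialized as follows. This is a minimal, hypothetical Python sketch; the field sizes (32-bit count and track identifiers) are assumptions for illustration, not normative:

```python
import struct

ALL_TRACKS_DYNAMIC_STSD = 0x000002  # flag mask from the text

def make_dytk(flags: int, track_ids):
    """Sketch of a 'dytk' box: FullBox header, a track count, then the track IDs."""
    payload = struct.pack(">I", len(track_ids))
    payload += b"".join(struct.pack(">I", tid) for tid in track_ids)
    body = b"\x00" + flags.to_bytes(3, "big") + payload  # version=0 + 24-bit flags
    return struct.pack(">I", 8 + len(body)) + b"dytk" + body

def parse_dytk(box: bytes):
    """Return (flags, track_IDs) from a serialized 'dytk' box."""
    flags = int.from_bytes(box[9:12], "big")
    (nb_tracks,) = struct.unpack(">I", box[12:16])
    ids = list(struct.unpack(f">{nb_tracks}I", box[16:16 + 4 * nb_tracks]))
    return flags, ids
```

A parser can then decide, per track, whether to expect dynamic sample descriptions in track fragments.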
Still further in accordance with the above-described embodiments, the reader may be further signaled the presence of a new or dynamic track, or of a new or dynamic sample entry, in a movie fragment by defining the following flag values in the MovieFragmentHeaderBox ("mfhd") box in the MovieFragmentBox box:
new-track-present (or dynamic-track-present): the flag mask is 0x000001:
when set, indicates that a new (or dynamic) track is declared within the movie fragment,
when not set, indicates that no new (or dynamic) track is declared within the movie fragment;
stsd-present: the flag mask is 0x000002:
when set, indicates that a SampleDescriptionBox ("stsd") box is declared within the movie fragment,
when not set, indicates that no SampleDescriptionBox ("stsd") box is declared within the movie fragment.
While the signaling of new (or dynamic) tracks or sample entries within the MovieBox ("moov") box indicates to the reader that new tracks or new sample descriptions may have to be processed in subsequent media segments during the presentation, this signaling in the media segments allows the reader to determine whether the current movie fragment actually contains a new track or sample description definition. Further, the MovieFragmentHeaderBox ("mfhd") box in the MovieFragmentBox ("moof") box may be extended by declaring a new version of the box (e.g., version=1) and by adding a parameter next_track_ID. This parameter contains the next track identifier value that can be used to create a new or dynamic track in the next media segment. It typically contains a value one greater than the maximum track identifier value used at the file level in the presentation, up to and including the current media segment. This makes it possible to easily generate a unique track identifier without having to know all the previous media segments between the MovieBox ("moov") box and the last generated media segment.
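The next-track-identifier bookkeeping described in the paragraph above reduces to a simple rule. A minimal Python sketch (function name is illustrative):

```python
def next_track_id(used_track_ids) -> int:
    """Next usable track identifier: one greater than the largest identifier used
    so far in the presentation, up to and including the current media segment."""
    return max(used_track_ids, default=0) + 1

# A writer can carry this value in each fragment header so that a new dynamic
# track can be created in the next segment without re-reading earlier segments.
used = {1, 2, 5}
assert next_track_id(used) == 6
```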
According to some embodiments, the encapsulation module may use a brand (defined at the file level in a FileTypeBox ("ftyp") box, at the segment level in a SegmentTypeBox ("styp") box, or at the track level in a TrackTypeBox ("ttyp") box), an existing or new box, or even a field or flag value within these existing or new boxes, to indicate that a new track or new sample entry may be declared in a media file or media segment. When the encapsulation module does not use such an indication, the parser does not have to check for the presence of new tracks or sample entries. On the other hand, when such an indication is present, a parser according to the present invention should check for the presence of new or dynamic tracks or sample entries at the beginning of each movie fragment, and possibly check for the continuation of new or dynamic tracks.
Fig. 5 is a block diagram illustrating an example of steps performed by a server or writer to encapsulate encoded media data according to some embodiments of the invention.
These steps may be performed, for example, in the encapsulation module 220 in fig. 2.
As shown, a first step (step 500) is directed to obtaining a first portion of encoded media data, which may be comprised of one or more bitstreams representing an encoded timing sequence of video, audio, and/or metadata including one or more bitstream features (e.g., scalability layers, temporal sub-layers, and/or spatial sub-portions (such as HEVC blocks or VVC sub-pictures, etc.)). Potentially, multiple alternatives to the encoded media data, for example in terms of quality and resolution, may be obtained. Encoding is optional and the encoded media data may be the original media data.
From the first portion of encoded media data, the encapsulation module determines (step 505) a first set of tracks for encapsulating the first portion of encoded media data. It may be determined to use one track for each bitstream (e.g., one track for a video bitstream, one track for an audio bitstream, or two tracks for two video bitstreams, one track per video bitstream). Multiple bitstreams may also be multiplexed in one track (e.g., multiplexed audio-video tracks). A bitstream may also be split into multiple tracks (e.g., one track per layer or one track per spatial sub-picture). Additional tracks may also be defined that provide instructions for combining other tracks (e.g., a VVC base track describing a combination of several VVC sub-picture tracks).
Next, at step 510, a description of the first set of determined tracks and associated sample entries is generated and encapsulated in an initialization segment that includes a MovieBox box, such as "moov" box 300 or "moov" box 400. Such a description may include defining, in the MovieBox box, one TrackBox box including one or more sample entries for each track in the first set of tracks, such as "trak" box 305 or "trak" box 405. Furthermore, the encapsulation module signals in the MovieExtendsBox box that some tracks of the first set of tracks are fragmented and that new tracks and/or new sample entries may be defined later in movie fragments according to the previous embodiments. Each timed unit of encoded media data corresponding to the first set of tracks is encapsulated in samples of each respective track in one or more movie fragments (also denoted first media segments), each media segment consisting of a MovieFragmentBox box and a MediaDataBox box.
Next, during step 515, the initialization segment and the first media segment may be output or stored, possibly as an initialization segment file (followed by a set of media segment files), each media segment file containing one or more media segments. Note that such step (step 515) is optional, as the initialization segment and the first media segment may be output later, for example, together with the second media segment.
Next, at step 520, the encapsulation module obtains a second portion of the encoded media data. The second portion of the encoded media data may consist of the temporal continuation of the bitstreams forming the first portion of the encoded media data. This temporal continuation of the bitstreams may be encoded with different coding formats or parameters (e.g., changing from an AVC HD coding format to an HEVC 4K coding format), different protection or encryption schemes, or different packing organizations (e.g., stereo or region-wise packing for omnidirectional media). The second portion of encoded media data may also comprise some additional bitstreams not present in the first portion (e.g., new subtitle or audio language bitstreams, or new video bitstreams corresponding to new cameras or viewpoints or to detected objects or regions of interest).
Next, at step 525, the encapsulation module determines a second set of tracks and associated sample entries for encapsulating the second portion of the encoded media data. When the second portion of the encoded media data is a simple temporal continuation of the first portion, it may be decided to keep the same set of tracks for the second portion. It may also be decided to change the number of tracks, for example to encapsulate the temporal continuation of the first portion of the encoded media data in a different way (e.g., a VVC bitstream comprising sub-pictures may be encapsulated in a single track for a first period of time and then in multiple tracks, e.g., a VVC base track and multiple VVC sub-picture tracks, for a second period of time). New tracks may also be added to encapsulate the additional bitstreams of the second portion of the encoded media data, noting that some of the additional bitstreams may be multiplexed with some of the pre-existing bitstreams. New sample entries may also be defined for the bitstreams of the second portion if a change occurs in the second portion of the encoded media data, e.g., if any information in the coding format or codec profile, the protection scheme, the packing, or the sample description changes.
Next, at step 530, it is determined whether the description of the second set of tracks is the same as the description of the first set of tracks declared in the MovieBox box. If the description of the second set of tracks is the same as the description of the first set of tracks, the second portion of the encoded media data is encapsulated in movie fragments (i.e., second media segments) in the standard way, using the tracks and sample entries previously defined in the MovieBox box (step 535).
Otherwise, if a new track exists in the second set of tracks as compared to the first set of tracks, the new track is defined in a movie fragment. Furthermore, if there is a new sample entry associated with the second set of tracks as compared to the sample entries associated with the first set of tracks, the new sample entry is defined in a movie fragment. Defining new tracks and/or new sample entries may be performed as described with reference to fig. 3 or fig. 4.
The description of the new track and/or new sample entry is encapsulated in a second media segment along with a second portion of the encoded media data (step 540).
Next, at step 545, the second media segment is output, for example sent to a client via a communication network or stored in a storage device. The media segments may be stored or transmitted as segment files, or appended to the first media segments in an ISO Base Media File.
If there is still encoded media data to be packaged, as indicated by the dashed arrow, the process loops at step 520 until there is no more encoded media data to process.
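The loop of Fig. 5 may be sketched as follows. This is a hedged, illustrative Python sketch, not the actual encapsulation module: a "track set" is reduced to a set of identifiers, and segment output is reduced to dictionaries:

```python
def encapsulate(parts):
    """Sketch of steps 500-545: the first part fixes the MovieBox track set;
    later parts declare any new tracks/sample entries in their movie fragments."""
    parts = iter(parts)
    declared = set(next(parts))            # steps 500-510: initialization segment
    segments = [{"init": sorted(declared)}]
    for track_set in parts:                # steps 520-545 loop
        new = set(track_set) - declared
        segment = {"samples": sorted(track_set)}
        if new:                            # step 540: new definitions in the fragment
            segment["declares"] = sorted(new)
            declared |= new
        segments.append(segment)           # steps 535/545: output the segment
    return segments

segs = encapsulate([[1], [1], [1, 2]])
assert segs[1] == {"samples": [1]}         # step 535: nothing new to declare
assert segs[2]["declares"] == [2]          # step 540: new track declared in fragment
```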
Fig. 6 is a block diagram illustrating an example of steps performed by a client, parser, or reader to process packaged media data according to some embodiments of the invention.
These steps may be performed, for example, in decapsulation module 260 in fig. 2.
As shown, the first step (step 600) aims to obtain an initialization segment corresponding to the MovieBox box. The initialization segment may be obtained by parsing an initialization segment file received from the communication network or by reading a file on the storage device.
Next, at step 605, the obtained MovieBox box is parsed to obtain a first set of tracks corresponding to the descriptions of all the tracks defined for the presentation, and to obtain the associated sample entries describing the coding formats of the samples in the respective tracks. At the same step, the decapsulation module 260 may determine, from brand information or from a particular box (e.g., "dytk" box 480) in the initialization segment, that the file or its segments may contain dynamic (or new) tracks or dynamic (or new) sample entries.
Next, at step 610, processing and decoding of the encoded media data encapsulated in the track is initialized using the information items from the first set of tracks and using the associated sample entries. Typically, the media decoder is initialized with decoder configuration information present in the sample entries.
Next, at step 615, the decapsulation module obtains movie fragments (also denoted media segments) by parsing media segment files received from the communication network or by reading media segment files on the storage device.
Next, at step 620, the decapsulation module determines a second set of tracks from the information obtained when parsing the MovieFragmentBox and TrackFragmentBox boxes present in the obtained movie fragments. In particular, it is determined whether one or more new tracks and/or one or more new sample entries are signaled and defined in the MovieFragmentBox and/or the TrackFragmentBox according to the foregoing embodiments.
Next, at step 625, it is determined whether the second set of tracks and associated sample entries differ from the first set of tracks and associated sample entries. If they differ, the configuration of the processing and decoding of the encoded media data is updated using information items obtained from the MovieFragmentBox and/or TrackFragmentBox boxes of the obtained media fragments (step 630). For example, a new decoder may be instantiated to process a new bitstream, or a decoder may be reconfigured to process a bitstream with a changed coding format, codec profile, coding parameters, protection scheme, or packing organization.
After the configuration of the processing and decoding of the encoded media data has been updated, or if the second set of tracks and associated sample entries are the same as the first set, the encoded media data are decapsulated from the samples of the movie fragments and processed (step 670), e.g., to be decoded and displayed or rendered to a user.
If there are more media segments to process, as indicated by the dashed arrow, processing loops to step 615 until there are no more media segments to process.
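On the reader side (Fig. 6), the per-fragment decision reduces to comparing the fragment's track set against what is already configured. The following is a hedged sketch, not the actual parser; function and variable names are illustrative:

```python
def process_fragment(configured, fragment_tracks):
    """Detect new tracks/sample entries in a fragment and decide whether
    decoder (re)configuration is needed before decapsulating the samples."""
    new = set(fragment_tracks) - set(configured)
    needs_reconfig = bool(new)             # reconfiguration step runs only when True
    return needs_reconfig, set(configured) | new

reconfig, configured = process_fragment({1}, {1, 2})
assert reconfig and configured == {1, 2}   # a new track triggers reconfiguration
reconfig, configured = process_fragment(configured, {1, 2})
assert not reconfig                        # same set: decode directly
```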
According to another aspect of the invention, corrupted data in a sample or in a Network Abstraction Layer (NAL) unit (NALU) within a sample are signaled, as well as some features of the encapsulation mechanism that should be understood by the reader to parse and decode the encapsulated media data, in order to help the reader select the data to process. For example, data corruption may occur when data are received over an error-prone communication channel. To signal corrupted data in the bitstream to be encapsulated, a new sample group description with grouping_type "corr" (or any other predetermined name) may be defined. The "corr" sample group may be defined in any kind of track (e.g., video, audio, or metadata) to signal a set of corrupted or lost samples in the track. For illustration, the entries of the sample group description may be defined as follows:
where corrupted is a parameter indicating the corruption status of the associated data.
According to some embodiments, a value of 1 means that the data are lost as a whole. In this case, the associated data size (sample size or NAL size) should be set to 0. A value of 2 means that the data are corrupted such that they cannot be recovered even by a fault-tolerant decoder (e.g., the slice header of the NALU is lost). A value of 3 means that the data are corrupted but can still be processed by a fault-tolerant decoder. The value 0 is reserved.
According to some embodiments, the associated grouping_type_parameter is not defined for CorruptedSampleInfoEntry. If some data is not associated with an entry in CorruptedSampleInfoEntry, this means that the data is not corrupted.
A SampleToGroupBox ("sbgp") box with grouping_type equal to "corr" allows a CorruptedSampleInfoEntry to be associated with each sample, indicating whether the sample contains corrupted data.
This sample group description with grouping_type "corr" can also be advantageously combined with the NALU mapping mechanism, which consists of a SampleToGroupBox ("sbgp") box and a SampleGroupDescriptionBox ("sgpd") box, both boxes having the grouping_type "nalm", and a sample group description entry NALUMapEntry. The NALU mapping mechanism with the grouping_type_parameter set to "corr" allows signaling corrupted NALUs within the samples. The groupID of a NALUMapEntry map entry indicates an index, starting from 1, into the sample group description of CorruptedSampleInfoEntry entries. A groupID set to zero indicates that no entry is associated (the identified data are present and not corrupted).
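The groupID indirection described above can be sketched as a simple lookup. This is illustrative Python; the status constants follow the semantics given earlier (value 0 reserved):

```python
# Corruption status values from the text (0 is reserved).
CORR_LOST = 1           # data lost as a whole (associated size set to 0)
CORR_UNRECOVERABLE = 2  # corrupted beyond what a fault-tolerant decoder can handle
CORR_RECOVERABLE = 3    # corrupted but still processable by a fault-tolerant decoder

def nalu_status(group_ids, corr_entries):
    """Map each NALU's groupID to its corruption status.
    groupID 0 means the NALU is present and not corrupted (no entry associated);
    otherwise groupID is a 1-based index into the 'corr' description entries."""
    return [None if gid == 0 else corr_entries[gid - 1] for gid in group_ids]

statuses = nalu_status([0, 1, 2], [CORR_LOST, CORR_RECOVERABLE])
assert statuses == [None, CORR_LOST, CORR_RECOVERABLE]
```

A fault-tolerant reader could, for example, drop NALUs with status 1 or 2 and still feed status-3 NALUs to the decoder.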
When a sample group is used to indicate whether samples are corrupted, it becomes important to be able to signal to the reader whether the sample group must be supported (parsed and understood) to process the samples. Indeed, in this case, the sample entries alone may not be sufficient to determine whether a track is supported.
According to a particular embodiment, a new version of the SampleGroupDescriptionBox is defined (e.g., version=3). If the version of the SampleGroupDescriptionBox ("sgpd") box is equal to 3, the sample group description describes essential information for the associated samples, and a parser, player, or reader should not attempt to play any track containing an unrecognized sample group description marked as essential.
In a simple variant of version 3 of the SampleGroupDescriptionBox ("sgpd") box, the essential property may be indicated by a new parameter in the "sgpd" box, for example called "essential", as follows:
When the essential parameter takes the value 0, the sample group entries declared in the SampleGroupDescriptionBox ("sgpd") box are descriptive and do not need to be supported by a parser, player, or reader. When the essential parameter takes the value 1, the sample group entries declared in the SampleGroupDescriptionBox ("sgpd") box must be supported to properly process the samples mapped to those sample group entries.
The presence of essential sample properties defined in the file by essential sample groups should be exposed through a MIME sub-parameter. This informs the parser or reader of the additional requirements it is expected to meet to support the file. For example, when a corrupted sample group is declared essential, a player with a basic decoder may not support tracks with such an essential sample group. On the other hand, a media player with a robust decoder (e.g., with concealment capabilities) may support a track with such an essential sample group. This new sub-parameter (e.g., called "essential") takes as value a comma-separated list of the four-character codes corresponding to the grouping_types declared as essential sample groups. For example:
codecs="avc1.420034",essential="4CC0"
indicates an AVC track with one essential sample group of grouping type (e.g.) "4CC0".
As another example:
codecs="hvc1.1.6.L186.80",essential="4CC1,4CC2"
indicates an HEVC track with two essential sample groups of grouping types (e.g.) "4CC1" and "4CC2".
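A reader's capability check against the proposed "essential" sub-parameter could look like the following. This is a hypothetical sketch; the sub-parameter name and comma-separated format follow the examples above:

```python
def parse_essential(value: str):
    """Split the 'essential' MIME sub-parameter into its grouping_type 4CCs."""
    return [cc.strip() for cc in value.split(",") if cc.strip()]

def track_supported(supported_groupings, essential_value: str) -> bool:
    """A reader may play the track only if it understands every 4CC listed
    as essential (in addition to the usual 'codecs' capability check)."""
    return all(cc in supported_groupings for cc in parse_essential(essential_value))

assert track_supported({"4CC1", "4CC2"}, "4CC1,4CC2")
assert not track_supported({"4CC1"}, "4CC1,4CC2")  # one essential group is unknown
```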
As a variant, instead of defining a new version of the SampleGroupDescriptionBox ("sgpd") box, a new value may be defined in the flags parameter of the SampleGroupDescriptionBox ("sgpd") box as follows:
essential: value 4 (0x000004); when set to 1, this flag indicates that the sample group description describes essential information for the associated samples, and a file processor or file reader should not attempt to play any track containing an unrecognized sample group description marked as essential.
As another variant, a new box named SampleGroupEssentialPropertyBox, with a four-character code "sgep" (or any other name or non-conflicting 4CC), may be defined with the same syntax and semantics as the SampleGroupDescriptionBox ("sgpd") box, except that it signals essential properties of the sample group that must be supported by the reader to process the track.
Support for the new version of the SampleGroupDescriptionBox (or its variants) may be made mandatory for the reader by defining new brands within the FileTypeBox ("ftyp"), SegmentTypeBox ("styp"), or TrackTypeBox ("ttyp") box.
Thus, an example of a mechanism for signaling essential sample properties may rely on the use of two boxes:
a SampleToGroupBox ("sbgp") box describing the assignment of each sample to a sample group and to its essential sample group description, and
a SampleGroupDescriptionBox ("sgpd") box with the essential flag value (or parameter, or a SampleGroupEssentialPropertyBox ("sgep") box, depending on the variants described above) that describes the essential properties of the samples within a particular sample group. The SampleGroupDescriptionBox ("sgpd") box (or SampleGroupEssentialPropertyBox ("sgep") box) contains a list of SampleGroupEntry entries, each instance of SampleGroupEntry providing a different value of the essential property defined by the particular sample group (identified by its "grouping_type").
A particular type of essential sample grouping is defined by the combination of a SampleToGroupBox ("sbgp") box and a SampleGroupDescriptionBox ("sgpd") box (or SampleGroupEssentialPropertyBox ("sgep") box) with the essential flag value or parameter, via a type field ("grouping_type").
Similarly, by using the sample group of type "nalm" described in ISO/IEC 14496-15, essential properties may be associated with one or more Network Abstraction Layer (NAL) units within a sample. By associating a sample group of type "nalm" with an essential sample group description box, essential properties are associated with one or more NAL units within one or more samples. The SampleToGroupBox ("sbgp") box with grouping type "nalm" provides the index of the NALUMapEntry assigned to each group of samples, and a grouping_type_parameter identifying the grouping_type of the associated essential sample group description box. The NALUMapEntry associates a groupID with each NAL unit in a sample, and the associated groupID provides the index of the entry in the essential sample group description that provides the essential properties of the associated NAL unit.
For example, the grouping_type_parameter of a SampleToGroupBox ("sbgp") box with grouping type "nalm" may be equal to "corr" to associate the sample group of type "nalm" with the sample group description of type "corr", to signal sets of NAL units in corrupted or lost samples.
Preferably, when the presentation is fragmented, if a sample group with a given grouping_type is initially marked as essential (e.g., in the MovieBox ("moov") box) and a sample group with the same grouping_type is defined in a subsequent media segment, the latter should also be marked as essential.
According to some embodiments, the principles of the essential sample group described above may be used as an extensible way of mandating support of sample group descriptions in a track. In particular, an essential sample group may be used to describe a transformation to be applied to the samples before or after decoding, wherein:
before decoding, i.e., when the content of the samples has been transformed (e.g., by encryption) in such a way that it can no longer be decoded by a regular decoder, or when the content should be decoded only if the player understands and implements the protection system or scrambling operation applied to the samples; and
after decoding, i.e., when the file author requires some operation on the decoded samples before playing or rendering them (e.g., if the content of the decoded samples should be unpacked before rendering, for example for stereoscopic pictures where the left and right views are packed in the same picture).
The essential sample group provides several advantages over transformations typically described using restricted sample entries or protected sample entries, including the following:
- the four-character code (4CC) of the sample entry indicating the sample coding format does not need to be changed. The original nature of the data in the track is therefore not hidden, and the player does not have to parse the whole hierarchy of restricted or protected sample entries to know the original format of the track,
- the essential sample group allows defining transformations with sample granularity more efficiently (less impact on file size and fragmentation),
- the essential sample group can easily support many potentially nested transformations, and
- the essential sample group can easily support a new transformation whenever a new property is defined.
Another benefit of using a generic mechanism (e.g., essential sample groups) to signal the transformations to be applied to samples is to allow certain classes of file processors to operate on files using transformations that are unknown to them. With approaches relying on restricted or protected sample entries, introducing a new transformation through a sample entry requires updating the code of the dasher (e.g., the device preparing content to be streamed according to the Dynamic Adaptive Streaming over HTTP (DASH) standard) or of the transcoder, which is always error-prone. Using a generic mechanism such as essential sample groups, dashers and transcoders can process files without knowing the nature of the transformation.
According to some embodiments, a transformation is defined as an essential sample group by declaring an essential sample group (e.g., using the SampleGroupDescriptionBox with version=3, or any of the alternatives described previously), where the grouping_type value identifies the type of transformation (e.g., possibly corresponding to the scheme type of some known transformations (such as "stvi" for stereo packing, or "cenc" or "cbc1" for some Common Encryption schemes), or possibly corresponding to the 4CC of some known transformation properties (such as "clap" for clipping/clean aperture or "irot" for rotation), or any new 4CC defining a new transformation). The properties of the transformation are declared in one or more SampleGroupDescriptionEntry() entries of the essential SampleGroupDescriptionBox, each entry corresponding to an alternative value of the property. One of the one or more SampleGroupDescriptionEntry() entries is associated with each sample of the track using the SampleToGroupBox or the default_group_description_index parameter of the SampleGroupDescriptionBox.
As an alternative, when the transformation is essential, it may be defined as an essential sample group (e.g., using the SampleGroupDescriptionBox with version=3, or any of the alternatives described previously), and when the transformation is optional, it may be defined as a normal (i.e., non-essential) sample group (e.g., using the SampleGroupDescriptionBox with version<3, without any essential signaling).
When several sample groups defining transformations and/or descriptive properties are declared and associated with the samples of a track, it is necessary to define the order in which a parser or reader must process these sample groups in order to correctly decode or render each sample.
According to some embodiments, a new essential sample group grouping_type, e.g., "esgh", is defined with a SampleGroupDescriptionEntry EssentialDescriptionsHierarchyEntry as follows:
the EssentialDescriptionsHierarchyEntry sample group description indicates the processing order of the essential sample group descriptions applied to a given sample, and
the EssentialDescriptionsHierarchyEntry sample group description is itself an essential sample group description and uses version 3 of the SampleGroupDescriptionBox (or any other alternative described earlier). This sample group is present if there is at least one essential sample group description.
Each of the essential sample group descriptions, except the EssentialDescriptionsHierarchyEntry itself, is listed in the EssentialDescriptionsHierarchyEntry sample group description.
The grouping_type_parameter of the EssentialDescriptionsHierarchyEntry sample group description is not defined and its value is set to 0.
The syntax and semantics of the EssentialDescriptionsHierarchyEntry may be defined as follows:
wherein:
num_groups indicates the number of essential sample group description types listed in the entry, and
sample_group_description_type indicates the four-character code of an essential sample group description applied to the associated samples. These types are listed in order, i.e., any potential sample processing described by the sample group of type sample_group_description_type[i] is applied before any potential sample processing described by the sample group of type sample_group_description_type[i+1]. The reserved value "stsd" indicates the position of the decoding process in the transformation chain.
If the sample_group_description_type list does not contain "stsd", then all the listed essential sample groups apply to the samples after decoding.
As an example, samples that were encrypted after being encoded may be signaled by an essential sample group of grouping_type "vcne". If the same samples also require the application of a post-processing filter after decoding, this post-processing may be signaled by an essential sample group of grouping_type "ppfi". According to this example, an EssentialDescriptionsHierarchyEntry is defined. It lists the transformations in the following order: ["vcne", "stsd", "ppfi"], to signal the order of the nested transformations. From this signaling, and following the order of the transformations, a parser or reader can determine that the samples must be decrypted before being decoded by the decoder identified by the sample entry, and must be processed by the post-processing filter before being rendered or displayed.
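The "stsd" pivot in the transformation list can be interpreted as follows. This is illustrative Python; the grouping_type values are the hypothetical ones used in the example above:

```python
def split_transform_chain(types):
    """Split an EssentialDescriptionsHierarchyEntry type list at the reserved
    'stsd' marker: transformations listed before it are applied before decoding,
    those after it are applied after decoding. Without 'stsd', everything
    listed applies after decoding."""
    if "stsd" not in types:
        return [], list(types)
    i = types.index("stsd")
    return types[:i], types[i + 1:]

pre, post = split_transform_chain(["vcne", "stsd", "ppfi"])
assert pre == ["vcne"]    # decrypt before the decoder runs
assert post == ["ppfi"]   # post-processing filter after decoding
```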
In a variant, instead of using the reserved value "stsd" in the sample_group_description_type[] array of the EssentialDescriptionHierarchyEntry to distinguish between pre-decoding and post-decoding basic sample groups, the nature of a basic sample group (pre-decoding or post-decoding) is determined from the definition associated with the four-character code (4CC) stored in the grouping_type parameter of the basic SampleGroupDescriptionBox and identifying the conversion or descriptive attribute. For example, it may be defined that "clip" identifies a clipping/clean aperture conversion, and that this conversion is a post-decoding conversion. As another example, it may be defined that "cenc" identifies a common encryption method (e.g., AES-CTR mode full sample and video NAL sub-sample encryption), and that this conversion is a pre-decoding conversion.
Thus, sample_group_description_type is defined as follows: it indicates the four-character code of a basic sample group description applied to the associated samples. These types are listed in order, i.e., any potential sample processing described by the sample group of type sample_group_description_type[i] is applied before any potential sample processing described by the sample group of type sample_group_description_type[i+1]. All pre-decoding basic sample group descriptions are listed first, followed in order by all post-decoding basic sample group descriptions.
In another variant, the nature of a basic sample group (pre-decoding or post-decoding) is explicitly signaled using a parameter in the basic SampleGroupDescriptionBox. For example, the parameter may correspond to the "flags" parameter in the box header of the basic SampleGroupDescriptionBox. A new flag value pre_decoding_group_description, e.g. of value 4 (0b100), may be defined. When set to 1, the flag indicates that the basic sample group is a pre-decoding basic sample group. Otherwise, when set to 0, the flag indicates that the basic sample group is a post-decoding basic sample group.
As for the previous variant, all pre-decoding basic sample group descriptions are thus listed first in the sample_group_description_type[] array of the EssentialDescriptionHierarchyEntry, followed in order by all post-decoding basic sample group descriptions.
In another variant, three different natures of basic sample groups can be distinguished: pre-decoding conversions, post-decoding conversions, and descriptive attributes. The nature of a basic sample group may be signaled using a method similar to any of the variants described above (i.e., as part of the semantics of the four-character code that identifies the sample group type, or by using a parameter in the basic SampleGroupDescriptionBox). In the case where the "flags" parameter in the box header of the basic SampleGroupDescriptionBox is used to signal the nature of a basic sample group, the following 2-bit flag values (corresponding to bits 3 and 4 of the "flags" parameter) may be defined:
- a descriptive_group_description of value 0 (0b00), which when set indicates that the basic sample group is a descriptive basic sample group,
- a pre_decoding_group_description of value 1 (0b01), which when set indicates that the basic sample group is a pre-decoding basic sample group,
- a post_decoding_group_description of value 2 (0b10), which when set indicates that the basic sample group is a post-decoding basic sample group, and
- a reserved value of 3 (0b11).
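Assuming the two nature bits occupy the third and fourth least-significant bits of the flags field (a bit-numbering assumption; the text above only says "bits 3 and 4"), a reader might extract them as follows:

```python
# Symbolic names for the 2-bit nature values listed above.
DESCRIPTIVE, PRE_DECODING, POST_DECODING, RESERVED = 0b00, 0b01, 0b10, 0b11

def sample_group_nature(flags):
    """Extract the 2-bit nature field from a SampleGroupDescriptionBox
    'flags' value, assuming it occupies bits 2-3 (0-indexed), i.e. the
    third and fourth least-significant bits."""
    return (flags >> 2) & 0b11

assert sample_group_nature(0b0100) == PRE_DECODING
```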
According to this variant, all pre-decoding basic sample group descriptions are therefore listed first in the sample_group_description_type[] array of the EssentialDescriptionHierarchyEntry, followed in order by all descriptive basic sample groups, followed by all post-decoding basic sample group descriptions.
In an alternative, all pre-decoding basic sample group descriptions are listed first in the sample_group_description_type[] array of the EssentialDescriptionHierarchyEntry, followed in order by all post-decoding basic sample groups, followed by all descriptive basic sample group descriptions.
The nature of the sample properties defined in the file by the basic sample groups, and the order of these basic sample groups, should be exposed through MIME sub-parameters. This informs the parser or reader of the additional features it is expected to support to process the file. The basic sample groups to be applied before decoding are listed, in their order of application (as given in the "esgh" sample group description), in the "codecs" sub-parameter, to ensure maximum compatibility with existing practice. The other basic sample groups are listed in a new sub-parameter (e.g., called "essential") that takes as value a comma-separated list of the four-character codes corresponding to the grouping_types of the basic sample groups to be applied after decoding.
In other words, when there are basic sample group descriptions for a track:
- the four-character codes of the basic sample group descriptions applied before the decoding process are listed in the "codecs" sub-parameter, in their order of application before the codec configuration. A dot (".") is used to separate the individual basic sample group descriptions; and
- the four-character codes of the basic sample group descriptions applied after the decoding process may be listed in the "essential" sub-parameter, in their order of application. The basic sample group description values are separated by dots.
As an example, samples that were encrypted prior to being encoded may be signaled by a basic sample group of grouping_type "vcne". If the same samples also require a post-processing filter to be applied after being decoded, the post-processing can be signaled by a basic sample group of grouping_type "ppfi". According to this example, and as described above, the EssentialDescriptionHierarchyEntry is defined and the transformations are listed in the order ["vcne", "stsd", "ppfi"] to signal the order of the nested transformations.
Thus, the "codecs" MIME type sub-parameter may be:
codecs=vcne.hvc1.1.6.L186.80
This signals to the reader that the samples have been encrypted and must be decrypted, according to the conversion identified by the 4CC "vcne", before being decoded by a decoder conforming to the codec and profile/level identified by "hvc1.1.6.L186.80".
And the "essential" MIME type sub-parameter is:
essential=ppfi
This signals to the reader that the post-processing filter identified by the basic sample group "ppfi" should be applied to the samples after decoding and before rendering.
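Building the two sub-parameters from the ordered transform chain might look like this (an illustrative sketch; the function name is hypothetical, and the dot separator for "essential" follows the convention described above):

```python
def mime_subparameters(chain, codec_string):
    """Derive the 'codecs' and 'essential' MIME sub-parameter values
    from an ordered transform chain containing the reserved 'stsd'
    marker at the position of the decoding process."""
    i = chain.index("stsd")
    pre, post = chain[:i], chain[i + 1:]
    codecs = ".".join(pre + [codec_string])  # pre-decoding groups + codec
    essential = ".".join(post)               # post-decoding groups
    return codecs, essential

codecs, essential = mime_subparameters(
    ["vcne", "stsd", "ppfi"], "hvc1.1.6.L186.80")
# codecs == "vcne.hvc1.1.6.L186.80", essential == "ppfi"
```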
In an alternative, the EssentialDescriptionHierarchyEntry sample group description indicates the order of the processing applied to both basic and non-basic sample group descriptions for a given sample. A non-basic sample group description may be a post-decoding conversion or descriptive information.
In a variant, rather than describing the processing order of the sample groups in the EssentialDescriptionHierarchyEntry sample group description, it may be declared as a SampleGroupDescriptionHierarchyBox defined in the SampleTableBox or in the TrackFragmentBox, as follows:
wherein the same syntax and semantics as those of the EssentialDescriptionHierarchyEntry are used.
In this case, the processing order of the sample groups declared in the SampleGroupDescriptionHierarchyBox of a TrackFragmentBox replaces the processing order declared in the SampleGroupDescriptionHierarchyBox of any previous TrackFragmentBox with the same trackID, and in the SampleTableBox of the TrackBox with the same trackID.
According to some embodiments of the present invention, and as an alternative to the embodiments illustrated in Fig. 3 and 4, a basic sample group is used, instead of defining a sample description box ("stsd") in a track fragment box ("traf"), to declare codec changes within a track and dynamic sample descriptions or sample entries within a track.
A new basic sample group description with grouping_type "stsd" is defined, whose entries are sample entries (i.e., SampleEntry boxes providing a description of the samples for a given coding format), and samples are assigned to the correct entries using a SampleToGroupBox with the same grouping_type or, by default, using the default_group_description_index parameter of the basic SampleGroupDescriptionBox.
According to a first variant, the sample description box "stsd" in the SampleTableBox of a track is an empty box (no sample entry box is defined inside), or no sample description box "stsd" is defined in the SampleTableBox of the track. All sample entries are declared in the basic sample group description box with grouping_type "stsd", which may be defined in the SampleTableBox of the track (i.e., in the TrackBox of the MovieBox), or in any movie fragment, or in both.
To allow the parser to retrieve the definition of the sample entry associated with the sample, the semantics of sample_description_index in the track fragment header "tfhd" and default_sample_description_index in the TrackExtendsBox "trex" can be redefined as follows:
- a value of 0 means that the mapping from samples to sample description indexes is done through the SampleToGroupBox of grouping type "stsd",
- a value greater than 0 and less than or equal to 0x10000 indicates the index of the entry in the sample group description of grouping type "stsd" defined in the TrackBox, 1 being the first entry, and
- a value strictly greater than 0x10000 gives the index (value - 0x10000) of the entry in the sample group description "sgpd" of grouping type "stsd" defined in the track fragment "traf", 1 being the first entry.
According to a second variant, sample entries may be declared in the sample description box "stsd" in the SampleTableBox of the track and/or in a basic sample group description box with grouping_type "stsd", which may be defined in any movie fragment. The basic sample group description box with grouping_type "stsd" cannot be defined in the SampleTableBox of the track.
To allow the parser to retrieve the definition of the sample entry associated with the sample, the semantics of sample_description_index in the track fragment header "tfhd" and default_sample_description_index in the TrackExtendsBox "trex" can be redefined as follows:
- a value of 0 means that the mapping from samples to sample description indexes is done through the SampleToGroupBox of grouping type "stsd",
- a value greater than 0 and less than or equal to 0x10000 indicates the index of the sample entry in the sample description box "stsd" of the track defined in the MovieBox, 1 being the first entry, and
- a value strictly greater than 0x10000 gives the index (value - 0x10000) of the entry in the sample group description "sgpd" of grouping type "stsd" defined in the track fragment "traf", 1 being the first entry.
According to a third variant, sample entries may be declared in the sample description box "stsd" in the SampleTableBox of the track, or in a basic sample group description box with grouping_type "stsd", or in both. The basic sample group description box with grouping_type "stsd" may be defined in the SampleTableBox of the track, or in any movie fragment, or in both.
To allow the parser to retrieve the definition of the sample entry associated with the sample, the semantics of sample_description_index in the track fragment header "tfhd" and default_sample_description_index in the TrackExtendsBox "trex" can be redefined as follows:
- a value of 0 means that the mapping from samples to sample description indexes is done through the SampleToGroupBox of grouping type "stsd",
- a value greater than 0 and less than or equal to 0x10000 indicates the index of the sample entry in the sample description box "stsd" of the track defined in the MovieBox, 1 being the first entry,
- a value strictly greater than 0x10000 and less than or equal to 0x20000 gives the index (value - 0x10000) of the entry in the sample group description of grouping type "stsd" defined in the TrackBox, 1 being the first entry, and
- a value strictly greater than 0x20000 gives the index (value - 0x20000) of the entry in the sample group description "sgpd" of grouping type "stsd" defined in the track fragment "traf", 1 being the first entry.
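The three index rules above can be sketched as a resolution function (a minimal illustration with hypothetical entry lists; not an actual ISOBMFF implementation):

```python
def resolve_sample_entry(index, moov_stsd, trak_sgpd_stsd, traf_sgpd_stsd):
    """Resolve a (default_)sample_description_index value following the
    redefined semantics of the third variant. Returns (source, entry)."""
    if index == 0:
        # Mapping is done via the SampleToGroupBox of grouping type 'stsd'.
        return ("stsd-sample-to-group-mapping", None)
    if index <= 0x10000:
        # Index into the 'stsd' box of the track in the MovieBox (1-based).
        return ("moov/trak/stsd", moov_stsd[index - 1])
    if index <= 0x20000:
        # Index into the sgpd('stsd') defined in the TrackBox.
        return ("trak/sgpd('stsd')", trak_sgpd_stsd[index - 0x10000 - 1])
    # Index into the sgpd('stsd') defined in the track fragment.
    return ("traf/sgpd('stsd')", traf_sgpd_stsd[index - 0x20000 - 1])
```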
Thus, according to these embodiments, the present invention provides a method for encapsulating media data units, the method being performed by a server and comprising:
identifying a set of media data units based on parameters independent of the encapsulation,
obtaining an indication of a transition to be applied to each media data unit in the set of media data units before or after parsing or decoding the encapsulated media data units, an
Media data units are encapsulated in one or more media files, the media data units in the set of media data units being grouped in groups associated with indications of conversion.
The media data unit may be a sample.
According to some embodiments, the one or more media files further comprise additional indications that signal to the client to parse, decode or render the set of media data units only when the client already has knowledge about the indication of the conversion.
Still according to some embodiments, the steps of identifying a set of media data units, obtaining an indication of a transformation, and encapsulating the media data units are repeated, the method further comprising obtaining an order in which the transformation is applied, and further comprising encapsulating an information item representing the order.
Still according to some embodiments, one or more of the transformations and sequences may be provided as attributes of one or more media files.
Still further in accordance with the foregoing embodiments, the present invention provides a method for parsing a packaged media data unit, the method performed by a client and comprising:
an indication of a transition to be applied to each media data unit in the group of media data units before or after parsing or decoding the encapsulated media data unit is obtained,
obtaining the encapsulated media data units in the set of media data units;
the obtained encapsulated media data units are parsed from the one or more media files while applying a transformation to each of the obtained encapsulated media data units in accordance with the obtained indication, before parsing or decoding the obtained encapsulated media data units.
Again, the media data units may be samples.
According to some embodiments, the one or more media files further comprise additional indications that signal to the client to parse, decode or render the encapsulated media data units in the set of media data units only when the client has knowledge about the indication of the conversion.
Still according to some embodiments, the steps of obtaining an indication of the conversion, obtaining the encapsulated media data units, parsing the obtained encapsulated media data units, and applying the conversion are repeated, the method further comprising obtaining an order of applying the conversion from the one or more media files, wherein the conversion is applied according to the obtained order.
Still according to some embodiments, one or more transformations and sequences are obtained from attributes of one or more media files.
The benefit of using a basic sample group description for sample entries is that it allows track fragments to be de-associated from the sample description index (given in the track fragment header), avoiding the creation of a new track fragment when the sample description changes.
More generally, signaling essential features may be useful for signaling to a reader other attributes, applied to various sample sets of the presentation (tracks, entity groups), whose support (parsing and understanding) is essential to render or process the sample set or the overall presentation.
According to particular embodiments, the tracks may be grouped together to form one or more groups of tracks, where each group shares a particular characteristic, or the tracks within a group have a particular relationship signaled by a particular 4CC value in the track_group_type parameter. The track group is declared by defining a TrackGroupTypeBox box having the same track_group_type parameter and the same group identifier track_group_id within each TrackBox ("trak") box of tracks belonging to the track group.
A basic flag value (e.g., value = 0x2) may be defined in the flags of the TrackGroupTypeBox box to signal to the reader that the semantics of a particular track group, having specific values of the track_group_type parameter and of track_group_id, should be supported (parsed and understood) in order to render or process the sample set formed by the track group. If the basic flag is set for a track group and the parser does not understand the semantics of the track group, then the parser should not render or process any track belonging to the track group.
When the basic flag is set in the TrackGroupTypeBox box of a track having a specific track_group_type parameter value, the basic flag should also be set in all TrackGroupTypeBox boxes of all other tracks belonging to the same track group having the same specific track_group_type parameter value.
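The consistency constraint above can be sketched as a check over the boxes of a file (a hypothetical in-memory representation of TrackGroupTypeBox fields; the 0x2 flag value follows the example above):

```python
def check_essential_track_group_flags(track_group_boxes):
    """Verify that, within each (track_group_type, track_group_id) group,
    the basic/essential flag (assumed value 0x2) is either set in every
    TrackGroupTypeBox of the group or in none of them."""
    ESSENTIAL = 0x2
    groups = {}
    for box in track_group_boxes:
        key = (box["track_group_type"], box["track_group_id"])
        groups.setdefault(key, set()).add(bool(box["flags"] & ESSENTIAL))
    # Consistent if every group saw a single flag state.
    return all(len(states) == 1 for states in groups.values())
```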
To illustrate, signaling a track group of type "ster" that forms a stereoscopic pair suitable for playback on a stereoscopic display, or signaling a track group of type "2dsr" whose tracks have a two-dimensional spatial relationship (e.g., corresponding to spatial portions of a video source), are examples of track groups that may benefit from this basic signaling.
In a variant, if the basic flag is set for a track group in the presentation and the parser does not understand the semantics of that track group, signaled by the track_group_type parameter, then the parser should not render or process the overall presentation.
In another variant, a basic track group may be declared by defining a particular track group with a particular track group type (e.g., equal to "industrial") to signal to the reader that all tracks belonging to the track group are basic and should be supported (parsed and understood) to be rendered or processed together, i.e., the reader should support (parse and understand) all the sample entries defined in the SampleDescriptionBox ("stsd") box of each TrackBox ("trak") of each track of the track group.
According to another embodiment, entities (i.e., tracks and/or items) may be grouped together to form one or more entity groups, where each group shares a particular characteristic, or the entities within a group have a particular relationship signaled by a specific 4CC value in a grouping_type parameter. In contrast to tracks, which represent timed sequences of samples and are described by the MovieBox ("moov") box and the set of TrackBox ("trak") boxes, items represent untimed media data described by the MetaBox ("meta") box and its hierarchy of boxes (e.g., including the ItemInfoBox ("iinf") box, the ItemLocationBox ("iloc") box, etc.).
Entity groups are declared by defining an EntityToGroupBox box, with a specific grouping_type parameter, a group_id, and a list of entity identifiers (track_IDs or item_IDs), within a MetaBox ("meta") box. The MetaBox ("meta") box may be located at various levels, such as at file level (i.e., at the same level as the MovieBox ("moov") box), at track level (i.e., within the TrackBox ("trak") box), at movie fragment level (i.e., within the MovieFragmentBox ("moof") box), or at track fragment level (i.e., within the TrackFragmentBox ("traf") box).
A basic flag value (e.g., value = 0x1) may be defined in the flags of the EntityToGroupBox box to signal to the reader that the semantics of the entity group declared by the EntityToGroupBox box, with specific values of the grouping_type parameter and of the group_id, should be supported (parsed and understood) in order to render or process the set of samples and items formed by the entity group. If the basic flag is set for an entity group and the parser does not understand the semantics of the entity group, then the parser should not render or process any of the tracks or items belonging to the entity group.
In a variant, if the basic flag is set for an entity group in the presentation and the parser does not understand the semantics of the entity group, signaled by the grouping_type parameter, then the parser should not render or process the overall presentation.
In another variant, a basic entity group may be declared by defining a specific entity group with a specific grouping_type (e.g., equal to "ethyl") to signal to the reader that all tracks and/or items belonging to the entity group are basic and should be supported (parsed and understood) to be rendered or processed together, i.e., all the sample entries defined in the SampleDescriptionBox ("stsd") box of each TrackBox belonging to the entity group, and all the items belonging to the entity group and defined in ItemInfoBox ("iinf") boxes, should be supported (parsed and understood) by the reader.
More generally, signaling essential features may be useful for signaling to a reader the data structures or boxes whose support (parsing and understanding) is essential to render or process a presentation or a portion of a presentation.
According to another embodiment, the track may be associated with an edit list that provides an explicit timeline mapping of the track. The individual entries of the edit list may define a portion of a track timeline: by mapping a portion of the composition timeline, or by indicating a "null" time (the portion of the presentation timeline that maps to no media, "null" edit), or by defining a "dwell" in which a single point in time in the media is maintained for a period of time.
A basic flag value (e.g., value = 0x2) may be defined in the flags of the EditListBox ("elst") box to signal to the reader that the explicit timeline mapping defined by the edit list should be supported (parsed and applied) in order to render or process the associated track, and should not be ignored by the parser.
According to yet another embodiment, a basic flag value (e.g., value = 0x800000) may be defined in the flags of the TrackHeaderBox ("tkhd") box of the TrackBox ("trak") box to signal to the reader that the corresponding track should be supported (parsed and understood) in order to render or process the presentation, i.e., all the sample entries defined in the SampleDescriptionBox ("stsd") box of the TrackBox of the track should be supported (parsed and understood) by the reader to process the presentation.
According to another embodiment, a basic flag value (e.g., value = 0x1) may be defined in the flags of an ItemInfoEntry ("infe") box in the ItemInfoBox ("iinf") box to signal to the reader that the corresponding item and all of its essential item properties should be supported (parsed and understood) in order to render or process the presentation.
More generally, a basic flag value may be defined in any full box (i.e., any box with a flags parameter) of ISOBMFF and its derived specifications to signal to the reader that the box should be supported (parsed and understood) by the reader, and should not be ignored if not supported.
Fig. 7 is a block diagram illustrating an example of steps performed by a client or reader to obtain data according to some embodiments of the invention.
In a variant of the embodiments described above, instead of prohibiting the reader from decoding any track for which there is an unrecognized sample group description marked as basic, the following steps illustrate an alternative in which only the samples associated with the properties of such a basic sample group may be ignored by the reader.
In step 700, the reader obtains a sample by parsing the ISOBMFF file or segment file.
Next, in step 705, the reader checks whether the sample belongs to a sample group that is signaled as a basic sample group according to some of the embodiments described above. If the sample group is basic, the reader checks whether the grouping type of the basic sample group is known (step 710).
If the grouping type of the basic sample group is known, then the sample (and the other samples of the group) may be processed (step 715); otherwise, the sample is ignored (step 720).
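The steps above can be sketched as follows (a minimal illustration; the sample and group representations are hypothetical dictionaries, not an actual ISOBMFF parser):

```python
def process_samples(samples, known_grouping_types):
    """Steps 700-720: keep a sample if it does not belong to an
    essential ('basic') sample group, or if the group's grouping type
    is recognized; otherwise ignore the sample."""
    kept = []
    for sample in samples:             # step 700: samples parsed from the file
        group = sample.get("group")    # step 705: basic sample group?
        if group and group["essential"]:
            if group["grouping_type"] in known_grouping_types:  # step 710
                kept.append(sample)    # step 715: process the sample
            # else step 720: ignore the sample
        else:
            kept.append(sample)
    return kept
```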
For example, in certain use cases, Supplemental Enhancement Information (SEI) typically carried in an encoded video bitstream (e.g., for transmitting High Dynamic Range (HDR), Virtual Reality (VR), or film grain type media data) may be transmitted in a sample group description marked as basic.
It is recalled that a media bitstream may carry additional information that may be used to assist a player in the processing of the media file (e.g., for decoding, display, or other purposes). For example, a video bitstream may be provided with SEI messages defined in a standard specification (e.g., ISO/IEC 23002-7). This additional information may be used by applications, and may be specified by other standards or guidelines, or by interoperability points of some consortia (e.g., ATSC, DVB, ARIB, etc.) using the MPEG specifications (e.g., compression, encapsulation, or description specifications). For example, some DVB specifications mandate the alternative transfer characteristics SEI for HDR applications. Some SCTE specifications require additional information to carry closed caption data with the video stream. With the proliferation of such additional information and ever more complex media streams, some SEI messages or additional information may become mandatory or required to render a media presentation.
According to a particular embodiment, such SEI information is provided as a VisualSampleGroupEntry and is associated with a sample group, and the sample group is signaled as essential. For example, grouping type values may be defined and reserved, one per important or mandatory SEI: e.g., one value for the alternative transfer characteristics SEI for HDR, one value for the frame packing SEI for stereoscopic applications, and one value for the film grain characteristics SEI for improving decoded images. The payloads of these sample group entries would correspond to the payloads of the additional information. For example, for an SEI message, the NAL unit corresponding to the SEI message would be provided in the VisualSampleGroupEntry, in a SampleGroupDescriptionBox ("sgpd") box with a grouping type indicating the SEI type. Using the basic sample group and the MIME sub-parameter, the application then handles the media file with the awareness that some additional features of the codec should be supported. When a given set of samples shares the same additional information, all the NAL units corresponding to these SEIs may be provided at once using a single VisualSampleGroupEntry. This allows the content creator to indicate as essential the SEI messages that a player is expected to process. This is especially relevant for SEI messages that are not listed in the sample description (these messages may only appear as arrays of NAL units in the decoder configuration record within a sample entry). Having SEI messages in sample groups rather than in the NAL unit arrays of the decoder configuration further allows SEI persistence to be handled easily, for example when an SEI applies to some samples rather than systematically to the whole sequence. Moreover, having SEIs in sample groups allows the content creator to indicate, at the encapsulation end, the important or required SEIs that a player, client, or application is expected to support to render the media file.
Furthermore, when exposed in the MIME sub-parameter, a parser or application can decide whether it can support the media presentation.
FIG. 8 is a schematic block diagram of a computing device 800 for implementing one or more embodiments of the invention. The computing device 800 may be a device such as a microcomputer, workstation, or lightweight portable device. Computing device 800 includes a communication bus 802 that connects to:
a Central Processing Unit (CPU) 804, such as a microprocessor or the like;
random Access Memory (RAM) 808 for storing executable code of the method of an embodiment of the invention and registers adapted to record variables and parameters necessary for implementing the method of packaging, indexing, unpacking and/or accessing data, the memory capacity of which can be extended, for example by means of an optional RAM connected to an expansion port;
a Read Only Memory (ROM) 806 for storing a computer program for implementing an embodiment of the present invention;
a network interface 812, which in turn is typically connected to a communication network 814 through which digital data to be processed is transmitted or received. The network interface 812 may be a single network interface or be comprised of a collection of different network interfaces (e.g., wired and wireless interfaces, or different kinds of wired or wireless interfaces). Under control of a software application running in CPU 804, data is written to or read from a network interface for transmission or reception;
A User Interface (UI) 816 may be used to receive input from a user or to display information to a user;
-a Hard Disk (HD) 810; and/or
An I/O module 818 for receiving/transmitting data from/to an external device, such as a video source or display, etc.
Executable code may be stored in the read-only memory 806, on the hard disk 810, or on a removable digital medium such as a disk. According to a variant, the executable code of the programs may be received by means of the communication network, via the network interface 812, in order to be stored in one of the storage means of the computing device 800, such as the hard disk 810, before being executed.
The central processing unit 804 is adapted to control and direct the execution of instructions or portions of software code of one or more programs according to embodiments of the present invention, which instructions are stored in one of the aforementioned storage devices. After power is turned on, CPU 804 is able to execute instructions from main RAM memory 808, for example, in connection with a software application, after loading those instructions from program ROM 806 or Hard Disk (HD) 810. Such software applications, when executed by CPU 804, cause the steps of the flowcharts shown in the previous figures to be performed.
In this embodiment, the device is a programmable device that implements the invention using software. However, the invention may alternatively be implemented in hardware (e.g., in the form of an application specific integrated circuit or ASIC).
Although the present invention has been described above with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications within the scope of the present invention will be apparent to those skilled in the art.
Many further modifications and variations will occur to those skilled in the art upon reference to the foregoing illustrative embodiments, which are given by way of example only and are not intended to limit the scope of the invention, which is determined solely by the appended claims. In particular, different features from different embodiments may be interchanged where appropriate.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (8)

1. A method for encapsulating media data, the method performed by a server and comprising:
Identifying a portion of the media data or a collection of information items related to a portion of the media data according to parameters independent of the encapsulation; and
encapsulating the portion of the media data or the set of information items in a media file as entities, the entities being grouped into a set of entities associated with a first indication representing the parameter,
wherein the media file comprises a second indication that signals to a client to parse the set of entities only if the client has knowledge about the first indication.
2. A method for parsing encapsulated media data, the method being performed by a client and comprising:
determining that the encapsulated media data comprises a second indication signaling that a set of entities is to be parsed only if the client has knowledge of a first indication associated with that set of entities;
obtaining a reference to a set of entities to be parsed;
obtaining the first indication associated with the referenced set of entities; and
in the case where the client does not have knowledge about the obtained first indication, ignoring the referenced set of entities when parsing the encapsulated media data.
3. The method of claim 1 or 2, wherein the second indication further signals that the media data can be rendered only if the client has knowledge about the first indication.
4. The method according to any one of claims 1 to 3, wherein an entity is a sample or an item of supplementary information describing the media data.
5. The method of claim 4, wherein the entity is a sample, and wherein the set of entities is a set of corrupted samples.
6. A computer program product for a programmable device, the computer program product comprising instructions for performing the steps of the method according to any one of claims 1 to 5 when the program is loaded and executed by the programmable device.
7. A non-transitory computer readable storage medium storing instructions of a computer program for implementing the method of any one of claims 1 to 5.
8. A processing device comprising a processing unit configured to perform the steps of the method according to any one of claims 1 to 5.
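The conditional parsing behaviour recited in claims 1 and 2 can be sketched as follows. This is an illustrative, non-normative sketch only: the names `EntityGroup`, `grouping_type`, and `requires_understanding`, and the four-character grouping type `"brok"` (suggested by the corrupted-sample grouping of claim 5), are hypothetical conveniences and are not identifiers defined by the application or by ISOBMFF.

```python
# Hypothetical sketch of claims 1-2: a client parses a media file's entity
# groups, ignoring any group that (a) carries the "second indication"
# (requires_understanding) and (b) has a "first indication" (grouping_type)
# the client does not know.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EntityGroup:
    grouping_type: str                    # the "first indication" (the parameter)
    entity_ids: List[int]
    requires_understanding: bool = False  # set when the "second indication" applies

@dataclass
class MediaFile:
    entity_groups: List[EntityGroup] = field(default_factory=list)

# Grouping types this particular client understands (hypothetical value).
KNOWN_GROUPING_TYPES = {"brok"}

def parse(media_file: MediaFile) -> List[EntityGroup]:
    """Return only the entity groups the client is allowed to process."""
    parsed = []
    for group in media_file.entity_groups:
        if group.requires_understanding and group.grouping_type not in KNOWN_GROUPING_TYPES:
            # Second indication present and first indication unknown:
            # ignore the whole set of entities (last step of claim 2).
            continue
        parsed.append(group)
    return parsed
```

A group without the second indication is parsed regardless of whether its grouping type is known, matching the claim language under which the restriction applies only to groups the second indication designates.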
CN202280046511.XA 2021-06-29 2022-06-24 Method, apparatus and computer program for dynamically encapsulating media content data Pending CN117581551A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB2109390.1 2021-06-29
GB2113874.8A GB2608469A (en) 2021-06-29 2021-09-28 Method, device, and computer program for dynamically encapsulating media content data
GB2113874.8 2021-09-28
PCT/EP2022/067359 WO2023274877A1 (en) 2021-06-29 2022-06-24 Method, device, and computer program for dynamically encapsulating media content data

Publications (1)

Publication Number Publication Date
CN117581551A 2024-02-20

Family

ID=89890506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280046511.XA Pending CN117581551A (en) 2021-06-29 2022-06-24 Method, apparatus and computer program for dynamically encapsulating media content data

Country Status (1)

Country Link
CN (1) CN117581551A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination