WO2020058494A1 - Method, device, and computer program for improving transmission of encoded media data - Google Patents


Info

Publication number
WO2020058494A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
samples
media data
metadata
box
Prior art date
Application number
PCT/EP2019/075372
Other languages
French (fr)
Inventor
Franck Denoual
Frédéric Maze
Naël OUEDRAOGO
Jean LE FEUVRE
Original Assignee
Canon Kabushiki Kaisha
Canon Europe Limited
Priority date
Filing date
Publication date
Application filed by Canon Kabushiki Kaisha and Canon Europe Limited
Publication of WO2020058494A1


Classifications

    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/440227 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display, by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H04L 47/6275 Queue scheduling characterised by scheduling criteria for service slots or service orders, based on priority
    • H04L 65/61 Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • H04L 65/70 Media network packetisation
    • H04L 69/22 Parsing or analysis of headers
    • H04N 19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N 19/46 Embedding additional information in the video signal during the compression process
    • H04N 21/234327 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements, by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H04N 21/234345 Processing of video elementary streams involving reformatting operations, the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • H04N 21/440245 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display, the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain, e.g. in time segments
    • H04L 2212/00 Encapsulation of packets

Definitions

  • the present invention relates to methods and devices for improving transmission of encoded media data and to methods and devices for encapsulating and parsing media data.
  • the invention relates to encapsulating, parsing and streaming media content, e.g. according to the ISO Base Media File Format as defined by the MPEG standardization organization, to provide a flexible and extensible format that facilitates interchange, management, editing, and presentation of groups of media content and to improve their delivery, for example over an IP network such as the Internet using an adaptive HTTP streaming protocol.
  • the International Organization for Standardization Base Media File Format (ISO BMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media data bitstreams either for local storage or transmission via a network or via another bitstream delivery mechanism.
  • This file format has several extensions, e.g. Part-15, ISO/IEC 14496-15 that describes encapsulation tools for various NAL (Network Abstraction Layer) unit based video encoding formats. Examples of such encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding) or L-HEVC (Layered HEVC).
  • another of these file format extensions is the Image File Format, ISO/IEC 23008-12, which describes encapsulation tools for still images or sequences of still images such as HEVC Still Images.
  • This file format is object-oriented. It is composed of building blocks called boxes (or data structures characterized by a four character code) that are sequentially or hierarchically organized and that define descriptive parameters of the encoded timed media data bitstream such as timing and structure parameters.
  • the overall presentation over time is called a movie.
  • the movie is described by a movie box (with four character code ‘moov’) at the top level of the media or presentation file.
  • This movie box represents an initialization information container containing a set of various boxes describing the presentation.
  • Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media data pertaining to the presentation (frames of video, for example). Within each track, each timed unit of data is called a sample; this might be a frame of video, audio or timed metadata. Samples are implicitly numbered in sequence. The actual sample data are in boxes called Media Data Boxes (with four character code ‘mdat’) at the same level as the movie box. The movie may also be fragmented, i.e. organized temporally as a movie box containing information for the whole presentation followed by a list of movie fragment and Media Data box pairs.
  • within a movie fragment (box with four-character code ‘moof’) there is a set of track fragments (box with four-character code ‘traf’), zero or more per movie fragment.
  • the track fragments in turn contain zero or more track run boxes (with four character code ‘trun’), each of which documents a contiguous run of samples for that track fragment.
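The box layout described above can be illustrated with a short Python sketch (not part of the patent; the function and variable names are illustrative) that scans the top-level boxes of an ISOBMFF buffer by reading each box's 32-bit size and four-character code:

```python
import struct

def parse_boxes(data):
    """Return (fourcc, offset, size) for each top-level box in an
    ISOBMFF byte buffer. Minimal sketch: handles only 32-bit box
    sizes (no 64-bit 'largesize', no size==0 'extends to EOF')."""
    boxes, pos = [], 0
    while pos + 8 <= len(data):
        size, = struct.unpack('>I', data[pos:pos + 4])
        if size < 8:          # malformed or unsupported size; stop
            break
        fourcc = data[pos + 4:pos + 8].decode('ascii')
        boxes.append((fourcc, pos, size))
        pos += size
    return boxes

# A toy file: empty 'ftyp', 'moov' and 'mdat' boxes (header only, size 8).
toy = b''.join(struct.pack('>I', 8) + t for t in (b'ftyp', b'moov', b'mdat'))
print([b[0] for b in parse_boxes(toy)])  # ['ftyp', 'moov', 'mdat']
```

Container boxes such as ‘moov’ or ‘moof’ would be parsed by applying the same scan recursively to their payload bytes.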
  • the MPEG Common Media Application Format (MPEG CMAF, ISO/IEC 23000-19) derives from ISOBMFF and provides an optimized file format for streaming delivery.
  • CMAF specifies CMAF addressable media objects derived from encoded CMAF fragments, which can be referenced as resources by a manifest.
  • a CMAF fragment is an encoded ISOBMFF media segment, i.e. one or more Movie Fragment Boxes (‘moof’, ‘traf’, etc.) with their associated media data ‘mdat’ and other possible associated boxes.
  • CMAF also defines the CMAF chunk, which is a single pair of ‘moof’ and ‘mdat’ boxes.
  • CMAF also defines CMAF segments, which are addressable media resources containing one or more CMAF fragments.
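Since a CMAF chunk is just a (‘moof’, ‘mdat’) pair, grouping a flat list of top-level boxes into chunks can be sketched as follows (illustrative Python, not from the patent; it assumes a strictly alternating moof/mdat layout as in a CMAF chunk stream):

```python
def split_into_chunks(boxes):
    """Group a flat list of (fourcc, offset, size) top-level boxes
    into CMAF-style chunks, each a ('moof', 'mdat') pair. Sketch
    only: assumes a strictly alternating moof/mdat layout."""
    chunks, pending_moof = [], None
    for fourcc, off, size in boxes:
        if fourcc == 'moof':
            pending_moof = (off, size)
        elif fourcc == 'mdat' and pending_moof is not None:
            chunks.append({'moof': pending_moof, 'mdat': (off, size)})
            pending_moof = None
    return chunks

print(split_into_chunks([('moof', 0, 100), ('mdat', 100, 500),
                         ('moof', 600, 100), ('mdat', 700, 400)]))
```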
  • Media data encapsulated with ISOBMFF or CMAF can be used for adaptive streaming with HTTP.
  • MPEG DASH (for "Dynamic Adaptive Streaming over HTTP") and Smooth Streaming are HTTP adaptive streaming protocols that allow segment or fragment based delivery of media files.
  • the MPEG DASH standard (see "ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats") makes it possible to create an association between a compact description of the content(s) of a media presentation and the HTTP addresses. Usually, this association is described in a file called a manifest file or description file. In the context of DASH, this manifest file is also called the MPD file (for Media Presentation Description).
  • DASH defines several types of segments, mainly initialization segments, media segments or index segments.
  • An initialization segment contains setup information and metadata describing the media content, typically at least the ‘ftyp’ and ‘moov’ boxes of an ISOBMFF media file.
  • a media segment contains the media data.
  • the DASH manifest may provide segment URLs or a base URL to the file with byte ranges to segments for a streaming client to address these segments through HTTP requests.
  • the byte range information may be provided by index segments or by specific ISOBMFF boxes like the Segment Index Box ‘sidx’ or the SubSegment Index Box ‘ssix’.
  • the media presentation description contains DASH index segments (or indexed segments) describing, in terms of byte ranges, the encapsulated ISOBMFF movie fragments.
  • a mapping of time to byte ranges may be provided in DASH index segments (or indexed segments).
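A streaming client addresses such a byte range with an ordinary HTTP Range request. A minimal Python sketch (the URL is hypothetical; in DASH the range would come from an index segment or a ‘sidx’/‘ssix’ box):

```python
import urllib.request

# Hypothetical media URL; the byte range below stands in for one
# sub-segment's range taken from an index.
req = urllib.request.Request('https://example.com/media.mp4')
req.add_header('Range', 'bytes=1000-1999')   # fetch one sub-segment
print(req.get_header('Range'))  # bytes=1000-1999
# urllib.request.urlopen(req) would then return just those bytes
# (HTTP 206 Partial Content) from a range-capable server.
```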
  • a mapping of levels (L0, L1, L2) to byte ranges may be provided, the levels being declared in a level assignment (‘leva’) box.
  • the level assignment may be based on a sample group (the assignment_type of the ‘leva’ box is set to the value 0 and its grouping_type is set, for example, to ‘tele’) describing sub-temporal layers.
  • the so-indexed sub-segments provide a list of byte ranges to get samples of a given sub-temporal layer.
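The mapping from levels to byte ranges can be sketched as follows (illustrative Python, not from the patent; it assumes the samples of each level are laid out contiguously, as a ‘ssix’-style index requires):

```python
def level_byte_ranges(samples):
    """Given samples laid out contiguously and tagged with a level
    (as with a 'leva'/'ssix' level assignment), return, per level,
    the byte range (start, end) covering that level's samples.
    Sketch: assumes each level occupies one contiguous run."""
    ranges = {}
    offset = 0
    for level, size in samples:  # (level, coded size in bytes)
        start, _ = ranges.get(level, (offset, offset))
        ranges[level] = (min(start, offset), offset + size)
        offset += size
    return ranges

# Three L0 samples followed by two L1 samples:
print(level_byte_ranges([(0, 10), (0, 12), (0, 8), (1, 20), (1, 16)]))
# {0: (0, 30), 1: (30, 66)}
```

A client wanting only the base temporal layer would then request the byte range of level 0 and skip the rest.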
  • each level may be described as a SubRepresentation element (e.g. an XML schema).
  • the present invention has been devised to address one or more of the foregoing concerns and more generally to improve transmission of encoded media data.
  • According to a first aspect of the invention there is provided a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
  • the fragment comprising a set of contiguous samples of the timed media data
  • a metadata item comprises a flag indicating whether a data offset is coded on a predetermined size or not, the data offset referring to the timed media data
  • the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
  • According to a second aspect of the invention there is provided a server for encapsulating timed media data, the timed media data being requested by a client, the server being configured for:
  • the fragment comprising a set of contiguous samples of the timed media data
  • the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable size, wherein the metadata comprises an indication information indicating whether a sample count field is present or not; and, encapsulating the timed media data and the generated metadata.
  • the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
  • According to a third aspect of the invention there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
  • the fragment comprising a set of contiguous samples of the timed media data
  • the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable coding size, wherein the metadata comprises a flag indicating the coding size of the configurable parameter;
  • the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
  • According to a fourth aspect of the invention there is provided a server for encapsulating timed media data, the timed media data being requested by a client, the server being configured for:
  • the fragment comprising a set of contiguous samples of the timed media data
  • generating metadata describing the obtained fragment, the metadata comprising structured metadata items, wherein a metadata item of the structured metadata items comprises an indication information indicating whether a composition time offset parameter is coded as a multiple of a sample duration or of a time scale;
  • the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
  • the samples of the set of contiguous samples of the timed media data are ordered according to a first ordering
  • the samples of the set of contiguous samples are encapsulated according to a second ordering, the second ordering depending on a priority level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation
  • the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
  • the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
  • each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
  • the samples of the set of contiguous samples are encapsulated using the generated metadata.
  • the method further comprises obtaining a priority map associated with the samples of the set of contiguous samples, the reordering information being determined as a function of the obtained priority map.
  • the format of the encapsulated timed media data is of the ISOBMFF type or of the CMAF type.
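One possible realisation of this priority-based reordering, with the reordering information recorded as a list of original sample indices, is sketched below (illustrative Python; the names and the exact form of the reordering information are assumptions, not the patent's normative encoding):

```python
def encapsulate_by_priority(samples, priority_map):
    """Reorder a fragment's samples by priority level (most important
    first) and produce the reordering information needed to restore
    the original (decoding) order upon decapsulation. The reordering
    info here is a list of original sample indices, one per
    encapsulated sample; a stable sort keeps decoding order within a
    priority level."""
    order = sorted(range(len(samples)), key=lambda i: (priority_map[i], i))
    reordered = [samples[i] for i in order]
    return reordered, order

def restore_order(reordered, reorder_info):
    """Client side: put received samples back into decoding order."""
    restored = [None] * len(reordered)
    for new_pos, orig_idx in enumerate(reorder_info):
        restored[orig_idx] = reordered[new_pos]
    return restored

samples = ['I0', 'B1', 'B2', 'B3', 'P4']
prio = [0, 2, 2, 2, 1]  # lower value = higher priority
enc, info = encapsulate_by_priority(samples, prio)
print(enc)                       # ['I0', 'P4', 'B1', 'B2', 'B3']
print(restore_order(enc, info))  # ['I0', 'B1', 'B2', 'B3', 'P4']
```

With this layout the I and P frames travel first, so a truncated or partially lost fragment still yields a decodable base.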
  • a method for encapsulating timed media data comprising:
  • the fragment comprising a set of contiguous samples of the timed media data, the samples of the set of contiguous samples being ordered according to a first ordering;
  • samples of the set of contiguous samples being encapsulated according to a second ordering, the second ordering depending on a level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation,
  • the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
  • the method of the invention makes it possible to reduce the description cost of fragmented media data, in particular of fragmented media data conforming to ISOBMFF, and to provide a flexible organisation (reordering) of the media data (samples) with limited signalling overhead.
  • Fragmenting the data and ordering the samples according to a level associated with each sample enable transmission of particular samples first.
  • a level may be, for example, a temporal level, a spatial level, a quality level, a level directed to a region of interest, or a priority level.
  • According to a sixth aspect of the invention there is provided a method for encapsulating encoded media data, the method comprising:
  • obtaining samples of the encoded media data, the obtained samples being ordered according to a first ordering; encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples, upon decapsulation; and generating reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
  • the method of the invention makes it possible to reduce the description cost of fragmented media data, in particular of fragmented media data conforming to ISOBMFF, and to provide a flexible organisation (reordering) of the media data (samples) with limited signalling overhead. Fragmenting the data and ordering the samples according to a priority level associated with each sample enable transmission of the most important samples first, which reduces freezing of the video display when temporal sublayers are split over fragments and transmission errors occur.
  • the method of the invention makes it possible for the number of different byte ranges with different FEC (forward error correction) settings to be lowered, hence simplifying and improving the FEC part.
  • the media data are timed media data and the obtained samples of the encoded media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
  • the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
  • each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
  • the obtained samples are encapsulated using the metadata associated with the samples.
  • the method further comprises obtaining a priority map associated with the obtained samples, the reordering information being determined as a function of the obtained priority map.
  • obtaining samples of the encoded media data comprises obtaining samples of the media data and encoding the obtained samples of the media data.
  • the priority levels are obtained from the encoding of the obtained samples of the media data. In an embodiment, the priority levels are determined as a function of dependencies between the obtained samples of the media data.
  • According to a seventh aspect of the invention there is provided a method for transmitting encoded media data from a server to a client, the media data being requested by the client, the method being carried out by the server and comprising encapsulating the encoded media data according to the method described above and transmitting, to the client, the encapsulated encoded media data.
  • the seventh aspect of the present invention has advantages similar to the first above-mentioned aspect.
  • According to an eighth aspect of the invention there is provided a method for processing encapsulated media data comprising:
  • obtaining samples of the encapsulated media data, the obtained samples of the encapsulated media data being ordered according to a second ordering
  • the eighth aspect of the present invention has advantages similar to the first above-mentioned aspect.
  • the media data are timed media data and the obtained samples of the encapsulated media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
  • the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
  • reordering the obtained samples comprises computing offsets as a function of the parameter values and of coding lengths of the encoded samples.
  • the method further comprises decoding the reordered samples.
  • the method is carried out in a client, the samples of the encapsulated media data and the reordering information being received from a server.
  • the format of the encapsulated media data is of the ISOBMFF type or of the CMAF type.
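On the client side, the offset computation mentioned above (deriving each sample's byte position in the received payload from the reordering information and the samples' coded sizes) might look like this (illustrative Python sketch; names and data layout are assumptions):

```python
def decoding_order_offsets(reorder_info, sizes):
    """For each sample in decoding order, compute its (offset, size)
    within the received, priority-ordered payload. reorder_info[i]
    is the original (decoding-order) index of the i-th received
    sample; sizes[i] is that sample's coded length in bytes."""
    # Byte offset of each received sample within the payload:
    recv_offsets, pos = [], 0
    for size in sizes:
        recv_offsets.append(pos)
        pos += size
    # Map original index -> (offset, size) in the payload:
    table = {}
    for recv_pos, orig_idx in enumerate(reorder_info):
        table[orig_idx] = (recv_offsets[recv_pos], sizes[recv_pos])
    return [table[i] for i in sorted(table)]

# Received order carries original samples 0, 4, 1, 2, 3:
print(decoding_order_offsets([0, 4, 1, 2, 3], [100, 40, 20, 20, 20]))
# [(0, 100), (140, 20), (160, 20), (180, 20), (100, 40)]
```

The parser can then read the samples in decoding order by following these offsets, without copying the payload.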
  • According to a ninth aspect of the invention there is provided a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
  • the ninth aspect of the present invention has advantages similar to the first above-mentioned aspect.
  • According to a tenth aspect of the invention there is provided a media storage device storing a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
  • the tenth aspect of the present invention has advantages similar to the first above-mentioned aspect.
  • According to an eleventh aspect of the invention there is provided a device for transmitting or receiving encapsulated media data comprising a processing unit configured for carrying out each of the steps of the method described above.
  • the eleventh aspect of the present invention has advantages similar to the first above-mentioned aspect.
  • the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system".
  • the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
  • a tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like.
  • a transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
  • Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages
  • Figure 2 describes the protocol stack according to embodiments of the invention
  • Figure 3 illustrates a typical client server system for media streaming according to embodiments of the invention
  • Figure 4 illustrates an example of processing carried out in media server and in media client, according to embodiments
  • Figure 5a illustrates an example of dependencies of video frames, that are to be taken into account for coding or decoding a frame
  • Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps
  • Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream
  • Figure 6b is an example of a data structure used for reordering samples
  • Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream in an encoded video stream
  • Figure 8 illustrates DASH index segments (or indexed segments) describing, in terms of byte ranges, encapsulated ISOBMFF movie fragments, wherein each subsegment comprises a mapping of levels to byte ranges;
  • Figures 9 and 10 illustrate examples of reordering and mapping of samples having levels associated therewith.
  • Figure 11 schematically illustrates a processing device configured to implement at least one embodiment of the present invention

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
  • a video bit-stream is usually organized into Group Of Pictures (GOP). This is the case for example with MPEG video compression standards like AVC/H.264 or HEVC/H.265.
  • GOP Group Of Pictures
  • a stream of samples B_N, with B_N having no dependencies on B_(N+1), N being an indication of a level (e.g. a temporal layer, a scalability layer, or a set of samples with a given priority level), is often encapsulated in a media file or media segments in the decoding order, for example as follows for a one second video:
  • the indicia indicate the composition or presentation order.
  • An expected layout may be based on priority level values, as follows:
  • one‘trun’ may be used each time the sample continuity is broken, i.e. each time a sample in the expected layout (ordered according to the priority level values) is not the sample following the previous sample in decoding order, as follows:
  • the ‘trun’ box is improved, in particular by adding a sample processing order that avoids such a repetition of the ‘trun’ box.
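The repetition problem can be quantified with a small sketch (illustrative Python, not from the patent): each break in decoding-order continuity in the priority-ordered layout starts a new run, hence another standard ‘trun’ box:

```python
def count_runs(layout):
    """Count the contiguous runs of samples, and hence the number of
    standard 'trun' boxes needed, for a given sample layout: a new
    run starts whenever a sample does not immediately follow its
    predecessor in decoding order."""
    runs, prev = 0, None
    for decoding_index in layout:
        if prev is None or decoding_index != prev + 1:
            runs += 1
        prev = decoding_index
    return runs

# Decoding order 0..7, laid out with high-priority samples first:
print(count_runs([0, 4, 1, 2, 3, 5, 6, 7]))  # 4
print(count_runs(list(range(8))))            # 1 (plain decoding order)
```

A single ‘trun’ carrying an explicit sample processing order replaces those repeated boxes, which is the signalling saving the improved box targets.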
  • ISO/IEC 14496-12 defines the Track Extends Box to define default values used by the movie fragments in a media file or in a set of media segments.
  • the track fragment header box ‘tfhd’ also sets up information and default values for the runs of samples contained in one movie fragment.
  • a run of samples is described in a ‘trun’ box with one or more parameters such as a number of samples, an optional data offset, optional dependency information related to the first sample in the run, and, for each sample in the run, optional sample_duration, sample_size, sample_flags (for dependency/priority), and composition time information.
  • a track fragment may contain zero or more standard ‘trun’ or compact track run ‘ctrn’ boxes.
  • the Track Run Pattern Box enables cyclic assignment of repetitive track run patterns to samples of track runs.
  • One or more track run patterns are specified in the ‘trup’ box. For each sample in a track run pattern, the sample_duration, sample_flags, sample_composition_time_offset and the number of bits to encode the sample_size are conditionally provided depending on the box flags.
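To make the flags-conditional field layout concrete, here is a sketch serialising a minimal version-0 ‘trun’ box carrying only per-sample sizes and an optional data offset (illustrative Python; the two flag values follow ISO/IEC 14496-12, and all other optional fields are omitted for brevity):

```python
import struct

TR_DATA_OFFSET  = 0x000001  # data-offset-present flag (ISO/IEC 14496-12)
TR_SAMPLE_SIZE  = 0x000200  # sample-size-present flag

def build_trun(sample_sizes, data_offset=None):
    """Serialise a minimal version-0 'trun' box: FullBox header,
    sample_count, optional signed data_offset, then one 32-bit
    sample_size per sample, each optional field signalled by the
    box flags. Sketch only, not a complete 'trun' writer."""
    flags = TR_SAMPLE_SIZE | (TR_DATA_OFFSET if data_offset is not None else 0)
    body = struct.pack('>I', (0 << 24) | flags)    # version(8) + flags(24)
    body += struct.pack('>I', len(sample_sizes))   # sample_count
    if data_offset is not None:
        body += struct.pack('>i', data_offset)     # signed data offset
    for s in sample_sizes:
        body += struct.pack('>I', s)               # sample_size
    return struct.pack('>I', 8 + len(body)) + b'trun' + body

box = build_trun([100, 40, 20], data_offset=120)
print(len(box), box[4:8])  # 32 b'trun'
```

Dropping a field from the flags shrinks every sample entry, which is why the compact and pattern-based variants (‘ctrn’, ‘trup’) can save so much description overhead.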
  • a sample processing order is indicated in an encapsulated media data file or in a companion file (e.g. a companion file referencing an encapsulated media data file) to give information about data significance of encapsulated data of the encapsulated media data file, the encapsulated data typically comprising media data and descriptive metadata, so that these encapsulated data may be handled appropriately.
  • a companion file e.g. a companion file referencing an encapsulated media data file
  • the sample processing order may be used at the server end to organise samples of a fragment of encoded media data, according to their priority, for example for transmitting the most important samples first.
  • the sample processing order is used to parse a received encapsulated stream and to provide a decodable stream.
  • the encapsulated media data may be directed to different kinds of media resources or media components such as an image sequence, one or more video tracks with or without associated audio tracks, auxiliary or metadata tracks.
  • the sample processing order associated with a file comprising encapsulated media data is defined in the ‘trun’ box.
  • Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages.
  • the client denoted 100 sends an HTTP message denoted 140 to the server denoted 110, through a connection denoted 130 established over a network denoted 120.
  • the client sends an HTTP request to the server that replies with an HTTP response.
  • Both HTTP request and HTTP response are HTTP messages.
  • HTTP messages can be directed to the exchange of media description information, the exchange of media configuration or description, or the exchange of actual media data.
  • the client may thus be a sender and a receiver of HTTP messages.
  • the server may be a sender and a receiver of HTTP messages.
  • HTTP requests are sent on a reliable basis while some HTTP responses may be sent on an unreliable basis.
  • a common use-case for the unreliable transmission of HTTP messages corresponds to the case according to which the server sends back to the client a media stream in an unreliable way.
  • the HTTP client could also send an HTTP request in an unreliable way, for example for sending a media stream to the server.
  • the HTTP client and the HTTP server can also negotiate that they will run in a reliable mode. In such a case, both HTTP requests and responses are sent in a reliable way.
  • Figure 2 illustrates an example of protocol stacks of a client 200, for example client 100 of Figure 1, and of a server 250, for example server 110 of Figure 1.
  • the same protocol stack exists on both client 200 and server 250, making it possible to exchange data through a communication network.
  • the protocol stack receives, from application 205, a message to be sent through the network, for example message 140.
  • the message is received from the network and, as illustrated, the received message is processed at transport level 275 and then transmitted up to application 255 through the protocol stack that comprises several layers.
  • the protocol stack contains the application, denoted 205, at the top level.
  • this can be a web application, e.g. a client part running in a web browser.
  • the application is a media streaming application, for example using DASH protocol, to stream media data encapsulated according to ISO Base Media File Format.
  • Underneath is an HTTP layer denoted 210, which implements the HTTP protocol semantics, providing an API (application programming interface) for the application to send and receive messages.
  • Underneath is a transport adaptation layer (TA layer or TAL).
  • the TAL may be divided into two sublayers: a stream sublayer denoted 215 (TAL-stream, TA Stream sublayer, or TAS sublayer) and a packet sublayer denoted 220 (TAL-packet, TA Packet sublayer, or TAP sublayer), depending on whether the transport layer manipulates streams and packets or only packets.
  • the protocol stack contains the same layers.
  • the top level application denoted 255
  • the top level application may be the server part running in a web server.
  • the HTTP layer denoted 260, the TAS sublayer denoted 265, the TAP sublayer denoted 270, and the UDP layer denoted 275 are the counterparts of the layers 210, 215, 220, and 225, respectively.
  • an item of information to be exchanged between the client and the server is obtained at a given level at the client’s end. It is transmitted through all the lower layers down to the network, is physically sent through the network to the server, and is transmitted through all the lower layers at the server’s end up to the same level as the initial level at the client’s end.
  • an item of information obtained at the HTTP layer from the application layer is encapsulated in an HTTP message. This HTTP message is then transmitted to TA stream sublayer 215, which transmits it to TA Packet sublayer 220, and so on down to the physical network.
  • the HTTP message is received from the physical network and transmitted to TA Packet sublayer 270, through TA Stream sublayer 265, up to HTTP layer 260, which decodes it to retrieve the item of information so as to provide it to application 255.
  • a message is generated at any level, transmitted through the network, and received by the server at the same level. From this point of view, all the lower layers are an abstraction that makes it possible to transmit a message from a client to a server. This logical point of view is adopted below.
  • the transport adaptation layer is a transport protocol built on top of UDP and targeted at transporting HTTP messages.
  • TAS sublayer provides streams that are bi-directional logical channels.
  • a stream is used to transport a request from the client to the server and the corresponding response from the server back to the client.
  • a TA stream is used for each pair of request and response.
  • one TA stream associated with a request and response exchange is dedicated to carrying the request body and the response body.
  • header fields of the HTTP requests and responses are carried by a specific TA stream. These header fields may be encoded using HPACK when the version of HTTP in use is HTTP/2 (HPACK is a compression format for efficiently representing HTTP header fields, to be used in HTTP/2).
  • data may be split into TA frames.
  • One or more TA frames may be encapsulated into a TA packet which may itself be encapsulated into a UDP packet to be transferred between the client and the server.
  • the STREAM frames carry data corresponding to TA streams
  • the ACK frames carry control information about received TA packets
  • other frames are used for controlling the TA connection.
  • there may be different types of TA packets, one of those being used to carry TA frames.
  • Figure 3 illustrates an example of a client-server system wherein embodiments of the invention may be implemented. It is to be noted that the implementation of the invention is not limited to such a system as it may concern the generation of media files that may be distributed in any way, not only by streaming over a communication network but also for local storage and rendering by a media player.
  • the system comprises, at the server’s end, media encoders 300, in particular a video encoder, a media packager 310 to encapsulate data, and a media server 320.
  • media packager 310 comprises a NALU (NAL Unit) parser 311, a memory 312, and an ISOBMFF writer 313. It is to be noted that the media packager 310 may use a file format other than ISOBMFF.
  • the media server 320 can generate a manifest file (also known as a media presentation description (MPD) file) 321 and media segments 322.
  • the system further comprises media client 350 having ISOBMFF parser 352, media decoders 353, in particular a video decoder, a display 354, and an HTTP client 351 that supports adaptive HTTP streaming, in particular parsing of streaming manifest, denoted 359, to control the streaming of media segments 390.
  • media client 350 further contains transformation module 355 which is a module capable of performing operations on encoded bit-streams (e.g. concatenation) and/or decoded picture (e.g. post-filtering, cropping, etc.).
  • media client 350 requests manifest file 321 in order to get the description of the different media representations available on media server 320, that compose a media presentation.
  • media client 350 requests the media segments (denoted 322) it is interested in. These requests are made via HTTP module 351.
  • the received media segments are then parsed by ISOBMFF parser 352, decoded by video decoder 353, and optionally transformed or post-processed in transformation unit 355, to be played on display 354.
  • a video sequence is typically encoded by a video encoder of media encoders 300, for example a video encoder of the H.264/AVC or H.265/HEVC type.
  • the resulting bit- stream is encapsulated into one or several files by media packager 310 and the generated files are made available to clients by media server 320.
  • the system further comprises an ordering unit 330 that may be part of the media packager or not.
  • the ordering unit aims at defining the order of the samples so as to optimize the transmission of a fragment.
  • Such an order may be defined automatically, for example based on a priority level associated with each sample, that may correspond to the decoding order.
  • the media server is optional in the sense that embodiments of the invention mainly deal with the description of encapsulated media files in order to provide information about data significance of encapsulated media data of the encapsulated media file, so that the encapsulated media data may be handled appropriately when they are transmitted and/or when they are received.
  • the transmission part (HTTP module and manifest parser) is optional in the sense that embodiments of the invention also apply for a media client consisting of a simple media player to which the encapsulated media file with its description is provided for rendering.
  • the media file can be provided by full download, by progressive download, by adaptive streaming or just by reading the media file on a disk or from a memory.
  • ordering of the samples can be done by a media packager such as media packager module 310 in Figure 3 and more specifically by ISOBMFF writer module 313 in cooperation with ordering unit 330, comprising software code, when executed by a microprocessor such as CPU 804 of the server apparatus illustrated in Figure 8.
  • the encapsulation module is in charge of reading the high-level syntax of the encoded timed media data bit-stream, e.g. composed of compressed video, audio or metadata, to extract and identify the different elementary units of the bit-stream (e.g. NALUs from a video bit-stream), and of organizing the encoded data in an ISOBMFF file or ISOBMFF segments 322 containing the encoded video bit-stream as one or more tracks, wherein the samples are ordered properly, with descriptive metadata according to the ISOBMFF box hierarchy.
  • Another example of encapsulation format can be the Partial File Format as defined in ISO/IEC 23001-14.

Signaling sample reordering using ‘trun’ box
  • Figure 4 illustrates an example of processing carried out in media server 400 and in media client 450, according to embodiments.
  • a video stream is encoded in a video encoder (step 405), that may be similar to media encoder 300 in Figure 3.
  • the encoded video stream is provided to a packager, that may be similar to ISOBMFF writer 313 in Figure 3, to be encapsulated into a media file or into media segments (step 410).
  • the encapsulating step comprises a reordering step (step 415) during which the samples are reordered according to the needs of an application.
  • the encapsulated stream, wherein the samples are ordered according to the second sample order, may be stored in a repository or in a server for later or live transmission (step 420), along with descriptive metadata allowing reorganization of the samples according to the first sample order.
  • the transmission may use reliable protocols like HTTP or unreliable protocols like QUIC or RTP.
  • the transmission may be segment-based or chunk-based depending on the latency requirements.
  • the encoder and the packager may be implemented in the same device or in different devices. They may operate in real-time or with a low delay.
  • the packager re-encapsulates an already encapsulated file to change the encapsulation order of the samples, so as to fit with application needs or re-encapsulates the samples just before their transmission over a communication network.
  • Encapsulating step 410 comprising reordering step 415, aims at placing the encoded video stream into a data part of the file and at generating description metadata providing information on the track(s) as well as description about the samples.
  • the video stream may be encapsulated with other media resources or metadata, following the same principle: sample data is put in the data part (e.g. ‘mdat’ box) of the media file or media segment and descriptive metadata (e.g. ‘moov’, ‘trak’, ‘moof’, or ‘traf’ boxes) are generated to describe how the sample data are actually organized within the data part.
  • Reordering step 415 reorders samples that are received in a first order, for example the order defined by the encoder, and reorganizes these samples according to a second order that is more convenient. For the sake of illustration, such convenience may be directed to the storage (wherein all the Intra are stored together, for example), the transmission (all the reference frames or all the base layers are transmitted first), or the processing (e.g. encryption or forward error correction) of these samples.
  • the second order may be defined according to a priority level or an importance order.
  • encapsulating step 410 may receive ordering information or a priority map from a user controlling the encapsulation process through a user interface, for example via the ordering unit 330 in Figure 3.
  • Such a priority map or ordering information may also be obtained from a video analytics module running on server 400 and analyzing the video stream. It may determine the relative importance of the video frames by carrying out a deep analysis of the video stream (e.g. by using NALU parser 311 in Figure 3) or by inspecting the high level syntax coming with the encoded video stream. It may determine the relative importance between the video frames because it is aware of the encoding parameters and configurations.
  • the packager may use a priority map accompanying this media file or even a priority map embedded within the media file, for example as a dedicated box.
  • a priority map provides relative importance information on the video samples or on the corresponding byte ranges of these samples.
  • the packager may use information in the sample_flags parameter, when it is present, to obtain information on dependencies (e.g. sample_is_depended_on) and degradation priority.
  • When sample_is_depended_on is equal to 1 or when sample_is_non_sync_sample is equal to 0, the sample is considered as having high priority and may be stored at the beginning of the media data for the fragment.
  • it may use information from the SampleDependencyTypeBox or Degradation PriorityBox.
  • When sample_has_redundancy is equal to 1, the sample is considered to have low priority and may rather be stored at the end of the media data for the fragment.
  • the sample_flags may be removed from the sample description to compact the fragment description.
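As an illustration of the heuristics above, a minimal sketch (not part of the patent; the field names follow the ISO/IEC 14496-12 sample flags, but the two-level policy and the "normal" default are illustrative assumptions) deriving a coarse priority from the dependency flags might look like:

```python
# Hypothetical sketch: deriving a coarse sample priority from
# ISOBMFF sample_flags-style fields.  Field names follow ISO/IEC
# 14496-12; the classification policy mirrors the heuristics above.
def sample_priority(sample_is_depended_on, sample_is_non_sync_sample,
                    sample_has_redundancy):
    """Return 'high' for samples to store first in the fragment."""
    if sample_is_depended_on == 1 or sample_is_non_sync_sample == 0:
        # Other samples depend on it, or it is a sync sample.
        return "high"
    if sample_has_redundancy == 1:
        # Redundant coding: may rather go at the end of the fragment.
        return "low"
    return "normal"

print(sample_priority(1, 1, 0))  # high
print(sample_priority(0, 1, 1))  # low
```

A packager could then bucket samples per returned priority before writing them to the ‘mdat’ box.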
  • the packager inserts ordering information within the descriptive metadata describing the samples in terms of byte position (data_offset), length (sample_size), duration, and composition time (for example a ‘trun’ box).
  • ordering information comprises an index of the samples in the data part, according to the second sample order. An example of reordering is described by reference to Figure 5b.
  • client 450 reads or receives an encapsulated stream (step 455), wherein the samples are ordered according to the second sample order.
  • the encapsulated stream is parsed (step 460), the parsing (or decapsulation) step comprising a reordering step (step 465) to reorganize the samples according to the first sample order so that the de-encapsulated stream can be decoded (step 470).
  • Figure 5a illustrates an example of dependencies of video frames, that are to be taken into account for coding or decoding a frame.
  • each video frame is represented with a letter and one or more digits where the letter represents the frame coding type and the digits represent the composition time of the video frame.
  • This frame organization is the classical B-hierarchical scheme from MPEG video compression codecs like HEVC. It is to be noted that it may be used for different types of I/P/B frames and prediction patterns.
  • the arrows between two video frames indicate that the frame at the start of the arrow is used to predict the frame at the end of the arrow.
  • frame B3 depends on frames B2 and B4
  • frame B2 depends on frames I0 and B4
  • frame B4 depends on frames I0 and P8.
  • frame B3 can be decoded only after frames I0, P8, B2, and B4, it being noted that frame B4 can be decoded only after frames I0 and P8 have been decoded and frame P8 can be decoded only after frame I0.
  • Figure 5a also illustrates the layers or priority levels of each frame.
  • Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps.
  • the samples correspond to the frames illustrated in Figure 5a.
  • the samples are represented using the reference of the frames but it is to be understood that the samples in the streams are a sequence of bits, without explicit references to the frames.
  • the samples are ordered according to the position of the frames in the video stream. For example, the sample corresponding to frame I0 is located before the sample corresponding to frame B1 that is located before the sample corresponding to frame B2 because frame I0 should be displayed before frame B1 that should be displayed before frame B2 and so on.
  • the order of the encoded frames preferably depends on the dependencies of the frames. For example, the sample corresponding to frame B4 should be received before the sample corresponding to frame B2, although frame B4 is displayed after frame B2, because frame B4 is needed to decode frame B2.
  • the samples corresponding to the encoded frames are preferably ordered as a function of the dependencies of the frames, as illustrated with reference 505 that represents the encoded video streams.
  • the decoding order corresponds to the sample organization in encapsulated files or segments (for example in CMAF or ISOBMFF).
  • This sample order provides a compliant bit-stream for video decoders. Changing the sample order without any indication in the descriptive metadata may lead to non-compliant bit-streams after parsing and may crash video decoders.
  • the order of the encoded samples is advantageously modified, for example to make it possible to send the most important samples first.
  • An example of such sample reordering is illustrated with reference 510 in Figure 5b.
  • the samples are reordered according to the layers or priority levels (the frames corresponding to the first layer or to the first priority level are transmitted first, then the frames corresponding to the second layer or to the second priority level are transmitted and so on).
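The layer-based reordering described above can be sketched as a stable sort on a per-frame layer assignment (a sketch, not the patent's implementation; frame names follow Figure 5a and the layer mapping is an assumed B-hierarchical assignment):

```python
# Sketch: reorder samples, given in decoding order, so that lower
# layers (higher priority) are transmitted first.  The layer mapping
# is an assumed assignment for the B-hierarchical GOP of Figure 5a.
decoding_order = ["I0", "P8", "B4", "B2", "B1", "B3", "B6", "B5", "B7"]
layer = {"I0": 0, "P8": 0, "B4": 1, "B2": 2, "B6": 2,
         "B1": 3, "B3": 3, "B5": 3, "B7": 3}

# Stable sort: within a layer, the decoding order is preserved.
storage_order = sorted(decoding_order, key=lambda s: layer[s])
print(storage_order)
# ['I0', 'P8', 'B4', 'B2', 'B6', 'B1', 'B3', 'B5', 'B7']
```

This is the second sample order of reference 510: all level-0 frames first, then level 1, and so on.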
  • the parser When encoded samples are reordered, the parser must be aware of the modified encoded stream so as to make sure that the output of the parser is a bit-stream compliant with the decoder.
  • the indication of the order change is included in the descriptive metadata of the encapsulated file or segment.
  • such an indication may comprise a list of indexes of the position of the sample in the encapsulated stream (‘mdat’ box).
  • the parser may determine that the 10th sample of the encoded stream corresponds to the 3rd sample in the encapsulated stream (that is to say to location ‘2’ if locations are indexed from 0). Therefore, by using the list of indexes, the parser may reconstruct the encoded stream from the encapsulated stream.
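This reconstruction can be sketched as a simple gather over the list of indexes (a sketch with illustrative values; the index list assumed here maps each decoding-order position to a position in the ‘mdat’ box):

```python
# Sketch: reconstructing the decoding order from a list of indexes
# that gives, for the i-th sample in decoding order, its position in
# the encapsulated stream ('mdat').  Values are illustrative.
mdat_samples = ["I0", "P8", "B4", "B2", "B6", "B1", "B3", "B5", "B7"]
index_list   = [0, 1, 2, 3, 5, 6, 4, 7, 8]  # decoding index -> mdat position

decoded_stream = [mdat_samples[pos] for pos in index_list]
print(decoded_stream)
# ['I0', 'P8', 'B4', 'B2', 'B1', 'B3', 'B6', 'B5', 'B7']
```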
  • the indexes of the list of indexes correspond to the location (the order) of the samples of the encapsulated stream in the encoded stream.
  • the sample description follows the order of the encapsulated stream 510 which requires the parser to reorder the samples data before transmission to the decoder.
  • This reordering uses the list of indexes providing the location (order) of the samples in the encoded stream.
  • the packager may include an indication at file level (e.g. in the ‘ftyp’ box) so that the parsers are able to identify which order (encapsulation or encoding) indexes are provided in the list of indexes 515.
  • the identification may be for example a brand or a compatible brand.
  • Such a brand indicates reordered samples and the actual order used by the packager.
  • a brand ‘reor’ (this four character code is just an example, any reserved code not conflicting with other four character codes already in use can be used) indicates that samples are reordered.
  • a brand ‘reo1’ indicates presence of reordering as in 515 (from decoding order to encapsulation order) while, still for example, a brand ‘reo2’ indicates presence of reordering from encapsulation order 510 to decoding order 505.
  • This indication that encapsulation has been done with reordering may alternatively be included in the box indicating presence of movie fragments (e.g. the ‘mvex’ or ‘mehd’ box).
  • it may be an additional field (for example reordering_type) in new versions of the ‘mehd’ box:

aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
  • a reordering type value set to 0 indicates that there is no reordering.
  • When the reordering type value is set to 1, there is reordering with a list of indexes providing a mapping from decoding order 505 to encapsulation order 510 (as 515), and when it is set to 2, there is reordering with a list of indexes providing a mapping from encapsulation order 510 to decoding order 505.
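The two mapping directions are inverses of each other, so a parser holding one list can derive the other. A sketch with illustrative values (not from the patent):

```python
# Sketch: the two mapping directions described for the reordering type.
# A type-1 list maps decoding order to encapsulation order; inverting
# it yields the type-2 list (encapsulation order to decoding order).
dec_to_enc = [0, 1, 2, 3, 5, 6, 4, 7, 8]   # reordering type == 1

enc_to_dec = [0] * len(dec_to_enc)         # reordering type == 2
for dec_idx, enc_idx in enumerate(dec_to_enc):
    enc_to_dec[enc_idx] = dec_idx

print(enc_to_dec)  # [0, 1, 2, 3, 6, 4, 5, 7, 8]
```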
  • Other values are reserved.
  • the packager may decide to reorder only some fragments.
  • an encapsulated media file or segment may contain fragments with reordering and fragments without reordering.
  • the list of indexes is stored in a ‘trun’ box (e.g. a standard ‘trun’ box, a compact ‘trun’ box, or a ‘trun’ box relying on patterns) when the media file is fragmented.
  • the offset of the first byte for this sample in the ‘mdat’ box is computed from the corresponding index (by multiplying the value of the index by the size of a sample when sample_size is constant across all samples of the run).
  • the size of the sample is provided in the ‘trun’ box through the sample_size field.
  • the sample duration and sample composition time offset may also be provided.
  • Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream. These steps may be carried out in a server, for example in a writer or a packager.
  • a first step is directed to initializing the packaging module.
  • the packaging may be set to a particular reordering configuration by a user or through parameter settings during this step.
  • An item of information provided during this step is whether the encapsulation is to be done within fragments or not. When the encapsulation is to be done within fragments, the fragment duration is obtained.
  • These items of information are useful for the packaging module to prepare the box structure, comprising or not the ‘moof’ and ‘traf’ boxes. When samples are reordered, the packaging module sets the dedicated flags value in the ‘trun’ box with the appropriate value, as described hereafter.
  • the packaging module receives the number of priority levels to consider and may allocate one sample list per level, except for the first level (for which samples will be directly written in the ‘mdat’ box). Such an allocation may be made, for example, in memory 312 (in Figure 3) of the server.
  • the packaging module then initializes two indexes denoted index_first and index_second, respectively providing the sample index in the first order and the sample index in the second order. According to the given example, both indexes are initialized to zero.
  • a mapping table (as illustrated in Figure 6b) for reordering the samples is also allocated with its size corresponding to the number of samples expected per fragment (e.g. the sample_count related to the fragment duration) multiplied by one plus the number of levels.
  • mapping table 690 contains one table per level (denoted 692, 693, 694) and a mapping list (denoted 691) indexed by the sample index in the first order and providing access to the appropriate table per level (reference 692 or 693 or 694), given the sample level (e.g. the i-th sample with level l is stored in mapping_table[l][i]).
  • Each cell of a table per level can store the sample description (e.g. reference 695) and the sample data (e.g. reference 696). It is to be noted that, to allocate less memory, mapping table 690 may store the sample descriptions in mapping list 691 in addition to the level indication (instead of in one of the 695 cells).
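The mapping table of Figure 6b can be sketched as follows (a simplified, illustrative layout, not the patent's data structure; payloads are placeholder byte strings):

```python
# Sketch of the Figure 6b mapping table: one list per level plus a
# mapping list indexed by the sample index in the first order.
num_levels = 3
mapping_table = {level: [] for level in range(num_levels)}
mapping_list = []  # first-order index -> (level, position in level list)

def store_sample(level, description, data):
    """Buffer one sample in the table of its level (steps 615-625)."""
    mapping_table[level].append((description, data))
    mapping_list.append((level, len(mapping_table[level]) - 1))

store_sample(0, {"size": 100}, b"I0")
store_sample(2, {"size": 40},  b"B1")
store_sample(1, {"size": 60},  b"B2")

# The i-th sample in the first order is found via the mapping list:
level, pos = mapping_list[1]
print(mapping_table[level][pos][1])  # b'B1'
```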
  • the packaging module reads the next sample to be processed (step 605), corresponding, at the beginning, to the first sample of the encoded stream.
  • index index_first is incremented by one (step 610).
  • the corresponding list (692, 693 or 694) in mapping_table[sample_level] is then selected at step 615.
  • the sample description is computed and stored in the mapping table (one of 695) at step 620.
  • the sample description may be, depending on the values of the ‘trun’ box’s flags, “tr_flags”, the position of the first byte for the sample, the duration of the sample, the size of the sample, or the value of index index_second.
  • a composition_time_offset may also be provided.
  • the index index_second is initially set to zero.
  • This process iterates, by reading another sample from the encoded stream, until the end of the fragment is reached (i.e. as long as the result of test 640 is false).
  • When the end of the fragment is reached (i.e. the result of test 640 is true), the packager flushes the data buffered in mapping table 690.
  • This mapping table is used for reordering samples.
  • index index_second for the stored samples is not known and still set to zero.
  • the packaging module starts flushing the data buffered during steps 620 and 625 (step 645) from a lower level (high priority or most important samples) to a higher level (low priority or less important samples) as follows: list table 691 is read and only the samples pertaining to the current level are candidates for flushing (step 650).
  • sample data read from one of 696
  • sample description read from one of 695
  • descriptive metadata part, for example in the ‘trun’ box or in any box describing the run of samples.
  • the sample description contains, in addition to usual parameters (e.g. sample_duration, sample_size, etc.), the reordering information (or interleaving index) that is set to the current value of index index_second maintained by the packaging module (actually the position of the last written sample in the data part plus one).
  • the interleaving index shall be unique within the ‘trun’ box.
  • the index index_second is incremented by one. The packager iterates over the levels (test 660).
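The flush loop (steps 645 to 660) can be sketched as follows, with an illustrative set of buffered per-level sample lists (a sketch, not the patent's implementation):

```python
# Sketch of the flush: levels are written from the lowest (most
# important) to the highest; index_second records the position of
# each flushed sample in the data part ('mdat').
buffered = {0: ["I0", "P8"], 1: ["B4"], 2: ["B2", "B6"],
            3: ["B1", "B3", "B5", "B7"]}  # illustrative per-level lists

mdat = []
interleave = {}  # sample -> interleaving index (value of index_second)
index_second = 0
for level in sorted(buffered):          # lower level flushed first
    for sample in buffered[level]:      # decoding order kept per level
        mdat.append(sample)
        interleave[sample] = index_second
        index_second += 1

print(mdat[:4], interleave["B1"])
# ['I0', 'P8', 'B4', 'B2'] 5
```

Each interleaving index is assigned exactly once, satisfying the uniqueness constraint on the ‘trun’ box.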
  • the ‘trun’ box contains reordering for the samples of the fragment.
  • the packaging module finalizes the encapsulation of the fragment that is stored or transmitted, depending on the application.
  • Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream into an encoded video stream. These steps are carried out in a client, at the reader end, for example in a parser.
  • a first step aims at receiving one encapsulated fragment (step 700).
  • the parser reads the descriptive metadata part (step 705), for example from the ‘moof’ box and its sub-boxes (e.g. ‘traf’ and ‘trun’ boxes). Using this information, the parser determines whether reordering has to be performed or not (step 710). This may be done, for example, by checking the “tr_flags”.
  • the run of samples can be extracted from the ‘mdat’ box (step 715), sample after sample, from the first byte offset to the last byte, on a standard basis.
  • the last byte position is computed by cumulating the sample sizes from the first sample to the last sample of the fragment as indicated by “sample_count” in the ‘trun’ box.
  • the sample description is read (step 720) so as to read each sample index in the expected output order (step 725).
  • reordering information is read by reading the sample reordering information inserted in the ‘trun’ box (for example in a standard ‘trun’ box, in a compact ‘trun’ box, or in a ‘trun’ box relying on patterns, as explained below).
  • the encapsulation step assigns an interleaving index k, in the range [0, trun::sample_count - 1], to each sample.
  • the trun has its data offset set on a standard basis (e.g. beginning of the ‘moof’ box, the default_data_offset potentially present in the track fragment header box, or the data_offset provided in the trun) and the sample_data_offset, relative to this data offset, for the current sample with interleave index k is computed (step 730) as follows:
  • this provides, for each sample in the second order (encapsulation order), the index of this same sample in the first order (e.g. decoding order). These indexes are computed sample after sample by the parser during steps 725 and 730.
  • the parser can extract the number of bytes corresponding to sample_size and provide it to the decoder.
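The offset computation of step 730 can be sketched as follows (a sketch with illustrative sizes and interleaving indexes; the offset of the sample with interleaving index k is the sum of the sizes of all samples whose index is strictly less than k):

```python
# Sketch of step 730: sample_data_offset, relative to the trun data
# offset, for the sample with interleaving index k.
def sample_data_offset(k, interleave_index, sample_size):
    """Sum the sizes of all samples stored before index k in 'mdat'."""
    return sum(size for idx, size in zip(interleave_index, sample_size)
               if idx < k)

# Samples listed in decoding order, with illustrative values.
sizes   = [100, 80, 60, 40, 30]
indexes = [0, 1, 3, 2, 4]   # interleaving index per sample

print(sample_data_offset(3, indexes, sizes))  # 100 + 80 + 40 = 220
```

The parser then extracts sample_size bytes starting at this offset and forwards them to the decoder in decoding order.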
  • the parser iterates over all the samples (step 735) until the end of the fragment. The process continues until there is no fragment left to process.
  • the extracted video stream can be processed by the video decoder.
  • the reordered fragments are described within the standard ‘trun’ box.
  • the standard ‘trun’ box is modified as follows. First, a new flag value for the ‘trun’ box is defined. For example, the value 0x100000 is reserved with the name SAMPLE_INTERLEAVE_BIT to indicate that sample data are stored in a different order than the decoding order. It is to be noted that the flag values here are provided as examples; any reserved value and name may be used provided that it does not conflict with other reserved flags values for the ‘trun’ box.
  • the trun contains an additional parameter. This can be handled as a new version of the‘trun’ box, as follows (in bold):
  • class TrackRunBox extends FullBox('trun', version, tr_flags) {
  • sample_interleave_index indicates the order of sample interleaving in the ‘trun’ box.
  • a value of 0 indicates that the sample data start at the trun data offset.
  • a value of K>0 indicates that the sample data start at the trun data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same ‘trun’ box.
• the reordered fragments are described with the compact 'trun' box.
• the compact 'trun' box is modified as follows (in bold): aligned(8) class CompactTrackRunBox
• sample_size [ sample_count - (first_sample_info_present ? 1 : 0) ];
• sample_interleave_index indicates the order of sample interleaving in the trun.
  • a value of 0 indicates that the sample data start at the trun data offset.
• a value of K>0 indicates that the sample data start at the trun data offset plus the sum of the size of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same trun. The semantics of the other parameters of the compact 'trun' box remain unchanged.
• the sample_interleave_index provides, for each sample in the run of samples, the index of the position of this sample in the media data part of the file or the segment (e.g. in the 'mdat' box).
• the 32-bit sample count can be moved into a 30-bit field.
• the sample count is still encoded on 32 bits and 8 more bits can be allocated to store the interleave_index_size in the proposed box.
  • the parser may interpret it as meaning there is no need to apply a reordering step.
• the parser has to read it from the descriptive metadata and to apply a reordering step to make sure to produce a compliant bit stream for the video decoder.
• the difference between the sample positions in the first and second orders is provided in the sample_interleave_index field, instead of providing the sample offset in the second order.
  • a delta from the decoding order may be used instead of a sample index.
• the final layout after reordering (wherein, in the notation B(c, t, s, d), c is the composition order, t is the temporal layer, d is the decoding order, and s is the storage order in the reordered trun) may be:
• This mode may be advantageous for small reorderings, i.e. when sample positions in the second order (i.e. storage order) are not too far from those in the first order (i.e. the decoding order). It may not be suitable when samples from the end of the GOP are reordered at the beginning of the run or when samples at the beginning of the GOP are reordered at the end of the run.
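The delta variant above can be illustrated with a small sketch (hypothetical helper names); for small reorderings the deltas stay close to zero and thus need few bits:

```python
def indices_to_deltas(interleave_indices):
    """Variant above: signal, for each sample (in decoding order),
    the signed difference between its storage position and its
    decoding position, instead of the absolute storage index."""
    return [idx - pos for pos, idx in enumerate(interleave_indices)]

def deltas_to_indices(deltas):
    """Parser side: recover the absolute storage index from the delta."""
    return [pos + d for pos, d in enumerate(deltas)]

indices = [2, 0, 1, 3]            # storage order of 4 decoded samples
deltas = indices_to_deltas(indices)
print(deltas)                      # [2, -1, -1, 0] — small values for small reorderings
assert deltas_to_indices(deltas) == indices
```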
• the sample_interleave_index may be coded as a pattern identifier, as done in compact sample groups.
• For an example layout of samples interleaved across several 'trun' boxes, a pattern description may be the following:
  • the patterns may be encoded with 36 bits (2+6 + 4 * 2+6 + 3 * 2+6) and the values may be encoded with 64 bits (8x8), hence 13 bytes total.
  • the patterns may be encoded with 88 bits (8+8 + 4 * 8+8 + 3 * 8+8) and the values may be encoded with 64 bits (8x8), hence 19 bytes total.
• it may be advantageous to use the pattern approach since patterns of temporal levels are expected to repeat during the GOP. This requires the encoding of the pattern list within the reordered track run box (compact or not).
  • the pattern list may be declared in a companion box of the track run box. This allows mutualization of the pattern declarations, across segments of different tracks with reordered fragments or across segments of a single track with reordered fragments.
  • all the information describing the reordering comes in a new box associated to the trun box.
• a variant for the indication of the number of bits in use for the sample_count and the sample_interleave_index consists in using a single flag value to indicate whether reordering is active or not. Then, the number of bits in use for the sample_interleave_index is the same as the number of bits in use for the sample_count field. For the latter, considering that optimized or compact trun boxes are dedicated to short fragments, using 16 bits instead of 32 bits is more relevant. For some very short fragments, sample_count could even be encoded on 8 bits. Then, another flag value is dedicated to indicate whether the sample_count in a compact trun box is encoded on 16 or 8 bits.
• the inventors realized that when the sample description uses a 'trun' box relying on patterns, the interleaving information could be added at no cost in such a 'trun' box. This is achieved by modifying the proposed TrackRunPatternStruct and adding a new flag to the TrackRunPatternBox. For example, the box containing the declaration of the patterns defines the following flag value:
• bit(4) reserved = 0;
• sample_interleave_index indicates the order of sample interleaving in the 'trun' box.
• a value of 0 indicates that the sample data start at the 'trun' data offset.
• a value of K>0 indicates that the sample data start at the 'trun' data offset plus the sum of the size of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same 'trun' box.
• This design has the advantage of not using any bit in the 'trun' box relying on patterns to indicate the sample interleaving or reordering. For dynamic use cases (varying GOP structure), either a new template could be updated or the compact 'trun' could be used.
• - data_offset: the number of bits for the description of the data_offset for a run of samples. According to embodiments, it is coded on a number of bits lower than the 32 bits used in the standard 'trun' box. Indeed, the inventors have observed that both the pattern 'trun' box and the compact 'trun' box designs still use 32 bits for the data offset, which is very large in DASH/CMAF cases since the base offset is the start of the 'moof' box. In most cases, 16 bits is more than enough. Therefore, according to embodiments, the possibility is given to use a smaller field, signaled through a new 'trun' flags value.
• the value 0x100000 may be defined and reserved for a flag value denoted "DATA_OFFSET_16" that indicates, when set, that the data offset field is coded on a short number of bits (e.g., 16 bits). This flags value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags of the TrackFragmentHeaderBox is not set.
• a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
• the fragment comprising a set of continuous sample(s) of the timed media data
• the metadata comprising structured metadata items (e.g., boxes),
• a metadata item (e.g., the trun box) comprises a flag (e.g., DATA_OFFSET_16) indicating whether a data offset is coded on a predetermined size or not; and,
• - sample_count: the number of samples for a run of samples. According to embodiments, it is coded on a smaller number of bits since the fragments, and thus the runs of samples, are becoming smaller (fewer samples) for low latency purposes. This is for example the case of CMAF fragments that can correspond to 0.5 second of video, i.e. "only" 15 samples for a video at 30 Hertz or 30 samples for a video at 60 Hertz. Using 32 bits is not so efficient since in most cases a lower number of bits, such as 8 or 16 bits, would be sufficient.
• the sample count field is of variable or configurable size. Moreover, the sample count from one fragment to another may remain the same. This may be the case when the GOP structure used by the video encoder is constant over time.
• a default number of sample count can be defined and overloaded when necessary. Since this item of information may be useful for the whole file, it is set by the encapsulation or packaging module in a box of the initialization segment. It can for example be inserted in a new version of the TrackExtendsBox 'trex' or in any box dedicated to the storage of default values used by the movie fragments:
• class TrackExtendsBox extends FullBox('trex', version, 0) {
• This new 'trex' box may be used with the compact 'trun' box or the 'trun' box relying on patterns.
• the absence of sample_count information in the description of a run of samples may be indicated by a dedicated flags value of the 'trun' box.
• a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
• the fragment comprising a set of continuous sample(s) of the timed media data
• the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the trun pattern box) comprising a configurable parameter having a configurable size,
• the metadata comprises an indication information (e.g., SAMPLE_COUNT_PRESENT) indicating whether a sample count field is present or not; and,
• the box structures describing the fragments, especially the runs of samples, do not comprise flag values indicating the presence or absence of some parameters. Instead, an exhaustive list of default values is defined for any parameter describing the run of samples in a fragment. It can be done at the file level (e.g. 'trex' box or equivalent) when applying to the whole file. We call this variant the "exhaustive default values mode".
  • the default value may be overloaded for a given fragment or even only for a given sample in the run of samples.
• before reading the next parameter, the parser has to check the presence or absence of the sample_size in the 'trun' box. This requires checking whether the 'trun' box has a predetermined value such as 0x000200 (for indication of sample-size-present) set in its flags. If not, the parser has to further check whether the track fragment header box contains a default value for the sample size. Then, depending on the results of these tests, the parser may have to interpret a parameter in the 'trun' as the sample_size parameter for the current sample. This exhaustive default value mode with the variable number of bits between 0 and 32 for each parameter avoids carrying out these tests. By default, when parsing a sample, the default values are set.
• the parser, informed at the beginning of the trun parsing of the number of bits in use for each parameter, is able to determine how many bits to parse for a given parameter. This makes the description of runs of samples simpler to parse and even more efficient. This variant may be used with the compact 'trun' box or with 'trun' boxes relying on patterns as described later in this invention.
• the compact 'trun' box is used to describe the runs of samples within media fragments and is further improved by the optimizations discussed above.
• the structure of the compact 'trun' box is then modified as follows:
• class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
• sample_size [ sample_count - (first_sample_info_present ? 1 : 0) ]; unsigned int( f(flags_size_index) )
  • the data_offset may also be provided with a variable number of bits as an alternative to the flags value DATA OFFSET 16.
  • the description of the samples within the fragments i.e. the runs of samples, is further optimized as described below.
• the description of the first sample uses 32 bits to encode the size and the sample_flags values for the first sample in the run of samples. This could be further optimized by using a variable number of bits for these items of information.
• the packaging module determines the required number of bits and sets the actual value in use to describe the first sample inside the compact 'trun' box.
• the compact 'trun' box is then modified as follows (with the semantics unchanged):
• class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
• the compact 'trun' box uses the "exhaustive default values" mode.
• The 'trex' box exhaustively defines default values for each field or parameter describing a run of samples.
• the 2-bit fields for size_index are exhaustive, i.e. all the fields in the compact 'trun' box come with an indication of the number of bits used to encode them. This makes parsing simpler by avoiding some tests carried out on a sample basis.
• the CompactTrackRunBox is modified as follows:
• class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
• the interleave_index_size parameter becomes optional because it can be deduced as equal to the sample_count_size_index. This saves some flag values and avoids double declaration of a same value. This is because the signaling of the new sample positions after reordering, when coded as sample offsets, does not require more values than the number of samples (sample_count) declared in the track run.
• the compact 'trun' box uses the "exhaustive default values" mode and the sample description is provided in a loop on the samples (and not as a list of arrays as in the above): aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
• the compact 'trun' box uses the "exhaustive default values" mode with a variable number of bits to encode the parameters, but these numbers of bits, instead of being defined in dedicated 2-bit codes, are specified through the flags value of the compact 'trun' box. This is possible because with the "exhaustive default values" mode, we don't need flags anymore to indicate the presence or absence of the parameters. This saves 16 bits per trun box.
• the compact 'trun' box is then rewritten (with unchanged semantics, only the bit length computation changes) as follows:
• class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
• For a given parameter, a 2bits_flags_value is defined. The following function returns the actual number of bits in use for a given 2bits_flags_value.
• switch(2bits_flags_value | 00000011) { // | is the binary OR operator
  • the 2bits_flags_value may be defined:
• sample_count_2bit_flags_value is one value in [0x00, 0x01, 0x02, 0x03]
• data_offset_2bit_flags_value is one value in [0x00, 0x04, 0x08, 0x0C]
• first_sample_size_2bit_flags_value is one value in [0x00, 0x10, 0x20, 0x30]
• first_sample_flags_2bit_flags_value is one value in [0x00, 0x40, 0x80, 0xC0]
• sample_duration_2bit_flags_value is one value in [0x00, 0x100, 0x200, 0x300]
• sample_size_2bit_flags_value is one value in [0x00, 0x400, 0x800, 0xC00]
• sample_flags_2bit_flags_value is one value in [0x00, 0x1000, 0x2000, 0x3000]
• composition_time_2bit_flags_value is one value in the reserved value range [0x00, 0x4000, 0x8000, 0xC000]
• the 16-bit word formed by the tr_flags value provides the number of bits used to represent each parameter of the compact 'trun' box.
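Assuming the 2-bit codes occupy consecutive slots of the tr_flags word as the value lists above suggest, and assuming for illustration a 0/8/16/32 bit-width mapping (the exact mapping is given by the switch in the box definition), the decoding can be sketched as:

```python
def bits_for_code(code2):
    """Map a 2-bit size code to a field width in bits.
    The 0/8/16/32 mapping here is assumed for illustration only;
    the actual mapping is defined by the switch in the box syntax."""
    return {0b00: 0, 0b01: 8, 0b10: 16, 0b11: 32}[code2 & 0b11]

def field_widths(tr_flags):
    """Each parameter owns a 2-bit slot in the 16-bit tr_flags word
    (sample_count in bits 0-1, data_offset in bits 2-3, ...), matching
    the per-parameter 2bit_flags_value lists given above."""
    names = ["sample_count", "data_offset", "first_sample_size",
             "first_sample_flags", "sample_duration", "sample_size",
             "sample_flags", "composition_time"]
    return {name: bits_for_code((tr_flags >> (2 * i)) & 0b11)
            for i, name in enumerate(names)}

# sample_count on 16 bits (code 0b10), data_offset on 8 bits (code 0b01):
w = field_widths(0b0110)
print(w["sample_count"], w["data_offset"])  # 16 8
```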
  • the ssix box potentially indexing the samples described by this optimized or compact trun box may also benefit from the following optimizations (for example using a new version of the ssix box or some flags value):
  • the optimized trun box may use variable size to encode the sample count. There cannot be more ranges declared in the ssix box (range_count field) than the number of samples declared in the track run (sample_count field).
  • the version 1 of the ssix also contains flags indicating the number of bits to use for the range_count. This number of bits may be set equal to the number of bits used to encode the sample_count in the optimized trun box (for example the sample_count_size_index value).
• the ssix box may be modified as follows:
• SubsegmentIndexBox extends FullBox('ssix', version, flags) {
• sample_count16_bits is a reserved value for the flags of a compact trun box (a one-bit value, as a variant to other variants called "sample_count_size_index" on a two-bit value) and optionally of the new ssix box.
• This reserved value indicates, when set, that the sample_count field in the compact trun that is indexed by this new ssix box is coded on 16 bits. When not set, this sample_count field is encoded on 8 bits. When set, it can also be directly interpreted as meaning that the range_count of the ssix is described on 16 bits; when not set, it is encoded on 8 bits.
• when the compact trun has a one-bit value for the flags reserved for the sample_count16_bits (instead of a two-bit value like sample_count_size_index), the one bit saved in the flags field of the compact trun can then be used to indicate the presence or absence of the sample ordering information (for example the sample_interleave_bit), and the number of bits to use for sample_interleave_index is the same as the number of bits inferred from the sample_count16_bits flags.
• the sample_flags information may be compacted in the 'trun' box relying on patterns. This is useful as it enables storing sample_flags on 16 bits, getting rid of the sample_degradation_priority field that is not used by most (if not all) sequences.
• a new flag is introduced in the TrackRunPatternBox to adapt the number of bits used to represent the sample flags information:
• numBitsPatternIdx = (nbm1_pattern_index + 1) * 8;
• numBitsCTOffset = (nbm1_ct_offset + 1) * 8;
• numBitsSampleFlags = (nbm1_sample_flags + 1) * 8;
  • the track run pattern struct is adapted accordingly:
• a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
• the fragment comprising a set of continuous sample(s) of the timed media data
• the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the 'trun' pattern box) describing samples using patterns and comprising a configurable parameter (e.g., SAMPLE_FLAG) having a configurable size, wherein the configurable parameter provides characteristics (or properties) associated to a sample of the set of continuous samples; and,
• a new flag value is defined for the 'trun' box in pattern mode:
• class TrackRunBox extends FullBox('trun', version, tr_flags) {
• patIdx = pat_idx;
• InitSampleFlag = ((tr_flags & FIRST_SAMPLE_PRESENT) > 0);
• bit(8-numBitsInLastByte) reserved = 0;
• Another consideration regarding the GOP pattern is that when a video sequence uses a fixed GOP pattern, it is common that the first sample of the GOP (typically an IDR frame) has a much larger frame size than the other Predicted or Bi-directional frames. In the meantime, the other properties (sample flags, CT offset, duration) are usually the same from one sample to another.
• the current design of the 'trun' box relying on patterns makes provision for specific handling of the first sample in the 'trun'.
• the pattern structure enables a per-sample number of bits to encode the size, which can be used to handle the first sample of the 'trun' or of the GOP if there are multiple GOPs in the trun (i.e. the pattern is repeated).
• the pattern 'trun' is simplified by removing all the first sample items of information (and related flags). This is simply done by looping on all the samples in the run instead of starting on the second one (i.e. instead of having specific signaling for the first sample):
• class TrackRunBox extends FullBox('trun', version, tr_flags) {
• patIdx = pat_idx;
• bit(8-numBitsInLastByte) reserved = 0;
• the 'trun' box relying on patterns is optimized by using a variable bit length for coding the data offset, for example indicated by a specific flag value in the 'trun' box.
• a specific flag value, for example the value 0x100000 with the name "DATA_OFFSET_16", is reserved to indicate that, when it is set, the data offset is coded on 16 bits.
• This flag value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags of the TrackFragmentHeaderBox is not set.
• The 'trun' box comprising such an optimization is then rewritten as:
• class TrackRunBox extends FullBox('trun', version, tr_flags) {
• patIdx = pat_idx;
• patIdx = 0; unsigned int(numBitsSampleCount) sample_count_minus1;
• bit(8-numBitsInLastByte) reserved = 0;
• the pattern currently using sample_flags for all samples is modified by using a FIRST_SAMPLE_FLAGS in the pattern definition, to use a full 32 bits for the first sample of the pattern:
• The "exhaustive default values mode" variant can be used for 'trun' boxes relying on patterns; with the exhaustive list of default values, it can be defined in the pattern description.
• the pattern itself may use some of these default values and the TrackRunPatternBox is also modified to allow a null number of bits to support the absence of one parameter without checking any flags value:
• numBitsSampleCount = (nbm1_sample_count & 2) * 16 + (nbm1_sample_count & 1) * 8;
• numBitsCTOffset = (nbm1_ct_offset & 2) * 16 + (nbm1_ct_offset & 1) * 8;
• numBitsSampleSize = (nbm1_sample_size & 2) * 16 + (nbm1_sample_size & 1) * 8;
• numBitsSampleFlags = (nbm1_sample_flags & 2) * 16 + (nbm1_sample_flags & 1) * 8;
• numBitsDataOffset = (nbm1_data_offset & 2) * 16 + (nbm1_data_offset & 1) * 8;
• numBitsFirstSampleSize = (nbm1_first_sample_size & 2) * 16 + (nbm1_first_sample_size & 1) * 8;
• numBitsFirstSampleFlags = (nbm1_first_sample_flags & 2) * 16 + (nbm1_first_sample_flags & 1) * 8;
• the TrackRunPatternStruct can be modified as follows, allowing a parser to avoid tests on the presence or absence of parameters in the sample description:
• bit(4) reserved = 0;
• class TrackRunBox extends FullBox('trun', version, tr_flags) {
• patIdx = pat_idx;
• patIdx = 0; unsigned int(numBitsSampleCount) sample_count_minus1;
• bit(8-numBitsInLastByte) reserved = 0;
  • the number of bits in use to encode the parameters may be provided as a list of 2bits_flags_value.
• the file format may carry within the metadata for sample description the composition time offsets (in the 'ctts' box) to indicate a sample presentation time.
• the sample presentation time may correspond to the composition time or may correspond to the composition time adjusted by one or more edit lists (described in the 'elst' box).
• the composition time offset for a given sample is coded as the difference between the sample presentation time of this sample and the current sample delta (sum of the durations of the previous samples). This offset is usually coded on 32 bits (for example in the standard 'trun' or 'ctts' boxes), or on a smaller number of bits (8 to 32 bits) in the compact trun box.
  • the sample composition time (CT) offset is expressed in media timescale, which for video usually is a multiple of the framerate or a large number (e.g. 90k or 1 M), resulting in large composition offsets, which can be quite heavy in terms of signalling.
• an IBBP pattern repeated in a GOP may have the following decoding and composition times and offsets:
• CT offset: 10 30 0 0 30 0 0
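The relation between decoding times, composition times, and CT offsets can be illustrated with a small sketch (the times below are hypothetical, using a 3000-tick sample duration, and are not the values from the table above):

```python
def composition_offsets(decode_times, composition_times):
    """Composition-time offset per sample, as described above: the
    difference between a sample's composition (presentation) time and
    its decode time. All times are in media timescale units and the
    lists are given in decoding order."""
    return [ct - dt for dt, ct in zip(decode_times, composition_times)]

# Hypothetical reordered GOP at a 3000-tick frame duration: the two
# B frames are decoded after the P frame they precede in presentation.
decode_times      = [0, 3000, 6000, 9000]   # I, P, B, B in decoding order
composition_times = [0, 9000, 3000, 6000]
print(composition_offsets(decode_times, composition_times))
# [0, 6000, -3000, -3000] — negative offsets need a version-1 'ctts' box
```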
• a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
  • the fragment comprising a set of continuous sample(s) of the timed media data
  • the metadata comprising structured metadata items (e.g., boxes),
• a metadata item of the structured metadata items comprises an indication information (e.g., SAMPLE_CT_FRAME) indicating whether a composition time offset parameter is coded as a function of a sample duration or not; and,
• the indication of the composition_time_offset could be present in the sample description, for example as a specific flags value in the CompositionOffsetBox ('ctts'):
• sample-composition-time-offsets-frames: when set, this indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units. When this flags value is set, the fields of the box use half the bits compared to when it is not set (16 bits instead of 32) to benefit from the shorter code for the sample_offset.
• class CompositionOffsetBox extends FullBox('ctts', version, flags) {
• the trun box used to encapsulate the media fragments is a compact 'trun' box
• the following flags value is defined (in addition to existing ones) for the compact track run box. It is to be noted that this value, respectively this name, is just an example; any reserved or dedicated value, respectively name, not conflicting with an existing flags value, respectively name, can be used:
• when set, the composition offset is coded as a multiple of the sample duration (whatever the number of bits used), and shall be recomputed by multiplying the coded value by the sample duration.
• when not set, the composition offset is coded in timescale units.
• the packaging module 313 in Figure 3 can be informed that the encoding is done with a constant frame rate. In such a case, it sets the flags value and provides the composition time offsets as multiples of a sample duration, thus reducing the necessary number of bits.
• an additional flags value is defined for the track run pattern box as follows, in addition to the other existing flags values: 0x100000 SAMPLE_CT_FRAME: when this bit is set, it indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units.
• this flags value is just an example; any reserved or dedicated value, respectively name, not conflicting with an existing flag value, respectively name, can be used.
• the flag value indicating that the composition time offset is coded as a multiple of the sample duration can be inferred in specific cases where the flags in the box hierarchy describing the fragments (e.g. 'moof' or 'traf') indicate a default_duration or that the sample_duration is not present in the 'trun' box.
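Coding the composition offset as a multiple of the sample duration, as the SAMPLE_CT_FRAME-style flag above indicates, can be sketched as follows (illustrative helper names; a constant frame rate is assumed so every offset is an exact multiple of the duration):

```python
def code_ct_offsets_as_frames(ct_offsets, sample_duration):
    """Packager side: store each composition offset as a multiple of
    the sample duration (flag set), as described above. A sketch,
    assuming a constant-frame-rate stream."""
    coded = []
    for off in ct_offsets:
        if off % sample_duration != 0:
            raise ValueError("not a constant-frame-rate stream; "
                             "fall back to timescale units (flag not set)")
        coded.append(off // sample_duration)
    return coded

def decode_ct_offsets(coded, sample_duration):
    """Reader side: multiply the coded value back by the duration."""
    return [c * sample_duration for c in coded]

offsets = [0, 6000, -3000, -3000]           # timescale units, duration 3000
coded = code_ct_offsets_as_frames(offsets, 3000)
print(coded)                                 # [0, 2, -1, -1] — fits in far fewer bits
assert decode_ct_offsets(coded, 3000) == offsets
```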
  • the media data to encapsulate come with additional or auxiliary data.
  • it is a depth item of information accompanying a video stream.
  • it is auxiliary data describing encryption parameter per sample as used by MPEG Common Encryption (CENC).
• Fragmenting the media data and their auxiliary items of information may use the 'trun' plus the 'saiz' boxes to encapsulate these data in the same media file or media segment (as mandated for example in ISOBMFF).
• the current syntax for the 'saiz' box is as follows:
  • the MPEG Common Encryption scheme uses auxiliary data describing encryption parameter per sample.
  • This information typically includes the Initialization Vectors (IV) for the whole sample, or the IV and a list of clear and encrypted byte ranges in the sample (subsample encryption).
  • this information is empty and consequently omitted.
• this information shall be signaled through the sample auxiliary mechanism, using 'saiz' and 'saio' boxes (in the main movie or in movie fragments).
  • the size of auxiliary data can change in the following cases:
• SEI (Supplemental Enhancement Information);
• the protected samples will have an associated 'saiz' entry different from 0, while the unprotected samples will have an associated 'saiz' entry equal to 0. This may correspond to an area encrypted for privacy reasons or to an area where one has to pay to see the content of a particular area of interest;
  • the variations can take various aspects: repeated pattern, single slot variations or burst of the same value.
  • the resulting sizes usually cover a well- defined set of values, representing all the possible encryption/encoding configurations.
• a change of the auxiliary sample data size (default_sample_info_size) results in expanding the entire table, which is not very efficient. Therefore, according to particular embodiments, a new version of the 'saiz' box is defined, enabling simple run-length encoding addressing most use cases, and pattern description for cases where a pattern can be used.
• entry_count gives the number of entries in the box when version 1 is used;
• sample_count_in_entry gives the number of consecutive samples to which the si_rle_size applies. Samples are listed in decoding order. The same remarks as for sample_count apply;
• si_rle_size gives the size in bytes of the sample auxiliary info size for the samples in the current entry;
• pattern_count indicates the number of successive patterns in the pattern array that follows it.
• the sum of the included sample_pat_count[i] values indicates the number of mapped samples;
• pattern_length[i] corresponds to a pattern within the second array of si_pat_size[j] values.
• Each instance of pattern_length[i] shall be greater than 0;
• sample_pat_count[i] specifies the number of samples that use the i-th pattern; sample_pat_count[i] shall be greater than zero, and sample_pat_count[i] shall be greater than or equal to pattern_length[i];
• si_pat_size[j][k] is an integer that gives the size of the sample auxiliary info data for the samples in the pattern.
• when sample_pat_count[i] is greater than pattern_length[i], the si_pat_size[i][] values of the i-th pattern are used repeatedly to map the sample_pat_count[i] values. It is not necessarily the case that sample_pat_count[i] is a multiple of pattern_length[i]; the cycling may terminate in the middle of the pattern.
• the total of the sample_pat_count[i] values for all values of i in the range of 1 to pattern_count, inclusive, shall be equal to the total sample count of the track (if the box is present in the sample table) or of the track fragment.
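The run-length and pattern entries of the new 'saiz' version described above can be expanded into per-sample auxiliary info sizes as sketched below (hypothetical helper names; entries are given as plain tuples rather than parsed box fields):

```python
def expand_rle(entries):
    """Expand version-1 'saiz'-style run-length entries:
    (sample_count_in_entry, si_rle_size) pairs, samples in decoding order."""
    sizes = []
    for count, size in entries:
        sizes.extend([size] * count)
    return sizes

def expand_patterns(patterns):
    """Expand pattern entries: (sample_pat_count, [si_pat_size, ...]) pairs.
    Each pattern's size list is cycled over sample_pat_count samples; the
    cycling may terminate in the middle of the pattern, as noted above."""
    sizes = []
    for count, pattern in patterns:
        sizes.extend(pattern[i % len(pattern)] for i in range(count))
    return sizes

# 5 unprotected samples (size 0) followed by 3 protected ones (16-byte IVs):
print(expand_rle([(5, 0), (3, 16)]))     # [0, 0, 0, 0, 0, 16, 16, 16]
# a 2-value pattern applied to 5 samples, stopping mid-pattern:
print(expand_patterns([(5, [16, 32])]))  # [16, 32, 16, 32, 16]
```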
• a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
• the fragment comprising a set of continuous sample(s) of the timed media data, and comprising auxiliary information associated to the continuous samples;
• the metadata sub-item comprises a parameter determined as a function of a number of times a pattern is used; and, encapsulating the timed media data and the generated metadata.
• the optimized version of the 'saiz' box can be combined with any kind of 'trun' box: the standard 'trun' box, the compact 'trun' box, or the 'trun' box relying on patterns.
• the reordering indication can be combined with the further optimized compact 'trun' box according to embodiments of the invention. It can be combined with a compact 'trun' containing one of the proposed optimizations or all the proposed optimizations for better efficiency.
• the encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to embodiments of the invention.
  • the auxiliary data are advantageously placed at the beginning of the mdat.
  • the encryption information is always available whatever the number of video samples that is sent or received.
• the new unit to describe the composition_time_offset may be used with reordering information, whatever the type of 'trun' box in use: standard, compact, or relying on patterns.
• the encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to this invention.
  • the auxiliary data are advantageously placed at the beginning of the mdat.
  • the encryption information is always available whatever the number of video samples that is sent or received.
• the reordering indication can be combined with the further optimized 'trun' box relying on patterns according to embodiments of the invention.
• the encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to embodiments of the invention.
  • the auxiliary data are advantageously placed at the beginning of the mdat.
• the encryption information is always available whatever the number of video samples that is sent or received.
• the compact 'saiz' box may be used with any version of the 'trun' box: the standard 'trun' box, the compact 'trun' box, or the 'trun' box relying on patterns.
• the compact 'saiz' box may also be used when fragments are reordered as described with reference to Figures 4 to 7.
  • the auxiliary data are advantageously placed at the beginning of the mdat.
  • the encryption information is always available whatever the number of video samples that is sent or received.
• a media presentation description may contain DASH index segments (or indexed segments) describing, in terms of byte ranges, encapsulated ISOBMFF movie fragments, wherein each subsegment comprises a mapping of levels (L0, L1, L2) to byte ranges.
• DASH index file 805 (that may be described in DASH as an index segment or as an indexed segment) provides a mapping (e.g. sidx box 810) of time to byte ranges of encapsulated ISOBMFF segments 800.
• a mapping (e.g. ssix box 815) of levels (denoted L0, L1, and L2, referenced 820) to byte ranges is provided, the levels being declared in a level assignment box denoted 'leva' (not represented).
• the level assignment may be based on the sample group (the assignment_type of the leva box is set to the value 0 and its grouping_type is set, for example, to 'tele') describing sub-temporal layers and their dependencies.
• the so-indexed sub-segments provide a list of byte ranges to get samples of a given sub-temporal layer.
  • each level may be described as a SubRepresentation element (as illustrated in XML schema 825).
  • Figure 9 illustrates a first example of reordering and mapping samples having levels associated therewith.
  • the run of reordered samples 900 comprises first samples 905 that correspond to the most important samples for the decoding process. For example, they may correspond to random access points (RAP) or to reference pictures for other (less important) samples. These samples form a contiguous byte range (referenced 905).
  • the run of reordered samples 900 may contain other sets of samples such as the sets of samples 910 and 915, corresponding to samples having lower levels. Each of these sets corresponds to a contiguous byte range in the media data box of the file. A level is associated with each of them.
  • index 950 is a sub-segment index box according to ISOBMFF where each set of the corresponding samples 905, 910, and 915 is respectively mapped to a level.
  • each set of samples 905, 910, and 915 is associated with level 0, referenced 955, level 1, referenced 960, or level 2, referenced 965.
  • the media file may contain, associated with one or more movie fragments (indicated by the field ‘subsegment_count’ of the ‘ssix’ box), an index such as index 950. This index provides the byte range to a given level of samples.
  • the range_count field of ssix 950 is set to 3, corresponding to the number of levels for the samples in the subsegment. This is useful to describe the one or more movie fragments in a streaming manifest, for example in a DASH media presentation description, as a set of alternative encoded versions of the one or more movie fragments.
  • An additional box denoted ‘leva’ may also be available in the media file to indicate on which basis the levels are defined: for example on the basis of tracks, sub-samples, or sample groups. This depends on the kind of levels.
  • the samples 905 would correspond to RAP (Random Access Point) samples while the remaining samples (i.e. the concatenation of samples 910 and 915) would correspond to a single byte range of non-RAP samples, i.e. to the samples not mapped into the ‘rap’ sample group.
  • a first level may correspond to a region of interest and another level to the rest of the picture.
  • the samples associated with the level corresponding to the region of interest are considered as having a higher priority than the samples associated with the level corresponding to the rest of the picture.
  • the levels may, alternatively to ‘trif’ sample groups, be mapped into HEVC tile subtracks. It may also be of interest to map the levels in the leva box onto layers that may be present in the video bit-stream.
  • the layer information is encapsulated as a sample group called ‘linf’ for Layer information.
  • the assignment type of the leva box may be set to zero and its grouping type to the ‘linf’ sample group.
  • a dedicated assignment_type may be used for the leva box to indicate that levels map to layers and provide as an additional parameter the four-character code of a box providing layer information.
  • Using such an assignment type makes it possible to map levels to layers independently of the codec in use, i.e. in a generic way. For example, assuming the reordering of a track containing two layers, each with two temporal sub-layers, the levels of leva may be mapped as follows: {LID0, TID0}, {LID0, TID1}, {LID1, TID0}, {LID1, TID1}, where LID represents a layer identifier and TID represents a temporal sub-layer identifier.
  • level2 linf {LID0, TID1}
  • level3 linf {LID1, TID0}
  • level4 linf {LID1, TID1}
  • level2 linf {LID1, TID0}
  • level3 linf {LID0, TID1}
  • level4 linf {LID1, TID1}
  • the level assignment may benefit from additional signaling regarding the dependencies between levels, for example to clarify when a given level depends on another level, because a level N does not necessarily depend on level N-1.
  • when the assignment type is set to a dedicated value for layers, sub-layers, or layers and sub-layers, an additional field is provided to indicate a list of dependent levels. This provides a complete map of levels with their dependencies.
  • a new assignment_type value is defined for the ‘leva’ box to indicate that the levels correspond to a reordered track run box (compact or not).
  • the range_count in the ‘ssix’ shall not be greater than the sample count declared in the corresponding track run box (compact or not).
  • the association between a trun and the ranges may be done by ssix::subsample_number, which provides the index of the trun box. Having the trun reordered reduces the number of ranges to signal. For example, compared to the known DASH formats, each level for a given subsegment could be obtained by streaming clients from a single request.
  • LevelAssignmentBox, which provides a mapping from features, such as scalability layers, to levels (it being noted that a feature can be specified through a track, a sub-track within a track, a sample grouping of a track, or a reordered track run of a track), may then be modified, in the ‘leva’ box (section 8.8.13.2 of the DASH standard), as follows:
  • class LevelAssignmentBox extends FullBox('leva', 0, 0)
  • the semantics of the ‘leva’ box are updated with a new assignment_type value, for example the value 5.
  • the assignment_type indicates the mechanism used to specify the assignment to a level.
  • the assignment_type values greater than 5 are reserved, while the semantics for the other values are specified as follows.
  • a sequence of assignment_type values is restricted to be a set of zero or more of type 2 or 3, followed by zero or more of exactly one type.
  • the meaning of assignment_type values is as follows:
  • sample groups are used to specify levels, i.e., samples mapped to different sample group description indexes of a particular sample grouping lie in different levels within the identified track; other tracks are not affected and shall have all their data in precisely one level;
  • 2, 3: the level assignment is done by track (see the SubsegmentIndexBox for the difference in processing of these levels);
  • the respective level contains the samples for a sub-track.
  • the sub-tracks are specified through the SubTrackBox; other tracks are not affected and shall have all their data in precisely one level; and
  • 5: the respective level contains contiguous samples from a reordered track run.
  • the reordering is specified through ‘trun’ or ‘ctrn’ boxes having the interleave_index_size greater than 0 (for an indication based on a 2-bit flags value) or the sample_interleave bit set (for an indication based on a one-bit flags value).
  • the new assignment type to reordered samples in a trun box leads to levels with distinct byte ranges. As such, some levels may have dependencies on other levels. In case a self-contained or independent byte range is convenient (for example for single-request addressing), another specific value of assignment type may be reserved. For example, when assignment_type is set to 6 (or any value not already used for an assignment_type), the respective levels contain contiguous samples from a reordered track run and each level is self-contained.
  • a first level with this assignment_type would correspond to the set of samples 1005, a second level would correspond to the sets of samples 1005 and 1010, and a third level would correspond to the whole set of samples.
  • the ssix box is then modified to allow overlapping byte ranges for a given level when the assignment_type is set to 6. This avoids declaring dependencies in the streaming manifest and allows efficient access to a given level (single request for a given time interval). Since for this specific assignment type there may be levels that are self-contained and levels that are not, the leva box provides in the declaration of levels an indication of whether a given level is self-contained or not.
  • the leva box may be modified as follows:
  • class LevelAssignmentBox extends FullBox('leva', 0, 0)
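To make the ‘ssix’ level-to-byte-range mapping discussed above concrete, the sketch below parses the body of a Subsegment Index Box following the layout in ISO/IEC 14496-12 (version/flags, subsegment_count, then per-subsegment (level, range_size) entries). It is an illustrative simplification, not part of the claimed method; the function name is the editor's own, and 64-bit or nested cases are ignored.

```python
import struct

def parse_ssix_payload(payload: bytes):
    """Parse a simplified 'ssix' (SubsegmentIndexBox) body.

    Layout (big-endian), per ISO/IEC 14496-12:
      unsigned int(8)  version; unsigned int(24) flags;
      unsigned int(32) subsegment_count;
      for each subsegment:
        unsigned int(32) range_count;
        for each range:
          unsigned int(8)  level;
          unsigned int(24) range_size;
    Returns a list of subsegments, each a list of (level, range_size) pairs.
    """
    _version_flags, subsegment_count = struct.unpack_from(">II", payload, 0)
    pos = 8
    subsegments = []
    for _ in range(subsegment_count):
        (range_count,) = struct.unpack_from(">I", payload, pos)
        pos += 4
        ranges = []
        for _ in range(range_count):
            (word,) = struct.unpack_from(">I", payload, pos)
            pos += 4
            ranges.append((word >> 24, word & 0xFFFFFF))
        subsegments.append(ranges)
    return subsegments
```

With samples reordered by level as proposed, a subsegment with three levels yields exactly three (level, range_size) entries, one contiguous range per level.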


Abstract

According to embodiments, the invention provides a method for encapsulating encoded media data from a server to a client, the media data being requested by the client, the method being carried out by the server and comprising: obtaining samples of the encoded media data, the samples of the encoded media data being ordered according to a first ordering; and encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples upon decapsulation, and reordering information associated with the encapsulated samples for reordering the encapsulated samples according to the first ordering upon decapsulation.

Description

METHOD, DEVICE, AND COMPUTER PROGRAM FOR IMPROVING
TRANSMISSION OF ENCODED MEDIA DATA
FIELD OF THE INVENTION
The present invention relates to methods and devices for improving transmission of encoded media data and to methods and devices for encapsulating and parsing media data.
BACKGROUND OF THE INVENTION
The invention relates to encapsulating, parsing and streaming media content, e.g. according to the ISO Base Media File Format as defined by the MPEG standardization organization, to provide a flexible and extensible format that facilitates interchange, management, editing, and presentation of groups of media content and to improve its delivery, for example over an IP network such as the Internet, using an adaptive HTTP streaming protocol.
The International Organization for Standardization Base Media File Format (ISO BMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media data bitstreams either for local storage or for transmission via a network or another bitstream delivery mechanism. This file format has several extensions, e.g. Part 15, ISO/IEC 14496-15, that describes encapsulation tools for various NAL (Network Abstraction Layer) unit based video encoding formats. Examples of such encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding) and L-HEVC (Layered HEVC). Another example of file format extension is the Image File Format, ISO/IEC 23008-12, that describes encapsulation tools for still images or sequences of still images such as HEVC Still Image. This file format is object-oriented. It is composed of building blocks called boxes (data structures characterized by a four-character code) that are sequentially or hierarchically organized and that define descriptive parameters of the encoded timed media data bitstream such as timing and structure parameters. In the file format, the overall presentation over time is called a movie. The movie is described by a movie box (with four-character code ‘moov’) at the top level of the media or presentation file. This movie box represents an initialization information container containing a set of various boxes describing the presentation. It is logically divided into tracks represented by track boxes (with four-character code ‘trak’). Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media data pertaining to the presentation (frames of video, for example). Within each track, each timed unit of data is called a sample; this might be a frame of video, audio or timed metadata. Samples are implicitly numbered in sequence.
The actual sample data are stored in boxes called Media Data Boxes (with four-character code ‘mdat’) at the same level as the movie box. The movie may also be fragmented, i.e. organized temporally as a movie box containing information for the whole presentation followed by a list of movie fragment and Media Data box pairs. Within a movie fragment (box with four-character code ‘moof’) there is a set of track fragments (box with four-character code ‘traf’), zero or more per movie fragment. The track fragments in turn contain zero or more track run boxes (with four-character code ‘trun’), each of which documents a contiguous run of samples for that track fragment.
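The box hierarchy described above (size/type headers, sequentially laid out) can be walked with a very small parser. The sketch below is an illustrative simplification by the editor, not the invention's method: it handles only the common 32-bit size form, ignoring the 64-bit largesize and the size == 0 ("to end of file") cases.

```python
import struct

def iter_boxes(data: bytes, start: int = 0, end: int = None):
    """Yield (four_cc, payload_offset, payload_size) for each ISOBMFF box
    found in data[start:end]. Each box begins with a 32-bit big-endian size
    (covering the 8-byte header) followed by a four-character code."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, fourcc = struct.unpack_from(">I4s", data, pos)
        yield fourcc.decode("ascii"), pos + 8, size - 8
        pos += size  # advance to the next sibling box
```

Child boxes (e.g. ‘trak’ inside ‘moov’) can be enumerated by calling the same function on a parent's payload range.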
The MPEG Common Media Application Format (MPEG CMAF, ISO/IEC 23000-19) derives from ISOBMFF and provides an optimized file format for streaming delivery. CMAF specifies CMAF addressable media objects derived from encoded CMAF fragments, which can be referenced as resources by a manifest. A CMAF fragment is an encoded ISOBMFF media segment, i.e. one or more Movie Fragment Boxes (‘moof’, ‘traf’, etc.) with their associated media data ‘mdat’ and other possible associated boxes. CMAF also defines the CMAF chunk, which is a single pair of ‘moof’ and ‘mdat’ boxes. CMAF also defines CMAF segments, which are addressable media resources containing one or more CMAF fragments.
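The moof/mdat pairing that defines a CMAF chunk can be illustrated with a toy grouping over a segment's top-level box sequence. This is an editor's sketch under a simplifying assumption: non-media boxes such as ‘styp’ are attached to the chunk that follows them, which is not mandated by the CMAF specification.

```python
def split_into_cmaf_chunks(boxes):
    """Group a sequence of top-level box four-character codes into CMAF
    chunks. A CMAF chunk is a single ('moof', 'mdat') pair; a CMAF fragment
    is one or more such pairs. Each 'mdat' closes the current chunk."""
    chunks, current = [], []
    for fourcc in boxes:
        current.append(fourcc)
        if fourcc == "mdat":
            chunks.append(current)
            current = []
    return chunks
```

A segment holding two pairs thus splits into two chunks that a low-latency client could consume one at a time.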
Media data encapsulated with ISOBMFF or CMAF can be used for adaptive streaming with HTTP. For example, MPEG DASH (for “Dynamic Adaptive Streaming over HTTP”) and Smooth Streaming are HTTP adaptive streaming protocols that allow segment or fragment based delivery of media files. The MPEG DASH standard (see “ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats”) makes it possible to create an association between a compact description of the content(s) of a media presentation and the HTTP addresses. Usually, this association is described in a file called a manifest file or description file. In the context of DASH, this manifest file is also called the MPD file (for Media Presentation Description). When a client device gets the MPD file, the description of each encoded and deliverable version of media content can easily be determined by the client. By reading or parsing the manifest file, the client is aware of the kind of media content components proposed in the media presentation and is aware of the HTTP addresses for downloading the associated media content components. Therefore, it can decide which media content components to download (via HTTP requests) and to play (decoding and playing after reception of the media data segments). DASH defines several types of segments, mainly initialization segments, media segments and index segments. An initialization segment contains setup information and metadata describing the media content, typically at least the ‘ftyp’ and ‘moov’ boxes of an ISOBMFF media file. A media segment contains the media data. It can be for example one or more ‘moof’ plus ‘mdat’ boxes of an ISOBMFF file or a byte range in the ‘mdat’ box of an ISOBMFF file. It can be for example a CMAF segment or an ISOBMFF segment. A Media Segment may be further subdivided into Subsegments (also corresponding to one or more complete ‘moof’ plus ‘mdat’ boxes).
The DASH manifest may provide segment URLs or a base URL to the file with byte ranges to segments for a streaming client to address these segments through HTTP requests. The byte range information may be provided by index segments or by specific ISOBMFF boxes like the Segment Index Box ‘sidx’ or the SubSegment Index Box ‘ssix’.
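As a rough illustration of how a client turns ‘sidx’ information into an HTTP Range request, the sketch below sums the referenced sizes of the preceding subsegments to locate one subsegment's byte range. The flat list-of-sizes input is an editor's simplification: a real ‘sidx’ entry also carries durations, SAP information, and possibly references to nested index boxes.

```python
def byte_range_for_subsegment(referenced_sizes, index, first_offset):
    """Return the inclusive (begin, end) byte offsets of subsegment `index`.

    referenced_sizes: referenced_size of each subsegment, in order,
                      as read from a 'sidx' box
    first_offset:     anchor point, i.e. offset of the first subsegment
                      (end of the 'sidx' box plus its first_offset field)
    """
    begin = first_offset + sum(referenced_sizes[:index])
    end = begin + referenced_sizes[index] - 1  # HTTP Range ends are inclusive
    return begin, end
```

The resulting pair maps directly onto a `Range: bytes=begin-end` request header.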
In a classic adaptive streaming over HTTP session, it may happen that a client aborts the transfer of a media segment that cannot be delivered on time. This is especially true when working with low buffer levels. Clients usually handle this situation as follows:
- if enough time remains until the due display time of the next segment, another, lower quality of that segment is fetched. This may arise when the download was cancelled early enough. The player can only hope to have enough time to fetch the alternate quality;
- if not enough time remains, no alternative version of the segment is fetched.
In both cases, if the segment is not fully downloaded, the player either loses the entire segment or tries to decode what was received. This will result in a display freeze, whose duration depends on the amount of lost data.
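The two cases above amount to a simple time-budget decision in the player. The sketch below is a hypothetical illustration by the editor; a real client would derive both inputs from its buffer level and bandwidth estimator rather than receive them as plain numbers.

```python
def on_segment_timeout(time_left_s, alt_download_estimate_s):
    """Decide what a streaming client does after aborting a late segment.

    time_left_s:             seconds until the segment's due display time
    alt_download_estimate_s: estimated time to fetch a lower-quality version
    Returns the action mirroring the two cases described in the text.
    """
    if time_left_s > alt_download_estimate_s:
        return "fetch_lower_quality"  # cancel happened early enough
    return "skip_segment"             # risks a display freeze
```

Either outcome can still end in a freeze if the alternate download runs late, which motivates the priority-ordered storage the invention proposes.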
In another scheme for adaptive streaming over an HTTP session, the media presentation description contains DASH index segments (or indexed segments) describing, in terms of byte ranges, the encapsulated ISOBMFF movie fragments. A mapping of time to byte ranges (sidx box) may be provided in DASH index segments (or indexed segments). For each subsegment, a mapping of levels (L0, L1, L2) to byte ranges (ssix box) may be provided, the levels being declared in a level assignment (‘leva’) box. The level assignment may be based on the sample group (the assignment_type of the leva box is set to the value 0 and its grouping type is set, for example, to ‘tele’) describing sub-temporal layers. The so-indexed sub-segments provide a list of byte ranges to get samples of a given sub-temporal layer. When described in DASH, for streaming, each level may be described as a SubRepresentation element (e.g. an XML schema). However, as the byte ranges for each level lead to multiple byte ranges, access to a specific level is not optimal, because it requires multiple requests from the streaming clients.
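The inefficiency just described can be made concrete: when samples of different levels are interleaved in storage order, each level maps to several disjoint byte ranges, whereas storing samples grouped by level collapses each level to a single range (one HTTP request). The sketch below is an editor's illustration, not the claimed method.

```python
def ranges_per_level(samples):
    """samples: list of (level, size) pairs in storage order.
    Returns {level: [(begin, end), ...]} of contiguous byte ranges,
    merging a range with the previous one for the same level when
    they are adjacent."""
    out, pos = {}, 0
    for level, size in samples:
        ranges = out.setdefault(level, [])
        if ranges and ranges[-1][1] == pos:
            ranges[-1] = (ranges[-1][0], pos + size)  # extend last range
        else:
            ranges.append((pos, pos + size))
        pos += size
    return out
```

With interleaved storage a two-level run needs two ranges per level; after reordering by level, each level is one range.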
The present invention has been devised to address one or more of the foregoing concerns and more generally to improve transmission of encoded media data.
SUMMARY OF THE INVENTION
According to a first aspect of the invention there is provided a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, wherein a metadata item comprises a flag indicating whether a data offset is coded on a predetermined size or not, the data offset referring to the timed media data; and,
encapsulating the timed media data and the generated metadata.
Accordingly, the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
According to a second aspect of the invention there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable size, wherein the metadata comprises an indication information indicating whether a sample count field is present or not; and, encapsulating the timed media data and the generated metadata.
Accordingly, the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data. According to a third aspect of the invention there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable coding size, wherein the metadata comprises a flag indicating the coding size of the configurable parameter; and,
encapsulating the timed media data and the generated metadata.
Accordingly, the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
According to a fourth aspect of the invention there is provided a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, wherein a metadata item of the structured metadata items comprises an indication information indicating whether a composition time offset parameter is coded as a multiple of a sample duration or of a time scale; and,
encapsulating the timed media data and the generated metadata.
Accordingly, the method of the invention makes it possible to optimize coding of the description data when encapsulating timed media data.
In an embodiment, the samples of the set of contiguous samples of the timed media data are ordered according to a first ordering, the samples of the set of contiguous samples are encapsulated according to a second ordering, the second ordering depending on a priority level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation, and the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
In an embodiment, the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples. In an embodiment, each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
In an embodiment, the samples of the set of contiguous samples are encapsulated using the generated metadata.
In an embodiment, the method further comprises obtaining a priority map associated with the samples of the set of contiguous samples, the reordering information being determined as a function of the obtained priority map.
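One possible reading of the priority-map embodiment above is a stable sort of the fragment's samples by priority level, the sorted index list itself serving as the reordering information. This is an editor's sketch under that assumption, not the claimed encoding of the metadata.

```python
def encapsulation_order(priorities):
    """Order samples for storage by priority level (lower value = more
    important), ties broken by original decoding position. For stored
    sample k, the returned list gives its index in the original decoding
    order -- exactly the list of position values that the reordering
    metadata must carry for later de-encapsulation."""
    return sorted(range(len(priorities)), key=lambda i: (priorities[i], i))
```

For a fragment whose samples carry priorities [0, 2, 1, 0, 2, 1], the most important samples (priority 0) are written first.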
In an embodiment, the format of the encapsulated timed media data is of the ISOBMFF type or of the CMAF type.
According to a fifth aspect of the invention there is provided a method for encapsulating timed media data, the method comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data, the samples of the set of contiguous samples being ordered according to a first ordering;
generating metadata describing the obtained fragment; and
encapsulating samples of the set of contiguous samples and the generated metadata, samples of the set of contiguous samples being encapsulated according to a second ordering, the second ordering depending on a level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation,
wherein the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
Accordingly, the method of the invention makes it possible to reduce the description cost of fragmented media data, in particular of fragmented media data conforming to ISOBMFF, and to provide a flexible organisation (reordering) of the media data (samples) with limited signalling overhead. Fragmenting the data and ordering the samples according to a level associated with each sample enable transmission of particular samples first. Such a level may be, for example, a temporal level, a spatial level, a quality level, a level directed to a region of interest, or a priority level.
According to a sixth aspect of the invention there is provided a method for encapsulating encoded media data, the method comprising:
obtaining samples of the encoded media data, the samples of the encoded media data being ordered according to a first ordering; and encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples, upon decapsulation; and reordering information associated with the encapsulated samples for re- ordering the encapsulated samples according to the first ordering, upon decapsulation.
Accordingly, the method of the invention makes it possible to reduce the description cost of fragmented media data, in particular of fragmented media data conforming to ISOBMFF, and to provide a flexible organisation (reordering) of the media data (samples) with limited signalling overhead. Fragmenting the data and ordering the samples according to a priority level associated with each sample enable transmission of the most important samples first, which reduces freezing of the video display when temporal sublayers are split over fragments and transmission errors occur. In addition, the method of the invention makes it possible to lower the number of different byte ranges with different FEC (forward error correction) settings, hence simplifying and improving the FEC part.
In an embodiment, the media data are timed media data and the obtained samples of the encoded media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
In an embodiment, the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
In an embodiment, each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
In an embodiment, the obtained samples are encapsulated using the metadata associated with the samples.
In an embodiment, the method further comprises obtaining a priority map associated with the obtained samples, the reordering information being determined as a function of the obtained priority map.
In an embodiment, obtaining samples of the encoded media data comprises obtaining samples of the media data and encoding the obtained samples of the media data.
In an embodiment, the priority levels are obtained from the encoding of the obtained samples of the media data. In an embodiment, the priority levels are determined as a function of dependencies between the obtained samples of the media data.
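One way to derive priority levels from sample dependencies, consistent with the embodiment above, is to treat dependency-free samples (e.g. IDR/RAP pictures) as level 0 and each other sample as one level deeper than its deepest dependency. This rule and the function below are the editor's illustrative assumption, not the claimed derivation.

```python
def priority_levels(deps):
    """deps[i] = list of indices of samples that sample i directly depends
    on. Returns one priority level per sample: 0 for samples with no
    dependencies, otherwise 1 + the maximum level among dependencies.
    Lower level = more important for the decoding process."""
    memo = {}

    def level(i):
        if i not in memo:
            memo[i] = 0 if not deps[i] else 1 + max(level(j) for j in deps[i])
        return memo[i]

    return [level(i) for i in range(len(deps))]
```

For instance, an I picture, a P picture referencing it, and B pictures referencing both end up on levels 0, 1, and 2 respectively.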
According to a seventh aspect of the invention there is provided a method for transmitting encoded media data from a server to a client, the media data being requested by the client, the method being carried out by the server and comprising encapsulating the encoded media data according to the method described above and transmitting, to the client, the encapsulated encoded media data.
The seventh aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to an eighth aspect of the invention there is provided a method for processing encapsulated media data, the encapsulated media data comprising encoded samples and metadata, the metadata comprising reordering information, the method comprising:
obtaining,
samples of the encapsulated media data, the obtained samples of the encapsulated media data being ordered according to a second ordering; and
reordering information; and
reordering the obtained samples in a first ordering according to the obtained reordering information, the first ordering making it possible for the obtained samples to be decoded.
The eighth aspect of the present invention has advantages similar to the first above-mentioned aspect.
In an embodiment, the media data are timed media data and the obtained samples of the encapsulated media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
In an embodiment, the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
In an embodiment, reordering the obtained samples comprises computing offsets as a function of the parameter values and of coding lengths of the encoded samples.
In an embodiment, the method further comprises decoding the reordered samples.
In an embodiment, the method is carried out in a client, the samples of the encapsulated media data and the reordering information being received from a server. In an embodiment, the format of the encapsulated media data is of the ISOBMFF type or of the CMAF type.
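The client-side processing described above (computing offsets from sample sizes, then reordering) can be sketched as follows. This is an editor's illustration under the assumption that the metadata carries, for each stored sample, its position in the original decoding order; it is not the claimed parser.

```python
def restore_decoding_order(stored_sizes, reorder_info):
    """Recover the decoding-order layout of reordered samples.

    stored_sizes: byte size of each sample as stored in 'mdat'
    reorder_info: for stored sample k, its index in the original decoding
                  order (the list of parameter values from the metadata)
    Returns, for each decoding position, the (offset, size) of the stored
    sample to read; offsets are running sums of the stored sizes."""
    offsets, pos = [], 0
    for size in stored_sizes:
        offsets.append(pos)
        pos += size
    out = [None] * len(stored_sizes)
    for k, original_index in enumerate(reorder_info):
        out[original_index] = (offsets[k], stored_sizes[k])
    return out
```

The reader can then feed the samples to the decoder in the first (decoding) ordering regardless of the priority-driven storage order.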
According to a ninth aspect of the invention there is provided a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
The ninth aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to a tenth aspect of the invention there is provided a media storage device storing a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
The tenth aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to an eleventh aspect of the invention there is provided a device for transmitting or receiving encapsulated media data, the device comprising a processing unit configured for carrying out each of the steps of the method described above.
The eleventh aspect of the present invention has advantages similar to the first above-mentioned aspect.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
Embodiments of the invention are described below, by way of examples only, and with reference to the following drawings in which:
Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages;
Figure 2 describes the protocol stack according to embodiments of the invention;
Figure 3 illustrates a typical client server system for media streaming according to embodiments of the invention;
Figure 4 illustrates an example of processing carried out in media server and in media client, according to embodiments;
Figure 5a illustrates an example of dependencies of video frames, that are to be taken into account for coding or decoding a frame;
Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps;
Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream;
Figure 6b is an example of a data structure used for reordering samples;
Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream in an encoded video stream;
Figure 8 illustrates DASH index segments (or indexed segments) describing, in terms of byte ranges, encapsulated ISOBMFF movie fragments, wherein each subsegment comprises a mapping of levels to byte ranges;
Figures 9 and 10 illustrate examples of reordering and mapping of samples having levels associated therewith; and
Figure 11 schematically illustrates a processing device configured to implement at least one embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
A video bit-stream is usually organized into Groups Of Pictures (GOP). This is the case, for example, with MPEG video compression standards like AVC/H.264 or HEVC/H.265. For example, a classical I⁰P⁰B¹B²B³ layout at 25 frames per second, with Bᴺ having no dependencies on Bᴺ⁺¹, N being an indication of a level, e.g. temporal layer or scalability layer or set of samples with a given priority level, is often encapsulated in a media file or media segments in the decoding order, for example as follows for a one-second video:
I₁⁰ P₂₅⁰ B₅¹ B₃² B₂³ B₄³ B₉¹ B₇² B₆³ B₈³ B₁₃¹ B₁₁² B₁₀³ B₁₂³ B₁₇¹ B₁₅² B₁₄³ B₁₆³ B₂₁¹ B₁₉² B₁₈³ B₂₀³ B₂₃² B₂₂³ B₂₄³
where the subscript indicates the composition (or presentation) order and the superscript indicates the level.
To improve robustness or the user experience, the inventors have observed that some applications benefit from a different sample organization in the data part of the media file (‘mdat’ box). An expected layout may be based on priority level values, as follows:
I₁⁰ P₂₅⁰ B₅¹ B₉¹ B₁₃¹ B₁₇¹ B₂₁¹ B₃² B₇² B₁₁² B₁₅² B₁₉² B₂₃² B₂³ B₄³ B₆³ B₈³ B₁₀³ B₁₂³ B₁₄³ B₁₆³ B₁₈³ B₂₀³ B₂₂³ B₂₄³
To achieve such an organization of the samples using the standard ‘trun’ box, one ‘trun’ may be used each time the sample continuity is broken, i.e. each time a sample in the expected layout (ordered according to the priority level values) is not the sample following the previous sample in decoding order, as follows:
TRUN I₁⁰ P₂₅⁰ B₅¹ TRUN B₃² TRUN B₂³ B₄³ TRUN B₉¹ TRUN B₇² TRUN B₆³ B₈³ TRUN B₁₃¹ TRUN B₁₁² TRUN B₁₀³ B₁₂³ TRUN B₁₇¹ TRUN B₁₅² TRUN B₁₄³ B₁₆³ TRUN B₂₁¹ TRUN B₁₉² TRUN B₁₈³ B₂₀³ TRUN B₂₃² TRUN B₂₂³ B₂₄³
However, such a use of the standard ‘trun’ box induces a non-negligible extra description cost (in this case 17 × 20 = 340 bytes, close to 3 kbits/s). This extra description cost also appears for media resources where the access units have smaller and smaller sizes. This is the case, for example, when a video stream is spatially split into sub-picture tracks or tiles.
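The seventeen-run count above can be checked with a short sketch (the helper name is illustrative, and the frame lists restate the two layouts given above): a new run starts whenever the next sample in decoding order is not stored contiguously in the ‘mdat’ layout.

```python
# Sketch (hypothetical helper): reproduce the count of standard 'trun'
# boxes needed when the 'mdat' payload stores samples in the priority
# layout while each run must describe byte-contiguous, consecutive
# samples in decoding order.

# Frames in decoding order (subscripts of the example above).
DECODING = ["I1", "P25", "B5", "B3", "B2", "B4", "B9", "B7", "B6", "B8",
            "B13", "B11", "B10", "B12", "B17", "B15", "B14", "B16",
            "B21", "B19", "B18", "B20", "B23", "B22", "B24"]
# Frames as laid out in 'mdat', ordered by priority level.
MDAT = ["I1", "P25", "B5", "B9", "B13", "B17", "B21",
        "B3", "B7", "B11", "B15", "B19", "B23",
        "B2", "B4", "B6", "B8", "B10", "B12", "B14",
        "B16", "B18", "B20", "B22", "B24"]

def count_runs(decoding, mdat):
    pos = {name: i for i, name in enumerate(mdat)}
    runs = 1  # the first sample always opens a run
    for prev, cur in zip(decoding, decoding[1:]):
        if pos[cur] != pos[prev] + 1:  # byte continuity broken: new 'trun'
            runs += 1
    return runs

print(count_runs(DECODING, MDAT))  # 17, i.e. 17 x 20 = 340 extra bytes
```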
To cope with such a problem, the ‘trun’ box is improved, in particular by adding a sample processing order that avoids such a repetition of the ‘trun’ box.
It is to be recalled that different types of ‘trun’ boxes are defined. ISO/IEC 14496-12 defines the Track Extends Box to define default values used by the movie fragments in a media file or in a set of media segments. The track fragment header box ‘tfhd’ also sets up information and default values for the runs of samples contained in one movie fragment. A run of samples is described in a ‘trun’ box by one or more parameters such as a number of samples, an optional data offset, optional dependency information related to the first sample in the run, and, for each sample in the run, optional sample_duration, sample_size, sample_flags (for dependency/priority), and composition time information. While this may be relevant for sample duration, it is not adapted for sample size, especially when samples correspond to video frames. The MPEG group defining ISO/IEC 14496-12 is considering further improvements of the ‘trun’ box. A compact version of the ‘trun’ box (‘ctrn’) is under definition. This compact ‘trun’ box changes the standard ‘trun’ box as follows:
(a) it provides some configurable field sizes (where the standard ‘trun’ systematically used 32 bits per parameter);
(b) it changes the loop structure to being a struct that contains a set of arrays, rather than an array of a struct with varying-size fields; and
(c) it allows for first-sample variation of all fields, not just the sample_flags parameter (as, for example, the first sample_size may be much larger).
A track fragment may contain zero or more standard ‘trun’ or compact ‘ctrn’ boxes.
In addition to the compact ‘trun’ box ‘ctrn’, new versions of the ‘trun’ box are also under consideration in MPEG, in particular ‘trun’ boxes relying on patterns. These versions provide the following features:
- a capability for indicating the sample flags, sample duration, sample size, and sample composition time offset, each with a configurable number of bytes, for the first sample of a track run. The use of this feature can be controlled with a flags value of the TrackRunBox (i.e. the tr_flags); and
- providing one or more track run patterns of per-sample metadata (in a new TrackRunPatternBox identified by the four-character code ‘trup’) in the MovieExtendsBox or MovieFragmentBox. The TrackRunPatternBox enables cyclic assignment of repetitive track run patterns to samples of track runs. One or more track run patterns are specified in the ‘trup’ box. For each sample in a track run pattern, the sample_duration, sample_flags, sample_composition_time_offset and the number of bits to encode the sample_size are conditionally provided depending on the box flags.
According to embodiments, a sample processing order is indicated in an encapsulated media data file or in a companion file (e.g. a companion file referencing an encapsulated media data file) to give information about data significance of encapsulated data of the encapsulated media data file, the encapsulated data typically comprising media data and descriptive metadata, so that these encapsulated data may be handled appropriately.
The sample processing order may be used at the server end to organise samples of a fragment of encoded media data, according to their priority, for example for transmitting the most important samples first. At the client end, the sample processing order is used to parse a received encapsulated stream and to provide a decodable stream.
The encapsulated media data may be directed to different kinds of media resources or media components such as an image sequence, one or more video tracks with or without associated audio tracks, auxiliary or metadata tracks.
According to embodiments, the sample processing order associated with a file comprising encapsulated media data is defined in the ‘trun’ box.
Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages. As illustrated, the client denoted 100 sends an HTTP message denoted 140 to the server denoted 110, through a connection denoted 130 established over a network denoted 120.
According to HTTP, the client sends an HTTP request to the server that replies with an HTTP response. Both HTTP request and HTTP response are HTTP messages. For the sake of illustration, HTTP messages can be directed to the exchange of media description information, the exchange of media configuration or description, or the exchange of actual media data. The client may thus be a sender and a receiver of HTTP messages. Likewise, the server may be a sender and a receiver of HTTP messages.
No distinction is made hereafter between HTTP requests and HTTP responses. However, it is generally expected that HTTP requests are sent on a reliable basis while some HTTP responses may be sent on an unreliable basis. Indeed, a common use-case for the unreliable transmission of HTTP messages corresponds to the case according to which the server sends back to the client a media stream in an unreliable way. However, in some cases, the HTTP client could also send an HTTP request in an unreliable way, for example for sending a media stream to the server. At some point, the HTTP client and the HTTP server can also negotiate that they will run in a reliable mode. In such a case, both HTTP requests and responses are sent in a reliable way.
Figure 2 illustrates an example of the protocol stacks of a client 200, for example client 100 of Figure 1, and of a server 250, for example server 110 of Figure 1.
The same protocol stack exists on both client 200 and server 250, making it possible to exchange data through a communication network.
At the client’s end (200), the protocol stack receives, from application 205, a message to be sent through the network, for example message 140. At the server’s end (250), the message is received from the network and, as illustrated, the received message is processed at transport level 275 and then transmitted up to application 255 through the protocol stack that comprises several layers.
At the client’s end, the protocol stack contains the application, denoted 205, at the top level. For the sake of illustration, this can be a web application, e.g. a client part running in a web browser. In a particular embodiment, the application is a media streaming application, for example using DASH protocol, to stream media data encapsulated according to ISO Base Media File Format. Underneath is an HTTP layer denoted 210, which implements the HTTP protocol semantics, providing an API (application programming interface) for the application to send and receive messages. Underneath is a transport adaptation layer (TA layer or TAL). The TAL may be divided into two sublayers: a stream sublayer denoted 215 (TAL-stream, TA Stream sublayer, or TAS sublayer) and a packet sublayer denoted 220 (TAL-packet, TA Packet sublayer, or TAP sublayer), depending on whether the transport layer manipulates streams and packets or only packets. These sublayers enable transport of HTTP messages on top of the UDP layer denoted 225.
At the server’s end, the protocol stack contains the same layers. For the sake of illustration, the top level application, denoted 255, may be the server part running in a web server. The HTTP layer denoted 260, the TAS sublayer denoted 265, the TAP sublayer denoted 270, and the UDP layer denoted 275 are the counterparts of the layers 210, 215, 220, and 225, respectively.
From a physical point of view, an item of information to be exchanged between the client and the server is obtained at a given level at the client’s end. It is transmitted through all the lower layers down to the network, is physically sent through the network to the server, and is transmitted through all the lower layers at the server’s end up to the same level as the initial level at the client’s end. For example, an item of information obtained at the HTTP layer from the application layer is encapsulated in an HTTP message. This HTTP message is then transmitted to TA stream sublayer 215, which transmits it to TA Packet sublayer 220, and so on down to the physical network. At the server’s end, the HTTP message is received from the physical network and transmitted to TA Packet sublayer 270, through TA Stream sublayer 265, up to HTTP layer 260, which decodes it to retrieve the item of information so as to provide it to application 255.
From a logical point of view, a message is generated at any level, transmitted through the network, and received by the server at the same level. From this point of view, all the lower layers are an abstraction that makes it possible to transmit a message from a client to a server. This logical point of view is adopted below.
According to embodiments, the transport adaptation layer (TAL) is a transport protocol built on top of UDP and targeted at transporting HTTP messages.
At a higher level, TAS sublayer provides streams that are bi-directional logical channels. When transporting HTTP messages, a stream is used to transport a request from the client to the server and the corresponding response from the server back to the client. As such, a TA stream is used for each pair of request and response. In addition, one TA stream associated with a request and response exchange is dedicated to carrying the request body and the response body.
All the header fields of the HTTP requests and responses are carried by a specific TA stream. These header fields may be encoded using HPACK when the version of HTTP in use is HTTP/2 (HPACK is a compression format for efficiently representing HTTP header fields, to be used in HTTP/2).
To transfer data belonging to a TA stream, data may be split into TA frames. One or more TA frames may be encapsulated into a TA packet, which may itself be encapsulated into a UDP packet to be transferred between the client and the server. There are several types of TA frames: STREAM frames carry data corresponding to TA streams, ACK frames carry control information about received TA packets, and other frames are used for controlling the TA connection. There are also several types of TA packets, one of which is used to carry TA frames.
Figure 3 illustrates an example of a client-server system wherein embodiments of the invention may be implemented. It is to be noted that the implementation of the invention is not limited to such a system as it may concern the generation of media files that may be distributed in any way, not only by streaming over a communication network but also for local storage and rendering by a media player. As illustrated, the system comprises, at the server’s end, media encoders 300, in particular a video encoder, a media packager 310 to encapsulate data, and a media server 320. According to the illustrated example, media packager 310 comprises a NALU (NAL Unit) parser 311, a memory 312, and an ISOBMFF writer 313. It is to be noted that the media packager 310 may use a file format other than ISOBMFF. The media server 320 can generate a manifest file (also known as a media presentation description (MPD) file) 321 and media segments 322.
At the client’s end, the system further comprises media client 350 having an ISOBMFF parser 352, media decoders 353, in particular a video decoder, a display 354, and an HTTP client 351 that supports adaptive HTTP streaming, in particular parsing of a streaming manifest, denoted 359, to control the streaming of media segments 390. According to the illustrated example, media client 350 further contains a transformation module 355, which is a module capable of performing operations on encoded bit-streams (e.g. concatenation) and/or decoded pictures (e.g. post-filtering, cropping, etc.).
Typically, media client 350 requests manifest file 321 in order to get the description of the different media representations available on media server 320 that compose a media presentation. In response to receiving the manifest file, media client 350 requests the media segments (denoted 322) it is interested in. These requests are made via HTTP module 351. The received media segments are then parsed by ISOBMFF parser 352, decoded by video decoder 353, and optionally transformed or post-processed in transformation unit 355, to be played on display 354.
A video sequence is typically encoded by a video encoder of media encoders 300, for example a video encoder of the H.264/AVC or H.265/HEVC type. The resulting bit-stream is encapsulated into one or several files by media packager 310 and the generated files are made available to clients by media server 320.
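For these codecs, the bit-stream handed to the packager is typically an Annex B byte stream; a minimal, non-normative sketch of splitting such a stream into NAL units on start codes (the helper name is illustrative) is:

```python
# Illustrative sketch: split an Annex B byte stream into NAL units on
# 0x000001 / 0x00000001 start codes, as a NALU parser would do before
# the units are encapsulated as samples.

def split_nalus(stream: bytes):
    units, i, prev = [], 0, None
    n = len(stream)
    while i + 3 <= n:
        if stream[i:i + 3] == b"\x00\x00\x01":
            if prev is not None:
                end = i
                # a trailing zero belongs to a 4-byte start code
                if end > prev and stream[end - 1] == 0:
                    end -= 1
                units.append(stream[prev:end])
            i += 3
            prev = i
        else:
            i += 1
    if prev is not None:
        units.append(stream[prev:])
    return units

sample = b"\x00\x00\x00\x01\x67\x01\x00\x00\x01\x41\x02\x03"
print(split_nalus(sample))  # two units: b'\x67\x01' and b'\x41\x02\x03'
```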
According to embodiments of the invention, the system further comprises an ordering unit 330 that may be part of the media packager or not. The ordering unit aims at defining the order of the samples so as to optimize the transmission of a fragment. Such an order may be defined automatically, for example based on a priority level associated with each sample, that may correspond to the decoding order.
It is to be noted that the media server is optional in the sense that embodiments of the invention mainly deal with the description of encapsulated media files in order to provide information about data significance of encapsulated media data of the encapsulated media file, so that the encapsulated media data may be handled appropriately when they are transmitted and/or when they are received. As for the media server, the transmission part (HTTP module and manifest parser) is optional in the sense that embodiments of the invention also apply for a media client consisting of a simple media player to which the encapsulated media file with its description is provided for rendering. The media file can be provided by full download, by progressive download, by adaptive streaming or just by reading the media file on a disk or from a memory.
According to embodiments, ordering of the samples can be done by a media packager such as media packager module 310 in Figure 3 and more specifically by ISOBMFF writer module 313 in cooperation with ordering unit 330, comprising software code, when executed by a microprocessor such as CPU 804 of the server apparatus illustrated in Figure 8.
Typically, the encapsulation module is in charge of reading the high-level syntax of an encoded timed media data bit-stream, e.g. composed of compressed video, audio or metadata, to extract and identify the different elementary units of the bit-stream (e.g. NALUs from a video bit-stream), and organizes the encoded data in an ISOBMFF file or ISOBMFF segments 322 containing the encoded video bit-stream as one or more tracks, wherein the samples are ordered properly, with descriptive metadata according to the ISOBMFF box hierarchy. Another example of an encapsulation format is the Partial File Format as defined in ISO/IEC 23001-14.
Signaling sample reordering using the ‘trun’ box
As described above, there exist several types of ‘trun’ boxes, among which the standard version, the compact version, and the ‘trun’ box using patterns. While the signaling of the sample reordering depends on the type of ‘trun’ box used, the reordering of the samples itself, by the server or the client, is the same for all types of ‘trun’ box.
Figure 4 illustrates an example of processing carried out in media server 400 and in media client 450, according to embodiments.
As illustrated, a video stream is encoded in a video encoder (step 405) that may be similar to media encoder 300 in Figure 3. The encoded video stream is provided to a packager, that may be similar to ISOBMFF writer 313 in Figure 3, to be encapsulated into a media file or into media segments (step 410). As illustrated, the encapsulating step comprises a reordering step (step 415) during which the samples are reordered according to the needs of an application. The encapsulated stream, wherein the samples are ordered according to the second sample order, may be stored in a repository or in a server for later or live transmission (step 420), together with descriptive metadata allowing reorganization of the samples according to the first sample order. The transmission may use reliable protocols like HTTP or unreliable protocols like QUIC or RTP. The transmission may be segment-based or chunk-based depending on the latency requirements.
The encoder and the packager may be implemented in the same device or in different devices. They may operate in real-time or with a low delay.
According to particular embodiments, the packager re-encapsulates an already encapsulated file to change the encapsulation order of the samples, so as to fit with application needs or re-encapsulates the samples just before their transmission over a communication network.
Encapsulating step 410, comprising reordering step 415, aims at placing the encoded video stream into a data part of the file and at generating descriptive metadata providing information on the track(s) as well as a description of the samples. The video stream may be encapsulated with other media resources or metadata, following the same principle: sample data is put in the data part (e.g. ‘mdat’ box) of the media file or media segment and descriptive metadata (e.g. ‘moov’, ‘trak’ or ‘moof’, ‘traf’) are generated to describe how the sample data are actually organized within the data part.
Reordering step 415 reorders samples that are received in a first order, for example the order defined by the encoder, and reorganizes these samples according to a second order that is more convenient. For the sake of illustration, such convenience may be directed to the storage (wherein all the Intra frames are stored together, for example), the transmission (all the reference frames or all the base layers are transmitted first), or the processing (e.g. encryption or forward error correction) of these samples. For example, the second order may be defined according to a priority level or an importance order. Still for example, encapsulating step 410 may receive ordering information or a priority map from a user controlling the encapsulation process through a user interface, for example via the ordering unit 330 in Figure 3. Such a priority map or ordering information may also be obtained from a video analytics module running on server 400 and analyzing the video stream. It may determine the relative importance of the video frames by carrying out a deep analysis of the video stream (e.g. by using NALU parser 311 in Figure 3) or by inspecting the high-level syntax coming with the encoded video stream. It may determine the relative importance between the video frames because it is aware of the encoding parameters and configurations.
When the encapsulating step is directed to re-encapsulating media files previously encapsulated according to a first sample order, the packager may use a priority map accompanying this media file or even a priority map embedded within the media file, for example as a dedicated box. Typically, a priority map provides relative importance information on the video samples or on the corresponding byte ranges of these samples. Alternatively, the packager may use information in the sample_flags parameter, when present in the sample description, to obtain information on dependencies (e.g. sample_is_depended_on) and degradation priority. For example, when sample_is_depended_on is equal to 1 or when sample_is_non_sync_sample is equal to 0, the sample is considered as having high priority and may be stored at the beginning of the media data for the fragment. Alternatively, it may use information from the SampleDependencyTypeBox or DegradationPriorityBox. For example, when sample_has_redundancy is equal to 1, the sample is considered to have low priority and may rather be stored at the end of the media data for the fragment. Once the samples are reordered according to priorities, the sample_flags may be removed from the sample description to compact the fragment description.
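The priority policy described in this example may be sketched as follows (the helper and the sample tuples are illustrative and simply restate the example; the field names are the ISOBMFF sample_flags fields):

```python
# Illustrative sketch of the priority policy described above, based on
# ISOBMFF sample_flags fields (the mapping itself is the example's
# policy, not a normative rule).

def sample_priority(sample_is_depended_on, sample_is_non_sync_sample,
                    sample_has_redundancy):
    if sample_has_redundancy == 1:
        return 2  # low priority: rather stored at the end of the fragment
    if sample_is_depended_on == 1 or sample_is_non_sync_sample == 0:
        return 0  # high priority: stored at the beginning of the fragment
    return 1      # default priority

# (name, sample_is_depended_on, sample_is_non_sync_sample, sample_has_redundancy)
samples = [("B3", 0, 1, 1), ("I1", 1, 0, 0), ("B2", 0, 1, 0)]
ordered = sorted(samples, key=lambda s: sample_priority(*s[1:]))
print([name for name, *_ in ordered])  # ['I1', 'B2', 'B3']
```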
An example of reordering steps is described in reference to Figure 6a.
According to this example, the packager inserts ordering information within the descriptive metadata describing the samples in terms of byte position (data_offset) and length (sample_size), duration, and composition time (for example in a ‘trun’ box). According to embodiments, the ordering information comprises an index of the samples in the data part, according to the second sample order. An example of reordering is described by reference to Figure 5b.
Conversely, client 450 reads or receives an encapsulated stream (step 455), wherein the samples are ordered according to the second sample order. The encapsulated stream is parsed (step 460), the parsing (or decapsulation) step comprising a reordering step (step 465) to reorganize the samples according to the first sample order so that the de-encapsulated stream can be decoded (step 470).
Figure 5a illustrates an example of dependencies of video frames, that are to be taken into account for coding or decoding a frame.
For the sake of illustration, each video frame is represented with a letter and one or more digits, where the letter represents the frame coding type and the digits represent the composition time of the video frame. This frame organization is the classical B-hierarchical scheme from MPEG video compression codecs like HEVC. It is to be noted that it may be used for different types of I/P/B frames and prediction patterns. The arrows between two video frames indicate that the frame at the start of the arrow is used to predict the frame at the end of the arrow. For example, frame B3 depends on frames B2 and B4, frame B2 depends on frame I0 and frame B4, and frame B4 depends on frames I0 and P8. Accordingly, frame B3 can be decoded only after frames I0, P8, B4, and B2, it being noted that frame B4 can be decoded only after frames I0 and P8 have been decoded and frame P8 can be decoded only after frame I0.
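A decodable sample order is any topological order of this dependency graph. The sketch below derives one; dependencies other than those stated for B2, B3 and B4 are assumed from the classical B-hierarchical scheme.

```python
# Sketch: derive a decodable (decoding) order as a topological order of
# the Figure 5a dependency graph. Edges beyond those stated in the text
# for B2, B3 and B4 are assumed from the classical B-hierarchical scheme.
from graphlib import TopologicalSorter

deps = {  # frame -> frames it depends on
    "P8": {"I0"},
    "B4": {"I0", "P8"},
    "B2": {"I0", "B4"}, "B6": {"B4", "P8"},
    "B1": {"I0", "B2"}, "B3": {"B2", "B4"},
    "B5": {"B4", "B6"}, "B7": {"B6", "P8"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # starts with 'I0', then 'P8', then 'B4', ...
```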
Figure 5a also illustrates the layers or priority levels of each frame.
Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps. According to this example, the samples correspond to the frames illustrated in Figure 5a. For the sake of clarity, the samples are represented using the reference of the frames but it is to be understood that the samples in the streams are a sequence of bits, without explicit references to the frames.
As is apparent from video stream 500, the samples are ordered according to the position of the frames in the video stream. For example, the sample corresponding to frame I0 is located before the sample corresponding to frame B1, which is located before the sample corresponding to frame B2, because frame I0 should be displayed before frame B1, which should be displayed before frame B2, and so on.
In order to optimize coding and decoding, the order of the encoded frames preferably depends on the dependencies of the frames. For example, the sample corresponding to frame B4 should be received before the sample corresponding to frame B2, although frame B4 is displayed after frame B2 because frame B4 is needed to decode frame B2.
Therefore, the samples corresponding to the encoded frames are preferably ordered as a function of the dependencies of the frames, as illustrated with reference 505 that represents the encoded video streams.
Usually, the decoding order corresponds to the sample organization in encapsulated files or segments (for example in CMAF or ISOBMFF). This sample order provides a compliant bit-stream for video decoders. Changing the sample order without any indication in the descriptive metadata may lead to non-compliant bit-streams after parsing and may crash video decoders.
As described above, there exist cases for which the order of the encoded samples is advantageously modified, for example to make it possible to send the most important samples first. An example of such sample reordering is illustrated with reference 510 in Figure 5b. In this example, the samples are reordered according to the layers or priority levels (the frames corresponding to the first layer or to the first priority level are transmitted first, then the frames corresponding to the second layer or to the second priority level are transmitted and so on).
When encoded samples are reordered, the parser must be aware of the modified encoded stream so as to make sure that the output of the parser is a bit-stream compliant with the decoder.
According to embodiments, the indication of the order change, also referred to as interleaving (or reordering) information, is included in the descriptive metadata of the encapsulated file or segment. As illustrated with reference 515, such an indication may comprise a list of indexes of the positions of the samples in the encapsulated stream (‘mdat’ box). For example, as illustrated with reference 520, the parser may determine that the 10th sample of the encoded stream corresponds to the 3rd sample in the encapsulated stream (that is to say to location ‘2’ if locations are indexed from 0). Therefore, by using the list of indexes, the parser may reconstruct the encoded stream from the encapsulated stream.
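The parser-side use of such a list of indexes can be sketched as follows (sample names and index values are illustrative): entry i gives the position, in the ‘mdat’ payload, of the i-th sample in decoding order.

```python
# Illustrative sketch: rebuild the decoding-order stream from the
# encapsulated ('mdat') order using a list of indexes in which entry i
# is the 'mdat' position (indexed from 0) of the i-th decoding sample.

def rebuild_decoding_order(encapsulated_samples, index_list):
    return [encapsulated_samples[idx] for idx in index_list]

mdat = ["I1", "P25", "B5", "B9", "B3"]   # encapsulation (priority) order
index_list = [0, 1, 2, 4, 3]             # decoding position -> 'mdat' index
print(rebuild_decoding_order(mdat, index_list))
# ['I1', 'P25', 'B5', 'B3', 'B9']: a decodable order again
```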
According to other embodiments, the indexes of the list of indexes correspond to the location (the order) of the samples of the encapsulated stream in the encoded stream.
In these other embodiments, the sample description follows the order of the encapsulated stream 510, which requires the parser to reorder the sample data before transmission to the decoder. This reordering uses the list of indexes providing the location (order) of the samples in the encoded stream. The packager may include an indication at file level (e.g. in the ‘ftyp’ box) so that parsers are able to identify which order (encapsulation or encoding) indexes are provided in the list of indexes 515. The identification may be, for example, a brand or a compatible brand. Such a brand indicates reordered samples and the actual order used by the packager. For example, a brand ‘reor’ (this four-character code is just an example; any reserved code not conflicting with other four-character codes already in use can be used) indicates that samples are reordered. For example, a brand ‘reo1’ indicates presence of reordering as in 515 (from decoding order to encapsulation order) while, still for example, a brand ‘reo2’ indicates presence of reordering from encapsulation order 510 to decoding order 505. This indication that encapsulation has been done with reordering may alternatively be included in the box indicating presence of movie fragments (e.g. the ‘mvex’ or ‘mehd’ box). For example, it may be an additional field (for example reordering_type) in new versions of the ‘mehd’ box:

aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
	if (version==3) {
		unsigned int(64) fragment_duration;
		unsigned int(8) reordering_type;
	} else if (version==2) {
		unsigned int(32) fragment_duration;
		unsigned int(8) reordering_type;
	} else if (version==1) {
		unsigned int(64) fragment_duration;
	} else { // version==0
		unsigned int(32) fragment_duration;
	}
}
A reordering_type value set to 0 indicates that there is no reordering. When it is set to 1, there is reordering with a list of indexes providing a mapping from the decoding order 505 to the encapsulation order 510 (as in 515), and when it is set to 2, there is reordering with a list of indexes providing a mapping from the encapsulation order 510 to the decoding order 505. Other values are reserved.
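A non-normative sketch of parsing the payload of this extended ‘mehd’ box (the bytes following the box size/type header) could be:

```python
# Illustrative parser for the extended 'mehd' payload defined above:
# FullBox version (1 byte) + flags (3 bytes), then fragment_duration
# (32 or 64 bits depending on version) and, for versions 2 and 3,
# reordering_type.
import struct

def parse_mehd_payload(payload: bytes):
    version = payload[0]
    offset = 4  # skip version byte and 24-bit flags
    if version in (1, 3):
        (fragment_duration,) = struct.unpack_from(">Q", payload, offset)
        offset += 8
    else:
        (fragment_duration,) = struct.unpack_from(">I", payload, offset)
        offset += 4
    reordering_type = payload[offset] if version in (2, 3) else 0
    return version, fragment_duration, reordering_type

payload = bytes([2, 0, 0, 0]) + struct.pack(">I", 50000) + bytes([1])
print(parse_mehd_payload(payload))  # (2, 50000, 1): reordering as in 515
```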
The packager may decide to reorder only some fragments. In other words, an encapsulated media file or segment may contain fragments with reordering and fragments without reordering.
According to embodiments, the list of indexes is stored in a ‘trun’ box (e.g. a standard ‘trun’ box, a compact ‘trun’ box, or a ‘trun’ box relying on patterns) when the media file is fragmented.
To extract a sample from the encapsulated stream, the offset of the first byte for this sample in the ‘mdat’ box is computed from the corresponding index (by multiplying the value of the index by the size of a sample when sample_size is constant across all samples of the run). When it is not constant across the samples, the size of the sample is provided in the‘trun’ box through the sample_size field. The sample duration and sample composition time offset may also be provided.
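The offset computation just described can be sketched as follows (illustrative helper; sizes in bytes):

```python
# Illustrative sketch of locating a sample in 'mdat' from its index:
# with a constant sample_size the offset is index * size; with variable
# sizes (sample_size fields of the 'trun' box) a cumulative sum is used.

def sample_offset(index, sample_sizes=None, default_size=None):
    if sample_sizes is None:           # constant size across the run
        return index * default_size
    return sum(sample_sizes[:index])   # variable sizes: prefix sum

print(sample_offset(3, default_size=100))                  # 300
print(sample_offset(3, sample_sizes=[120, 80, 95, 200]))   # 295
```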
Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream. These steps may be carried out in a server, for example in a writer or a packager.
As illustrated, a first step is directed to initializing the packaging module (step 600). For example, the packaging may be set to a particular reordering configuration by a user or through parameter settings during this step. An item of information provided during this step is whether the encapsulation is to be done within fragments or not. When the encapsulation is to be done within fragments, the fragment duration is obtained. These items of information are useful for the packaging module to prepare the box structure, comprising or not the ‘moof’ and ‘traf’ boxes. When samples are reordered, the packaging module sets the dedicated flags value in the ‘trun’ box to the appropriate value, as described hereafter.
During initialization, the packaging module receives the number of priority levels to consider and may allocate one sample list per level, except for the first level (for which samples will be directly written in the‘mdat’ box). Such an allocation may be made, for example, in memory 312 (in Figure 3) of the server. The packaging module then initializes two indexes denoted index_first and index_second, respectively providing the sample index in the first order and the sample index in the second order. According to the given example, both indexes are initialized to zero. A mapping table (as illustrated in Figure 6b) for reordering the samples is also allocated with its size corresponding to the number of samples expected per fragment (e.g. the sample_count related to the fragment duration) multiplied by one plus the number of levels.
As illustrated in Figure 6b, mapping table 690 contains one table per level (denoted 692, 693, 694) and a mapping list (denoted 691) indexed by the sample index in the first order and providing access to the appropriate table per level (reference 692, 693, or 694), given the sample level (e.g. the i-th sample with level l is stored in mapping_table[l][i]). Each cell of a table per level can store the sample description (e.g. reference 695) and the sample data (e.g. reference 696). It is to be noted that, to allocate less memory, mapping table 690 may store the sample descriptions in mapping list 691 in addition to the level indication (instead of in one of the 695 cells).
Next, the samples are processed one after another. To that end, the packaging module reads the next sample to be processed (step 605), corresponding, at the beginning, to the first sample of the encoded stream.
The priority level of the read sample is then obtained and index index_first is incremented by one (step 610). The corresponding list (692, 693, or 694) in mapping_table[sample_level] is then selected at step 615. The sample description is computed and stored in the mapping table (one of 695) at step 620. For example, the sample description may be, depending on the values of the ‘trun’ box’s flags, “tr_flags”, the position of the first byte for the sample, the duration of the sample, the size of the sample, or the value of index index_second. A composition_time_offset may also be provided. The index index_second is initially set to zero. The bytes corresponding to the sample data are also stored in the mapping table (one of 696) at step 625. Description and data are stored at mapping_table[sample_level][index_first]. Then index_first is incremented by one (step 630).
This process iterates, by reading another sample from the encoded stream, as long as the end of the fragment is not reached (i.e. the result of test 640 is false). When the end of the fragment is reached (i.e. the result of test 640 is true), the packager flushes the data buffered in mapping table 690. This mapping table is used for reordering samples. At this stage, index index_second for the stored samples is not known and is still set to zero. The packaging module starts flushing the data buffered during steps 620 and 625 (step 645), from a lower level (high priority or most important samples) to a higher level (low priority or less important samples), as follows: list table 691 is read and only the samples pertaining to the current level are candidates for flushing (step 650). One after the other, their sample data (read from one of 696) are written into the data part of the media file or segment, i.e. the ‘mdat’ box, and their corresponding sample description (read from one of 695) is written into the descriptive metadata part (step 655), for example in the ‘trun’ box or in any box describing the run of samples.
According to embodiments, the sample description contains, in addition to the usual parameters (e.g. sample_duration, sample_size, etc.), the reordering information (or interleaving index) that is set to the current value of index index_second maintained by the packaging module (actually the position of the last written sample in the data part plus one). The interleaving index shall be unique within the ‘trun’ box. Each time a sample is flushed (step 655), the index index_second is incremented by one. The packager iterates over the levels (test 660).
At the end, the ‘trun’ box contains the reordering information for the samples of the fragment.
When all buffered samples for all levels have been flushed, the packaging module finalizes the encapsulation of the fragment that is stored or transmitted, depending on the application.
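The buffering and flush of steps 600 to 660 can be modelled as follows (a simplified Python sketch under the assumption that samples are buffered per priority level and flushed from level 0 upward; the function and variable names are illustrative, not part of the described file format):

```python
def reorder_samples(samples):
    """samples: list of (level, data) tuples in decoding order.

    Returns (ordered_data, interleave_indexes) where ordered_data is the
    sample data in encapsulation order (as written to 'mdat') and
    interleave_indexes[i] is the storage position of the i-th sample in
    decoding order (the value written as its interleaving index)."""
    n_levels = max(level for level, _ in samples) + 1
    # one buffered list per priority level, filled in decoding order
    per_level = [[] for _ in range(n_levels)]
    for first_index, (level, data) in enumerate(samples):
        per_level[level].append((first_index, data))
    ordered_data = []
    interleave = [0] * len(samples)
    second_index = 0
    # flush from the most important level (0) to the least important one
    for level_samples in per_level:
        for first_index, data in level_samples:
            ordered_data.append(data)
            interleave[first_index] = second_index
            second_index += 1
    return ordered_data, interleave
```

With four samples at levels 0, 2, 1, 2 (in decoding order), the most important samples end up first in the data part while the interleave list records where each decoding-order sample was stored.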
Examples of the ‘trun’ box are provided after the description of the parsing process (Figure 7).
Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream into an encoded video stream. These steps are carried out in a client, at the reader end, for example in a parser. As illustrated, a first step aims at receiving one encapsulated fragment (step 700). According to embodiments, the parser reads the descriptive metadata part (step 705), for example from the ‘moof’ box and its sub-boxes (e.g. ‘traf’ and ‘trun’ boxes). Using this information, the parser determines whether reordering has to be performed or not (step 710). This may be done, for example, by checking the “tr_flags”. If no reordering has to be done, the run of samples can be extracted from the ‘mdat’ box (step 715), sample after sample, from the first byte offset to the last byte, on a standard basis. The last byte position is computed by cumulating the sample sizes from the first sample to the last sample of the fragment, as indicated by “sample_count” in the ‘trun’ box.
On the contrary, when samples are to be reordered, the sample description is read (step 720) so as to read each sample index in the expected output order (step 725). Accordingly, reordering information is read by reading the sample reordering information inserted in the ‘trun’ box (for example in a standard ‘trun’ box, in a compact ‘trun’ box, or in a ‘trun’ box relying on patterns, as explained below). Indeed, as described by reference to Figure 4 (in particular from reference 410) and according to particular embodiments, the encapsulation step assigns an interleaving index k, in the range [0, trun::sample_count - 1], to each sample. The sample sizes read are available through a “sample_size[]” variable, where the i-th entry gives the size of the i-th sample in decoding order in the trun (first sample stored at i = 0). The trun has its data offset set on a standard basis (e.g. the beginning of the ‘moof’ box, or the default_data_offset potentially present in the track fragment header box, or the data_offset provided in the trun) and the sample_data_offset, relative to this data offset, for the current sample with interleave index k is computed (step 730) as follows:
sample_data_offset(k) = Σ sample_size[i], the sum being taken over all samples i of the run such that their interleaving index is strictly less than k,
where the interleaving index gives, for the i-th sample in the first order (e.g. decoding order), the position of this same sample in the second order (encapsulation order). These indexes are computed sample after sample by the parser during steps 725 and 730.
From the so-computed sample data offset, the parser can extract the number of bytes corresponding to sample_size and provide it to the decoder.
As illustrated, the parser iterates over all the samples (step 735) until the end of the fragment. The process continues until there is no more fragment to process. When a fragment has been processed, the extracted video stream can be processed by the video decoder.
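The parser-side computation of steps 725 to 730 can be sketched as follows (illustrative Python; as in the description above, sample_size[i] and the interleaving index of sample i are both given in decoding order, and the offsets are relative to the trun data offset):

```python
def sample_data_offset(k, interleave_index, sample_size):
    """Offset of the sample whose interleave index is k: the sum of the
    sizes of all samples of the run whose interleave index is strictly
    less than k."""
    return sum(size for idx, size in zip(interleave_index, sample_size)
               if idx < k)

def extract_in_decoding_order(mdat, interleave_index, sample_size):
    """Rebuild the decoding-order byte stream from reordered 'mdat' data."""
    out = []
    for i in range(len(sample_size)):  # i runs in decoding order
        off = sample_data_offset(interleave_index[i],
                                 interleave_index, sample_size)
        out.append(mdat[off:off + sample_size[i]])
    return b''.join(out)
```

From the so-computed offset, each sample's sample_size bytes can be extracted and delivered to the decoder in decoding order.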
Signaling sample reordering using a standard ‘trun’ box
In particular embodiments, the reordered fragments are described within the standard ‘trun’ box. To indicate reordered samples, the standard ‘trun’ box is modified as follows. First, a new flag value for the ‘trun’ box is defined. For example, the value 0x100000 is reserved with the name SAMPLE_INTERLEAVE_BIT to indicate that sample data are stored in a different order than the decoding order. It is to be noted that the flag values here are provided as examples; any reserved value and name may be used provided that it does not conflict with other reserved flag values for the ‘trun’ box. In addition to the new flags value, the trun contains an additional parameter. This can be handled as a new version of the ‘trun’ box, as follows:
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
    unsigned int(32) sample_count;
    // the following are optional fields
    signed int(32) data_offset;
    unsigned int(32) first_sample_flags;
    // all fields in the following array are optional
    // as indicated by bits set in the tr_flags
    // in particular the indication for reordering
    {
        unsigned int(32) sample_duration;
        unsigned int(32) sample_size;
        unsigned int(32) sample_flags;
        if (version == 0)
            { unsigned int(32) sample_composition_time_offset; }
        else
            { signed int(32) sample_composition_time_offset; }
        // presence depending on tr_flags value
        if (version == 2) {
            unsigned int(32) sample_interleave_index;
        }
    }[ sample_count ]
}
where sample_interleave_index indicates the order of sample interleaving in the ‘trun’ box. A value of 0 indicates that the sample data start at the trun data offset. A value of K>0 indicates that the sample data start at the trun data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same ‘trun’ box.
The semantics of the other parameters of the ‘trun’ box remain unchanged.
Signaling sample reordering using a compact, or optimized, ‘trun’ box
In particular embodiments, the reordered fragments are described with the compact ‘trun’ box. To indicate reordered samples, the compact ‘trun’ box is modified as follows:
aligned(8) class CompactTrackRunBox
    extends FullBox('ctrn', version, tr_flags) {
    // all index fields take value 0, 1, 2, 3 indicating 0, 1, 2, 4 bytes
    unsigned int(2) duration_size_index;
    unsigned int(2) sample_size_index;
    unsigned int(2) flags_size_index;
    unsigned int(2) composition_size_index;
    // sample interleaving takes value 0, 1, 2, 3 indicating 0, 1, 2, 4 bytes
    // a value of 0 means no interleaving
    unsigned int(2) interleave_index_size;
    unsigned int(30) sample_count;
    // the following are optional fields
    if (data_offset_present)
        signed int(32) data_offset;
    if (first_sample_info_present) {
        unsigned int(32) first_sample_size;
        unsigned int(32) first_sample_flags;
    }
    // all the following arrays are effectively optional
    // as the field sizes can be zero
    unsigned int(f(duration_size_index))
        sample_duration[ sample_count ];
    unsigned int(f(sample_size_index))
        sample_size[ sample_count - (first_sample_info_present ? 1 : 0) ];
    unsigned int(f(flags_size_index))
        sample_flags[ sample_count - (first_sample_info_present ? 1 : 0) ];
    if (version == 0)
        { unsigned int(f(composition_size_index))
            sample_composition_time_offset[ sample_count ]; }
    else
        { signed int(f(composition_size_index))
            sample_composition_time_offset[ sample_count ]; }
    if (interleave_index_size) {
        unsigned int(f(interleave_index_size))
            sample_interleave_index[ sample_count ];
    }
}
where sample_interleave_index indicates the order of sample interleaving in the trun. A value of 0 indicates that the sample data start at the trun data offset. A value of K>0 indicates that the sample data start at the trun data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same trun. The semantics of the other parameters of the compact ‘trun’ box remain unchanged.
Accordingly, the sample_interleave_index provides, for each sample in the run of samples, the index of the position of this sample in the media data part of the file or the segment (e.g. in the ‘mdat’ box). It is to be noted that, for the sake of compaction, the 32-bit sample count can be moved into a 30-bit field. In embodiment variants, the sample count is still encoded on 32 bits and 8 more bits can be allocated to store the interleave_index_size in the proposed box. In the compact ‘trun’ box, when the packager sets the value of interleave_index_size to 0, the parser may interpret it as meaning there is no need to apply a reordering step. When the value set by the packager is different from 0, the parser has to read it from the descriptive metadata and to apply a reordering step to make sure to produce a compliant bit stream for the video decoder.
In a variant, for compaction of the sample_interleave_index, the difference between the sample positions in the first and second orders is provided in the sample_interleave_index field, instead of providing the sample offset in the second order. For the sake of illustration, a delta from the decoding order may be used instead of a sample index.
Considering the following original GOP (Group Of Pictures) layout and the associated decoding order
[original GOP layout and decoding order: garbled in this text extraction; the original application lists the 25 pictures (I, P, and B) with their composition order, temporal layer, and decoding order]
the final layout after reordering (wherein, in Bc,t,d,s, c is the composition order, t is the temporal layer, d is the decoding order, and s is the storage order in the reordered trun) may be:
[reordered layout reproduced as image imgf000031_0001 in the original application]
Accordingly, the difference in decoding order is:
0 0 0 3 6 9 12 -4 -1 -1 2 5 8 10 -9 -9 -7 -5 -5 -3 -3 -1 -1 0 0
This mode may be advantageous for small reorderings, i.e. when sample positions in the second order (i.e. storage order) are not too far from the first order (i.e. the decoding order). It may not be suitable when samples from the end of the GOP are reordered at the beginning of the run, or when samples at the beginning of the GOP are reordered at the end of the run.
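This delta variant can be sketched as an encode/decode pair (illustrative Python, not normative syntax; the function names are hypothetical):

```python
def to_deltas(interleave_index):
    """Replace each sample's interleave index (storage position) by its
    signed difference from the sample's decoding position, which stays
    small when the two orders are close."""
    return [storage - decoding
            for decoding, storage in enumerate(interleave_index)]

def from_deltas(deltas):
    """Recover the interleave indexes from the stored deltas."""
    return [decoding + delta for decoding, delta in enumerate(deltas)]
```

For a run where samples are nearly in decoding order, the deltas cluster around zero and fit in fewer bits than absolute indexes.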
In another variant, when the reordering leads to repetitive patterns, the samplejnterleavejndex may be coded as a pattern identifier, as done in compact sample groups.
Considering the following GOP pattern wherein each temporal layer is described in its own track run
TRUN I1 0 P25 0 B5 1 TRUN B3 2 TRUN B2 3 B4 3 TRUN B9 1 TRUN B7 2 TRUN B6 3 B8 3 TRUN ... TRUN B23 2 TRUN B22 3 B24 3 [the middle of the layout, where each sample Bd t is noted with its decoding order d and temporal layer t, is reproduced as images imgf000031_0002 and imgf000031_0003 in the original application]
a pattern description may be the following:
pattern 1 : {0} 2 samples
pattern 2: {1 2 3 3} 20 samples
pattern 3: {2 3 3} 3 samples
and the 8 index values.
For a non byte-aligned version, with pattern coded on two bits and samples coded on six bits, the patterns may be encoded with 36 bits (2+6 + 4*2+6 + 3*2+6) and the values may be encoded with 64 bits (8x8), hence 13 bytes total.
For a byte-aligned version, with patterns coded on 8 bits and samples also coded on 8 bits, the patterns may be encoded with 88 bits (8+8 + 4*8+8 + 3*8+8) and the values may be encoded with 64 bits (8x8), hence 19 bytes total.
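A parser receiving such a pattern description could expand it into per-sample values as follows (an illustrative Python sketch assuming each pattern repeats cyclically over its declared sample count, as the three-pattern example above suggests):

```python
def expand_patterns(runs):
    """runs: list of (pattern, sample_count) pairs, e.g.
    [([0], 2), ([1, 2, 3, 3], 20), ([2, 3, 3], 3)] for the example above.
    Each pattern is repeated cyclically until sample_count values have
    been produced for that run."""
    values = []
    for pattern, count in runs:
        values.extend(pattern[i % len(pattern)] for i in range(count))
    return values
```

Applied to the example, this yields the 25 per-sample values of the GOP from only three short patterns and their counts.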
The longer the GOPs, the more efficient the pattern approach is, since patterns of temporal levels are expected to repeat during the GOP. This requires the encoding of the pattern list within the reordered track run box (compact or not). Alternatively, the pattern list may be declared in a companion box of the track run box. This allows mutualization of the pattern declarations, across segments of different tracks with reordered fragments or across segments of a single track with reordered fragments.
Still in an alternative embodiment, all the information describing the reordering comes in a new box associated with the trun box.
A variant for the indication of the number of bits in use for the sample_count and the sample_interleave_index consists in using a single flag value to indicate whether reordering is active or not. Then, the number of bits in use for the sample_interleave_index is the same as the one in use for the sample_count field. For the latter, considering that optimized or compact trun boxes are dedicated to short fragments, using 16 bits instead of 32 bits is more relevant. For some very short fragments, sample_count could even be encoded on 8 bits. Then, another flag value is dedicated to indicating whether the sample_count in a compact trun box is encoded on 16 or 8 bits.
Signaling sample reordering using a ‘trun’ box relying on patterns
Since most encoders provide a regular temporal GOP structure, the inventors realized that when the sample description uses a ‘trun’ box relying on patterns, the interleaving information could be added at no cost in such a ‘trun’ box. This is achieved by modifying the proposed TrackRunPatternStruct and adding a new flag to the TrackRunPatternBox. For example, the box containing the declaration of the patterns defines the following flag value:
0x100000 SAMPLE_INTERLEAVE_BIT: when set, it indicates that sample data are stored in a different order than the decoding order. It is to be noted that the flag values here are provided as examples; any reserved value and name may be used provided that it does not conflict with other reserved flag values for the ‘trun’ box.
The definition of a given pattern according to embodiments of the invention is modified as follows to support the reordering of fragments:
aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags,
                                       numBitsSampleCount)
{
    for (i = 0; i < numSamples; i++) {
        if (boxFlags & SAMPLE_DURATION_PRESENT)
            unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
        if (boxFlags & SAMPLE_FLAGS_PRESENT)
            unsigned int(32) sample_flags[patIdx][i];
        if (boxFlags & SAMPLE_CT_OFFSETS_PRESENT) {
            if (version == 0)
                signed int(numBitsCTOffset)
                    sample_composition_time_offset[patIdx][i];
            else
                unsigned int(numBitsCTOffset)
                    sample_composition_time_offset[patIdx][i];
        }
        if (boxFlags & SAMPLE_INTERLEAVE_BIT)
            unsigned int(numBitsSampleCount) sample_interleave_index[patIdx][i];
    }
    if (boxFlags & SAMPLE_SIZE_PRESENT) {
        for (i = 0; i < numSamples; i++) {
            unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
            numBitsSampleSize[patIdx][i] =
                (num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
        }
        if (numSamples % 2)
            bit(4) reserved = 0;
    }
}
where sample_interleave_index indicates the order of sample interleaving in the ‘trun’ box. A value of 0 indicates that the sample data start at the ‘trun’ data offset. A value of K>0 indicates that the sample data start at the ‘trun’ data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same ‘trun’ box.
This design has the advantage of not using any bit in the ‘trun’ box relying on patterns to indicate the sample interleaving or reordering. For dynamic use cases (GOP structure varying), either a new template could be updated or the compact ‘trun’ box could be used.
The semantics of the TrackRunPatternStruct are unchanged.
Optimization of the ‘trun’ box
Given the repetition of fragments in movie files and the fact that they are getting shorter (typically half a second of video in CMAF), the description cost of fragments may become significant. Initial proposals to optimize the ‘trun’ box are not so exhaustive. The inventors have observed that there are fields that may be compressed even more, whatever the type of ‘trun’ box (e.g., standard ‘trun’ box, compact ‘trun’ box, or ‘trun’ box relying on patterns). According to embodiments, the following fields may be optimized:
- data_offset: the number of bits for the description of the data_offset for a run of samples. According to embodiments, it is coded on a number of bits lower than the 32 bits used in the standard ‘trun’ box. Indeed, the inventors have observed that both the pattern ‘trun’ box and the compact ‘trun’ box designs still use 32 bits for the data offset, which is very large in DASH/CMAF cases since the base offset is the start of the ‘moof’ box. In most cases, 16 bits is more than enough. Therefore, according to embodiments, the possibility is given to use a smaller field, signaled through a new ‘trun’ flags value. For example, the value 0x100000 may be defined and reserved for a flag value denoted “DATA_OFFSET_16” that indicates, when set, that the data offset field is coded on a short number of bits (e.g., 16 bits). This flags value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags value of the TrackFragmentHeaderBox is not set;
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes),
wherein a metadata item (e.g., the trun box) comprises a flag (e.g., DATA_OFFSET_16) indicating whether a data offset is coded on a predetermined size or not; and,
encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
- sample_count: the number of samples for a run of samples. According to embodiments, it is coded on a smaller number of bits since the fragments, and thus the runs of samples, are becoming smaller (fewer samples) for low-latency purposes. This is for example the case of CMAF fragments that can correspond to 0.5 second of video, hence “only” 15 samples for a video at 30 Hertz or 30 samples for a video at 60 Hertz. Using 32 bits is not so efficient since, in most cases, a lower number of bits such as 8 or 16 bits would be sufficient. According to embodiments, the sample count field is of variable or configurable size. Moreover, the sample count from one fragment to another may remain the same. This may be the case when the GOP structure used by the video encoder is constant over time. In such a case, a default sample count can be defined and overloaded when necessary. Since this item of information may be useful for the whole file, it is set by the encapsulation or packaging module in a box of the initialization segment. It can be for example inserted in a new version of the TrackExtendsBox ‘trex’ or in any box dedicated to the storage of default values used by the movie fragments:
aligned(8) class TrackExtendsBox extends FullBox('trex', version, 0) {
    unsigned int(32) track_ID;
    unsigned int(32) default_sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
    if (version == 1) {
        // complete list of default values
        unsigned int(32) default_sample_count;
        unsigned int(32) default_data_offset;
        unsigned int(32) default_first_sample_size;
        unsigned int(32) default_first_sample_flags;
        unsigned int(32) default_composition_time_offset;
    }
}
where the default parameters have the same semantics as in the ‘trun’ box. It is to be noted that when there is no handling of the first sample in the trun, the default_first_sample_size and default_first_sample_flags parameters could be omitted. To be able to distinguish between the two modes (first sample inside the loop on samples or first sample outside the loop), two versions of the new ‘trex’ box may be used. In such a case, version = 1 defines the following list of default values: default_sample_count, default_data_offset, and default_composition_time_offset; when version >= 2, the additional parameters to handle the first sample are provided: default_first_sample_size and default_first_sample_flags.
This new ‘trex’ box may be used with the compact ‘trun’ box or the ‘trun’ box relying on patterns.
The presence or absence of sample_count information in the description of a run of samples may be indicated by a dedicated flags value of the ‘trun’ box.
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the trun pattern box) comprising a configurable parameter having a configurable size,
wherein the metadata comprises indication information (e.g., SAMPLE_COUNT_PRESENT) indicating whether a sample count field is present or not; and,
encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
Specific embodiments depending on the kind of the‘trun’ box (standard, compact, or relying on patterns) are described below.
The following embodiments can be implemented using different versions of the‘trun’ box or any equivalent box to describe a run of samples.
In a variant, the box structures describing the fragments, especially the run of samples, do not comprise flag values indicating the presence or absence of some parameters. Instead, an exhaustive list of default values is defined for any parameter describing the run of samples in a fragment. It can be done at the file level (e.g. the ‘trex’ box or equivalent) when applying to the whole file. We call this variant the “exhaustive default values mode”.
In this variant, depending on the number of bits in use for the fields describing runs of samples, the default value may be overloaded for a given fragment or even only for a given sample in the run of samples. Having a variable number of bits from 0 to 32, combined with the exhaustive list of default values, spares parsers the tests on flag values to determine whether a parameter describing the run of samples is present or not in the ‘trun’ box. For example, assuming the parser just reads the sample_duration value for a given sample, the parser has to determine whether a next parameter is present or not (since their presence is conditioned by the flag values in the ‘trun’ and/or ‘tfhd’ boxes). Before reading the next parameter, the parser has to check the presence or absence of the sample_size in the ‘trun’ box. This requires checking whether the ‘trun’ box has a predetermined value, such as 0x000200 (indicating sample-size-present), set in its flags. If not, the parser has to further check whether the track fragment header box contains a default value for the sample size. Then, depending on the results of these tests, the parser may have to interpret a parameter in the ‘trun’ box as the sample_size parameter for the current sample. The exhaustive default values mode, with the variable number of bits from 0 to 32 for each parameter, avoids carrying out these tests. By default, when parsing a sample, the default values are set. Then, the parser, informed at the beginning of the trun parsing of the number of bits in use for each parameter, is able to determine how many bits to parse for a given parameter. This makes the description of runs of samples simpler to parse and even more efficient. This variant may be used with the compact ‘trun’ box or with ‘trun’ boxes relying on patterns as described later in this invention.
Optimization of the ‘trun’ box using a compact ‘trun’ box
In embodiments, the compact ‘trun’ box is used to describe the runs of samples within media fragments and is further improved by the optimizations discussed above. The structure of the compact ‘trun’ box is then modified as follows:
aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
    // all index fields take value 0, 1, 2, 3 indicating 0, 1, 2, 4 bytes
    unsigned int(2) duration_size_index;
    unsigned int(2) sample_size_index;
    unsigned int(2) flags_size_index;
    unsigned int(2) composition_size_index;
    unsigned int(2) sample_count_size_index;
    // this flags value is optional since the number of bits returned by
    // f(sample_count_size_index) may be zero
    if (tr_flags & SAMPLE_COUNT_PRESENT) {
        unsigned int(f(sample_count_size_index)) sample_count;
    }
    // the following are optional fields
    if (data_offset_present) {
        if (tr_flags & DATA_OFFSET_16) {
            signed int(16) data_offset;
        } else {
            signed int(32) data_offset;
        }
    }
    if (first_sample_info_present) {
        unsigned int(32) first_sample_size;
        unsigned int(32) first_sample_flags;
    }
    // all the following arrays are effectively optional
    // as the field sizes can be zero
    unsigned int(f(duration_size_index))
        sample_duration[ sample_count ];
    unsigned int(f(sample_size_index))
        sample_size[ sample_count - (first_sample_info_present ? 1 : 0) ];
    unsigned int(f(flags_size_index))
        sample_flags[ sample_count - (first_sample_info_present ? 1 : 0) ];
    if (version == 0)
        { unsigned int(f(composition_size_index))
            sample_composition_time_offset[ sample_count ]; }
    else
        { signed int(f(composition_size_index))
            sample_composition_time_offset[ sample_count ]; }
}
The data_offset may also be provided with a variable number of bits as an alternative to the flags value DATA_OFFSET_16.
In embodiments where the compact ‘trun’ box is used to describe the fragments, the description of the samples within the fragments, i.e. the runs of samples, is further optimized as described below. The description of the first sample uses 32 bits to encode the size and the sample_flags values for the first sample in the run of samples. This could be further optimized by using a variable number of bits for these items of information. The packaging module determines the required number of bits and sets the actual value in use to describe the first sample inside the compact ‘trun’ box. The compact ‘trun’ box is then modified as follows (with the semantics unchanged):
aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
    // all index fields take value 0, 1, 2, 3 indicating 0, 1, 2, 4 bytes
    unsigned int(2) duration_size_index;
    unsigned int(2) sample_size_index;
    unsigned int(2) flags_size_index;
    unsigned int(2) composition_size_index;
    unsigned int(2) sample_count_size_index;
    unsigned int(2) first_sample_size_index; // 0 if !first-sample-info-present
    unsigned int(2) first_flags_size_index;  // 0 if !first-sample-info-present
    unsigned int(2) reserved = 0;
    unsigned int(f(sample_count_size_index)) sample_count;
    // the following are optional fields
    if (data_offset_present)
        if (tr_flags & DATA_OFFSET_16) signed int(16) data_offset;
        else signed int(32) data_offset;
    if (first_sample_info_present) {
        unsigned int(f(first_sample_size_index)) first_sample_size;
        unsigned int(f(first_flags_size_index)) first_sample_flags;
    }
}
In a variant, rather than considering that the compact ‘trun’ box relies on flag values for the indication of presence or absence of some parameters, the compact ‘trun’ box uses the “exhaustive default values” mode. The ‘trex’ box exhaustively defines default values for each field or parameter describing a run of samples. The 2-bit size_index fields are exhaustive, i.e. all the fields in the compact ‘trun’ box come with an indication of the number of bits used to encode them. This makes parsing simpler by avoiding some tests carried out on a sample basis. The CompactTrackRunBox is modified as follows:
aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
//all index fields take value 0, 1,2,3 indicating 0, 1,2,4 bytes
unsigned int(2) duration_size_index;
unsigned int(2) sample_size_index;
unsigned int(2) flags_size_index;
unsigned int(2) composition_size_index;
unsigned int(2) sample_count_size_index;
unsigned int(2) first_sample_size_index;
unsigned int(2) first_flags_size_index;
unsigned int(2) data_offset_size_index;
unsigned int(f(sample_count_size_index)) sample_count;
signed int(f(data_offset_size_index)) data_offset;
unsigned int(f(first_sample_size_index)) first_sample_size;
unsigned int(f(first_flags_size_index)) first_sample_flags;
// all the following arrays are effectively optional
// as the field sizes can be zero
unsigned int(f(duration_size_index)) sample_duration[ sample_count ];
unsigned int(f(sample_size_index)) sample_size[ sample_count -
(first_sample_info_present ? 1 : 0) ];
unsigned int(f(flags_size_index)) sample_flags[ sample_count -
(first_sample_info_present ? 1 : 0) ];
if (version == 0) {
unsigned int(f(composition_size_index))
sample_composition_time_offset[ sample_count ];
} else {
signed int(f(composition_size_index) )
sample_composition_time_offset[ sample_count ];
}
}
Each optimization may be applied independently from the others, but the maximum compression can generally be obtained by combining them all. Advantageously, for optimized versions of the track run box providing a sample_count_size_index, the interleave_index_size parameter becomes optional because it can be deduced as being equal to the sample_count_size_index. This saves some flag values and avoids the double declaration of a same value. This is because the signaling of the new sample positions after reordering, when coded as sample offsets, does not require more values than the number of samples (sample_count) declared in the track run.
In alternative embodiments, the compact ‘trun’ box uses the “exhaustive default values” mode and the sample description is provided in a loop on the samples (and not as a list of arrays as in the above):
aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
//all index fields take value 0, 1,2,3 indicating 0, 1,2,4 bytes
unsigned int(2) duration_size_index;
unsigned int(2) sample_size_index;
unsigned int(2) flags_size_index;
unsigned int(2) composition_size_index;
unsigned int(2) sample_count_size_index;
unsigned int(2) first_sample_size_index;
unsigned int(2) first_flags_size_index;
unsigned int(2) data_offset_size_index;
unsigned int(f(sample_count_size_index)) sample_count;
signed int(f(data_offset_size_index)) data_offset;
unsigned int(f(first_sample_size_index)) first_sample_size;
unsigned int(f(first_flags_size_index)) first_sample_flags;
for (sample=0; sample<sample_count; sample++) {
//all the following parameters are effectively optional
// as the field sizes can be zero
unsigned int(f(duration_size_index)) sample_duration;
unsigned int(f(sample_size_index)) sample_size;
unsigned int(f(flags_size_index)) sample_flags;
if (version == 0) {
unsigned int(f(composition_size_index)) sample_composition_time_offset;
} else {
signed int(f(composition_size_index)) sample_composition_time_offset;
}
}
}
with unchanged semantics.
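In this mode, a parser can read every field unconditionally, since a zero width yields no bits and the default value (from the ‘trex’ box) applies. The following minimal Python sketch illustrates this; the BitReader and read_field helpers are hypothetical and not part of the specification:

```python
class BitReader:
    """Minimal MSB-first bit reader over a bytes object (illustrative only)."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position

    def read(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            value = (value << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return value

def read_field(r: BitReader, width_bits: int, default: int = 0) -> int:
    """Read one sample field; a zero width means the field is absent and the
    default value (e.g. declared in 'trex') applies, with no flag test."""
    return default if width_bits == 0 else r.read(width_bits)
```

This is what makes parsing simpler: the per-sample presence tests disappear and only the field widths drive the reader.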
In particular embodiments, the compact ‘trun’ box uses the “exhaustive default values” mode with a variable number of bits to encode the parameters, but these numbers of bits, instead of being defined in dedicated 2-bit codes, are specified through the flags value of the compact ‘trun’ box. This is possible because, with the “exhaustive default values” mode, flags are no longer needed to indicate the presence or absence of the parameters. This saves 16 bits per ‘trun’ box. The compact ‘trun’ box is then rewritten as follows (with unchanged semantics, only the bit-length computation changes):
aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
unsigned int(nb_bits(tr_flags)) sample_count;
signed int(nb_bits(tr_flags >> 2)) data_offset; // >> is the binary shift operator
// information on first sample:
unsigned int(nb_bits(tr_flags >> 4)) sample_duration;
unsigned int(nb_bits(tr_flags >> 6)) first_sample_size;
unsigned int(nb_bits(tr_flags >> 8)) first_sample_flags;
if (version == 0) {
unsigned int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
} else {
signed int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
}
// remaining samples:
for (sample= 1; sample<sample_count; sample++) {
//all the following parameters are effectively optional
// as the field sizes can be zero
unsigned int(nb_bits(tr_flags >> 4)) sample_duration;
unsigned int(nb_bits(tr_flags >> 6)) sample_size;
unsigned int(nb_bits(tr_flags >> 8)) sample_flags;
if (version == 0) {
unsigned int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
} else {
signed int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
}
}
}
Where the following tr_flags values are defined for the compact‘trun’ box:
For a given parameter, a 2bits_flags_value is defined. The following function returns the actual number of bits in use for a given 2bits_flags_value.
unsigned int(8) nb_bits(2bits_flags_value) {
switch(2bits_flags_value & 0b00000011) { // & is the binary AND operator
case 0: return 0;
case 1: return 8;
case 2: return 16;
case 3: return 32;
}
}
The 2bits_flags_value may be defined as follows:
sample_count_2bit_flags_value is one value in [0x00, 0x01, 0x02, 0x03]
data_offset_2bit_flags_value is one value in [0x00, 0x04, 0x08, 0x0C]
first_sample_size_2bit_flags_value is one value in [0x00, 0x10, 0x20, 0x30]
first_sample_flags_2bit_flags_value is one value in [0x00, 0x40, 0x80, 0xC0]
sample_duration_2bit_flags_value is one value in [0x00, 0x100, 0x200, 0x300]
sample_size_2bit_flags_value is one value in [0x00, 0x400, 0x800, 0xC00]
sample_flags_2bit_flags_value is one value in [0x00, 0x1000, 0x2000, 0x3000]
composition_time_2bit_flags_value is one value in the reserved value range [0x00, 0x4000, 0x8000, 0xC000].
From the above list, the 16-bit word formed by the tr_flags value provides the number of bits used to represent each parameter of the compact ‘trun’ box. The order of the declarations and the values above may be changed, but the corresponding computation in the compact ‘trun’ box shall be updated accordingly, typically the bit-shifting operation in the call to the nb_bits function. This mechanism is extensible and may be used to add the reordering or interleaving parameter in the same way.
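For illustration, the decoding of the field widths from the tr_flags word can be sketched in Python as follows. The shift amounts used here follow the declaration order of the box syntax above; as the text notes, the order may be changed as long as the computation is updated accordingly:

```python
def nb_bits(two_bit_code: int) -> int:
    """Width in bits encoded by a 2-bit code, mirroring nb_bits() above."""
    return (0, 8, 16, 32)[two_bit_code & 0b11]

def field_widths(tr_flags: int) -> dict:
    """Decode the 16-bit tr_flags word into per-parameter widths (in bits).
    The shift amounts mirror those used in the box syntax above."""
    shifts = {
        "sample_count": 0,
        "data_offset": 2,
        "sample_duration": 4,
        "sample_size": 6,
        "sample_flags": 8,
        "composition_time_offset": 10,
    }
    return {name: nb_bits(tr_flags >> s) for name, s in shifts.items()}
```

A writer sets each 2-bit code to the smallest width that holds the values to be stored, then never needs per-parameter presence flags.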
It is to be noted that when an optimized or compact trun box is used, the ssix box potentially indexing the samples described by this optimized or compact trun box may also benefit from the following optimizations (for example using a new version of the ssix box or some flags value):
First, since the subsegment_count field (currently 32 bits in version 0 of ‘ssix’) shall be equal to the reference_count in the immediately preceding SegmentIndexBox (coded on 16 bits), it may be more consistent to describe the subsegment_count on 16 bits in a new version 1 of the ‘ssix’. Then, the optimized ‘trun’ box may use a variable size to encode the sample count. There cannot be more ranges declared in the ‘ssix’ box (range_count field) than the number of samples declared in the track run (sample_count field). The version 1 of the ‘ssix’ also contains flags indicating the number of bits to use for the range_count. This number of bits may be set equal to the number of bits used to encode the sample_count in the optimized ‘trun’ box (for example the sample_count_size_index value).
To that end, the ssix box may be modified as follows:
aligned(8) class SubsegmentIndexBox extends FullBox('ssix', version, flags) {
if (version == 0) unsigned int(32) subsegment_count;
else unsigned int(16) subsegment_count;
for (i=1; i <= subsegment_count; i++) {
if (version == 0) unsigned int(32) range_count;
else unsigned int(flags & sample_count16_bits ? 16 : 8) range_count;
for (j=1; j <= range_count; j++) {
unsigned int(8) level;
unsigned int(24) range_size;
}
}
}
where sample_count16_bits is a reserved value for the flags of a compact ‘trun’ box (a one-bit value, as an alternative to the variants called “sample_count_size_index” using a two-bit value) and optionally of the new ‘ssix’ box. This reserved value indicates that, when set, the sample_count field in the compact ‘trun’ that is indexed by this new ‘ssix’ box is coded on 16 bits. When not set, this sample_count field is encoded on 8 bits. When set, it can also be directly interpreted as meaning that the range_count of the ‘ssix’ is described on 16 bits and, when not set, that it is encoded on 8 bits. When the compact ‘trun’ has a one-bit value for the flags reserved for the sample_count16_bits (instead of a two-bit value like sample_count_size_index), the one bit saved in the flags field of the compact ‘trun’ can then be used to indicate the presence or absence of the sample ordering information (for example the sample_interleave_bit), and the number of bits to use for sample_interleave_index is the same as the number of bits inferred from the sample_count16_bits flags. The value is deduced either from a simple test on the value of the flags sample_count16_bits (set or not set) of the compact ‘trun’ or directly from the sample_count value contained in the compact ‘trun’ that is indexed by this ‘ssix’ box, as follows: if (sample_count<256) return 8 else return 16.
Optimization of the ‘trun’ box using a ‘trun’ box relying on patterns
The sample_flags information may be compacted in the ‘trun’ box relying on patterns. This is useful as it enables storing sample_flags on 16 bits, getting rid of the sample_degradation_priority field that is not used by most (if not all) sequences.
According to embodiments, a new flag is introduced in the TrackRun Pattern Box to adapt the number of bits to represent the sample flags information:
aligned(8) class TrackRunPatternBox
extends FullBox('trup', version, flags) {
// length of subsequent syntax elements
unsigned int(2) nbm1_sample_count;
unsigned int(2) nbm1_sample_duration;
unsigned int(2) nbm1_pattern_index;
unsigned int(2) nbm1_ct_offset;
unsigned int(2) nbm1_sample_flags;
unsigned int(6) reserved;
numBitsSampleCount = (nbm1_sample_count + 1) * 8;
numBitsSampleDuration = (nbm1_sample_duration + 1) * 8;
numBitsPatternIdx = (nbm1_pattern_index + 1) * 8;
numBitsCTOffset = (nbm1_ct_offset + 1) * 8;
numBitsSampleFlags= (nbm1_sample_flags + 1) * 8;
...
}
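For illustration, the nbm1_* fields of this box map to field widths through the (nbm1 + 1) * 8 computation above. The following Python sketch shows the mapping and a hypothetical helper for choosing the smallest code that fits a value (the helper name is an assumption, not part of the box definition):

```python
def num_bits(nbm1: int) -> int:
    """numBitsX = (nbm1_x + 1) * 8: a 2-bit code 0..3 gives 8, 16, 24 or 32 bits."""
    return (nbm1 + 1) * 8

def smallest_nbm1(max_value: int) -> int:
    """Smallest 2-bit code whose width can represent max_value (illustrative)."""
    for code in range(4):
        if max_value < (1 << num_bits(code)):
            return code
    raise ValueError("value does not fit on 32 bits")
```

A packaging module would compute smallest_nbm1 over the largest value to be stored for each parameter before writing the box.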
The track run pattern struct is adapted accordingly:
aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags,
numBitsSampleFlags)
{
for (i = 0; i < numSamples; i++) {
if (boxFlags & SAMPLE_DURATION_PRESENT)
unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
if (boxFlags & SAMPLE_FLAGS_PRESENT)
unsigned int(numBitsSampleFlags) sample_flags[patIdx][i];
}
}
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the ‘trun’ pattern box) describing samples using patterns and comprising a configurable parameter (e.g., SAMPLE_FLAGS) having a configurable size, wherein the configurable parameter provides characteristics (or properties) associated with a sample of the set of continuous samples; and,
encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
Concerning the flags of the ‘trun’ box relying on patterns, some of the flags values introduced for the track run box relying on patterns apply to the TrackRunPatternBox rather than to the ‘trun’ box relying on patterns itself. Accordingly, one may consider either splitting the set of flags or describing the flags allowed in these two boxes. According to a preferred embodiment, the set of flag values is split and new flags are defined for the TrackRunPatternBox. It is to be noted that the values and names are just examples: any reserved values and names not conflicting with other flag values can be used.
[Table of flag values defined for the TrackRunPatternBox]
When the fragment description is based on a ‘trun’ box relying on patterns, the sample count is always present. In the case where a ‘trun’ box describes a single GOP, the sample count may be the same as the sample count value of the pattern. Therefore, according to particular embodiments, a new flag value is defined for the ‘trun’ box in pattern mode:
[Table of the flag value defined for the ‘trun’ box in pattern mode (e.g., SAMPLE_COUNT_PRESENT)]
The version of the ‘trun’ box relying on patterns is then updated accordingly:
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
if (version == 0 || version == 1) {
// syntax of trun is unchanged
}
else if (version >= 2) {
if (numPatterns > 1) {
unsigned int(numBitsPatternIdx) pat_idx;
patIdx = pat_idx;
}
else patIdx = 0;
if (tr_flags & SAMPLE_COUNT_PRESENT)
unsigned int(numBitsSampleCount) sample_count_minus1;
if (tr_flags & DATA_OFFSET_PRESENT)
signed int(32) data_offset;
initSampleFlag = ((tr_flags & FIRST_SAMPLE_PRESENT) > 0);
if (initSampleFlag == 1) {
unsigned int(numBitsFirstSampleFlags) sample_flags[0];
unsigned int(numBitsFirstSampleDuration) sample_duration[0];
unsigned int(numBitsSampleSize) sample_size[0];
if (version == 2)
unsigned int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
else // version 3
signed int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
}
if (numBitsSampleSize > 0) {
for (i = initSampleFlag, inPatternIdx = 0, totalBits = 0;
i <= sample_count_minus1; i++) {
unsigned int(numBitsSampleSize[patIdx][inPatternIdx])
sample_size[i];
totalBits += numBitsSampleSize[patIdx][inPatternIdx];
refIdx[i] = inPatternIdx;
inPatternIdx = ((inPatternIdx + 1) %
(pattern_len_minus1[patIdx] + 1));
}
// byte alignment
numBitsInLastByte = totalBits % 8;
if (numBitsInLastByte)
bit(8-numBitsInLastByte) reserved = 0;
}
}
}
Another consideration regarding the GOP pattern is that, when a video sequence uses a fixed GOP pattern, the first sample of the GOP (typically an IDR frame) usually has a much larger frame size than the other (Predicted or Bi-directional) frames. In the meantime, the other properties (sample flags, CT offset, duration) are usually the same from one sample to another. The current design of the ‘trun’ box relying on patterns makes provision for specific handling of the first sample in the ‘trun’. However, the pattern structure enables a per-sample number of bits to encode the size, which can be used to handle the first sample of the ‘trun’ or of the GOP if there are multiple GOPs in the ‘trun’ (i.e. the pattern is repeated).
According to embodiments, the pattern ‘trun’ is simplified by removing all the first sample items of information (and related flags). This is simply done by looping on all the samples in the run instead of starting on the second one (i.e. instead of having specific signaling for the first sample):
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
if (version == 0 || version == 1) {
// syntax unchanged
}
else if (version >= 2) {
if (numPatterns > 1) {
unsigned int(numBitsPatternIdx) pat_idx;
patIdx = pat_idx;
}
else patIdx = 0;
unsigned int(numBitsSampleCount) sample_count_minus1;
if (tr_flags & DATA_OFFSET_PRESENT)
signed int(32) data_offset;
if (numBitsSampleSize > 0) {
for (i = 0, inPatternIdx = 0, totalBits = 0;
i <= sample_count_minus1; i++) {
unsigned int(numBitsSampleSize[patIdx][inPatternIdx])
sample_size[i];
totalBits += numBitsSampleSize[patIdx][inPatternIdx];
refIdx[i] = inPatternIdx;
inPatternIdx = ((inPatternIdx + 1) %
(pattern_len_minus1[patIdx] + 1));
}
// byte alignment
numBitsInLastByte = totalBits % 8;
if (numBitsInLastByte)
bit(8-numBitsInLastByte) reserved = 0;
}
}
}
According to embodiments, the ‘trun’ box relying on patterns is optimized by using a variable bit length for coding the data offset, for example indicated by a specific flag value in the ‘trun’ box. For example, the value 0x100000 and the name “DATA_OFFSET_16” are reserved to indicate that, when set, the data offset is coded on 16 bits. This flag value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags of the TrackFragmentHeaderBox is not set. The ‘trun’ box comprising such an optimization is then rewritten as:
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
if (version == 0 || version == 1) {
// syntax unchanged
}
else if (version >= 2) {
if (numPatterns > 1) {
unsigned int(numBitsPatternIdx) pat_idx;
patIdx = pat_idx;
}
else patIdx = 0;
unsigned int(numBitsSampleCount) sample_count_minus1;
if (tr_flags & DATA_OFFSET_PRESENT) {
if (tr_flags & DATA_OFFSET_16)
signed int(16) data_offset;
else
signed int(32) data_offset;
}
initSampleFlag = ((tr_flags & FIRST_SAMPLE_PRESENT) > 0);
if (initSampleFlag == 1) {
unsigned int(numBitsFirstSampleFlags) sample_flags[0];
unsigned int(numBitsFirstSampleDuration) sample_duration[0];
unsigned int(numBitsSampleSize) sample_size[0];
if (version == 2)
unsigned int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
else // version 3
signed int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
}
if (numBitsSampleSize > 0) {
for (i = initSampleFlag, inPatternIdx = 0, totalBits = 0;
i <= sample_count_minus1; i++) {
unsigned int(numBitsSampleSize[patIdx][inPatternIdx])
sample_size[i];
totalBits += numBitsSampleSize[patIdx][inPatternIdx];
refIdx[i] = inPatternIdx;
inPatternIdx = ((inPatternIdx + 1) %
(pattern_len_minus1[patIdx] + 1));
}
// byte alignment
numBitsInLastByte = totalBits % 8;
if (numBitsInLastByte)
bit(8-numBitsInLastByte) reserved = 0;
}
}
}
In other embodiments handling the first sample of a run of samples in the ‘trun’ box relying on patterns, the pattern currently using sample_flags for all samples (first and others) is modified by using a FIRST_SAMPLE_FLAGS_PRESENT flag in the pattern definition, to use a full 32 bits for the first sample of the pattern:
aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags) {
for (i = 0; i < numSamples; i++) {
if (boxFlags & SAMPLE_DURATION_PRESENT)
unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
if (i==0) {
if (boxFlags & FIRST_SAMPLE_FLAGS_PRESENT)
unsigned int(32) sample_flags[patIdx][i];
} else {
if (boxFlags & SAMPLE_FLAGS_PRESENT)
unsigned int(32) sample_flags[patIdx][i];
}
if (boxFlags & SAMPLE_CT_OFFSETS_PRESENT) {
if (version == 0)
signed int(numBitsCTOffset)
sample_composition_time_offset[patIdx][i];
else
unsigned int(numBitsCTOffset)
sample_composition_time_offset[patIdx][i];
}
}
if (boxFlags & SAMPLE_SIZE_PRESENT) {
for (i = 0; i < numSamples; i++) {
unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
numBitsSampleSize[patIdx][i] =
(num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
}
if (numSamples % 2) bit(4) reserved = 0;
}
}
It is to be noted that these optimizations on data_offset, first_sample handling, flag values for the sample count presence, the specific flag value for the track run pattern box and the variable bit length for sample_flags may be combined to further improve the efficiency of the ‘trun’ box relying on patterns.
The “exhaustive default values” mode variant can also be used for ‘trun’ boxes relying on patterns: the exhaustive list of default values can be defined in the pattern description. The pattern itself may use some of these default values, and the TrackRunPatternBox is also modified to allow a null number of bits to support the absence of a parameter without checking any flags value:
aligned(8) class TrackRunPatternBox
extends FullBox('trup', version, flags) {
// length of subsequent syntax elements (exhaustive list)
unsigned int(2) nbm1_sample_count;
unsigned int(2) nbm1_sample_duration;
unsigned int(2) nbm1_pattern_index;
unsigned int(2) nbm1_ct_offset;
unsigned int(2) nbm1_sample_size;
unsigned int(2) nbm1_sample_flags;
unsigned int(2) nbm1_data_offset;
// These two fields may be omitted if the trun relying on patterns does not
// include specific processing of the first sample in the run
unsigned int(2) nbm1_first_sample_size;
unsigned int(2) nbm1_first_sample_flags;
// 0, 8, 16 or 32 bits are used:
numBitsSampleCount = (nbm1_sample_count & 2) * 16 + (nbm1_sample_count & 1) * 8;
numBitsSampleDuration = (nbm1_sample_duration & 2) * 16 + (nbm1_sample_duration & 1) * 8;
numBitsPatternIdx = (nbm1_pattern_index + 1) * 8;
numBitsCTOffset = (nbm1_ct_offset & 2) * 16 + (nbm1_ct_offset & 1) * 8;
numBitsSampleSize = (nbm1_sample_size & 2) * 16 + (nbm1_sample_size & 1) * 8;
numBitsSampleFlags = (nbm1_sample_flags & 2) * 16 + (nbm1_sample_flags & 1) * 8;
numBitsDataOffset = (nbm1_data_offset & 2) * 16 + (nbm1_data_offset & 1) * 8;
// These two fields may be omitted if the trun relying on patterns does not
// include specific processing of the first sample in the run
numBitsFirstSampleSize = (nbm1_first_sample_size & 2) * 16 + (nbm1_first_sample_size & 1) * 8;
numBitsFirstSampleFlags = (nbm1_first_sample_flags & 2) * 16 + (nbm1_first_sample_flags & 1) * 8;
numPatterns = 0;
for (i = 0;; i++) { // until the end of the box
unsigned int(8) pattern_len_minus1[i];
TrackRunPatternStruct(version, i, pattern_len_minus1[i]+1, flags) // flags may no longer be needed
trackRunPattern[i];
numPatterns++;
}
}
The different variables in the pattern definition above provide the actual number of bits to describe and to parse samples in a run of samples.
With the above TrackRunPatternBox, the TrackRunPatternStruct can be modified as follows, allowing a parser to avoid tests on presence or absence of parameters in the sample description:
aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags) {
for (i = 0; i < numSamples; i++) {
unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
unsigned int(numBitsSampleFlags) sample_flags[patIdx][i];
if (version == 0)
signed int(numBitsCTOffset)
sample_composition_time_offset[patIdx][i];
else
unsigned int(numBitsCTOffset)
sample_composition_time_offset[patIdx][i];
}
if (numBitsSampleSize) {
for (i = 0; i < numSamples; i++) {
unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
numBitsSampleSize[patIdx][i] =
(num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
}
if (numSamples % 2)
bit(4) reserved = 0;
}
}
and the ‘trun’ box relying on these pattern definitions is rewritten as follows:
aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
if (version == 0 || version == 1) {
// syntax unchanged
}
else if (version >= 2) {
if (numPatterns > 1) {
unsigned int(numBitsPatternIdx) pat_idx;
patIdx = pat_idx;
}
else patIdx = 0;
unsigned int(numBitsSampleCount) sample_count_minus1;
signed int(numBitsDataOffset) data_offset;
// No more test on presence of first sample
unsigned int(numBitsFirstSampleFlags) sample_flags[0];
unsigned int(numBitsFirstSampleDuration) sample_duration[0];
unsigned int(numBitsSampleSize) sample_size[0];
if (version == 2)
unsigned int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
else // version 3
signed int(numBitsFirstSampleCtOffset)
sample_composition_time_offset[0];
if (numBitsSampleSize > 0) {
for (i = 1, inPatternIdx = 0, totalBits = 0;
i <= sample_count_minus1; i++) {
unsigned int(numBitsSampleSize[patIdx][inPatternIdx])
sample_size[i];
totalBits += numBitsSampleSize[patIdx][inPatternIdx];
refIdx[i] = inPatternIdx;
inPatternIdx = ((inPatternIdx + 1) %
(pattern_len_minus1[patIdx] + 1));
}
// byte alignment
numBitsInLastByte = totalBits % 8;
if (numBitsInLastByte)
bit(8-numBitsInLastByte) reserved = 0;
}
}
}
As for the compact ‘trun’ box, when the flags of the TrackRunPatternBox are not used to control the presence or absence of some parameters, the number of bits in use to encode the parameters may be provided as a list of 2bits_flags_value.
For most video formats, the file format may carry, within the metadata for sample description, the composition time offsets (in the ‘ctts’ box) to indicate a sample presentation time. The sample presentation time may correspond to the composition time or may correspond to the composition time adjusted by one or more edit lists (described in the ‘elst’ box). The composition time offset for a given sample is coded as the difference between the sample presentation time of this sample and the current sample delta (sum of the durations of the previous samples). This offset is usually coded on 32 bits (for example in the standard ‘trun’ or in ‘ctts’ boxes), or on a smaller number of bits (8 to 32 bits) in the compact ‘trun’ box. The sample composition time (CT) offset is expressed in media timescale, which for video is usually a multiple of the framerate or a large number (e.g. 90k or 1M), resulting in large composition offsets, which can be quite heavy in terms of signalling. For some simple framerates (integer numbers), this is not an issue as a small timescale can be picked, but this does not apply to non-integer framerates or to some distribution systems enforcing a very high timescale. In a typical GOP (Group Of Pictures in a video stream), some frames have a CT offset different from 0 and some have a CT offset of 0 (for which samples this applies depends on the GOP structure and the ‘ctts’ version, i.e. positive or negative offsets).
For example an IBBP pattern repeated in a GOP may have the following decoding and composition times and offsets:
(decoding order) I1 P4 B2 B3 P7 B5 B6...
decoding time 0 10 20 30 40 50 60 (DT)
composition time 10 40 20 30 70 50 60 (CT)
decode delta 10 10 10 10 10 10 10 (DT(n+1 ) - DT(n))
CT offset 10 30 0 0 30 0 0
CT/Decode delta 1 3 0 0 3 0 0
Storing the CT offset per sample (instead of run-length encoding) would allow gaining some space, but would require a different signaling (typically through sample flags) per sample, which in turn is not very efficient. In most cases however, the CT offset is not just any number: it is a delay in a number of frames, and can be expressed as N * sample_duration. Since the sample duration is known in video with a constant frame rate, storing the number of frames instead of the actual offset achieves higher compactness. For example, for a 30 fps video with a timescale of 30000 in a one-second GOP (e.g. sample_duration=1000), the CT offset of the first P frame following the IDR can go up to 29 frames. Hence 29*1000 = 29000, requiring 2 bytes to store the CT offset but only one byte with our approach (the overall gain for the GOP is 30 times 1 byte). For a 3-second GOP (90 frames), the offset could reach 89*1000 = 89000, requiring 3 bytes to store the CT offset, but still only one byte with our approach (the overall gain for the GOP is 90 times 2 bytes). In some corner cases, the CT offset might need to be expressed as a multiple of the sample duration (e.g. 29.97 fps at a drop boundary). In order to keep the possibility to use both signalings (timescale difference or frame difference), we then propose to define, within the metadata describing the samples, an indication about how the CT field should be interpreted.
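The byte counts in the example above can be checked with a short Python sketch (bytes_needed is a hypothetical helper; only the numerical figures come from the text):

```python
import math

def bytes_needed(value: int) -> int:
    """Smallest whole number of bytes able to hold an unsigned value."""
    return max(1, math.ceil(value.bit_length() / 8))

sample_duration = 1000  # 30 fps video with a timescale of 30000
# one-second GOP: the CT offset of the first P frame can reach 29 frames
assert bytes_needed(29 * sample_duration) == 2  # 29000 needs 2 bytes
assert bytes_needed(29) == 1                    # the frame count needs 1 byte
# three-second GOP (90 frames)
assert bytes_needed(89 * sample_duration) == 3  # 89000 needs 3 bytes
assert bytes_needed(89) == 1
```

The gain grows with the GOP length because the frame count stays small while the timescale offset grows linearly with the duration.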
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes),
wherein a metadata item of the structured metadata items comprises indication information (e.g., SAMPLE_CT_FRAME) indicating whether a composition time offset parameter is coded as a function of a sample duration or not; and,
encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
Specific embodiments depending on the type of the ‘trun’ box (e.g., standard, compact, or relying on patterns) are described below.
In an embodiment where the trun box used to encapsulate the media fragments is a standard‘trun’ box, the indication of the composition_time_offset could be present in the sample description, for example as a specific flags value in the CompositionOffsetBox (‘ctts’):
0x000001 sample-composition-time-offsets-frames: when set, this indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units. When the flag is set, the fields of the box use half as many bits as when it is not set (16 bits instead of 32), to benefit from the shorter code for the sample_offset.
The‘ctts’ box would then be modified as follows (in bold):
aligned(8) class CompositionOffsetBox extends FullBox('ctts', version, flags) {
unsigned int(32) entry_count;
if (version==0) {
if (flags & 0x000001) {
for (int i=0; i < entry_count; i++) {
unsigned int(16) sample_count;
unsigned int(16) sample_offset;
}
} else {
for (int i=0; i < entry_count; i++) {
unsigned int(32) sample_count;
unsigned int(32) sample_offset;
}
}
}
else if (version == 1) {
if (flags & 0x000001) {
for (int i=0; i < entry_count; i++) {
unsigned int(16) sample_count;
signed int(16) sample_offset;
}
} else {
for (int i=0; i < entry_count; i++) {
unsigned int(32) sample_count;
signed int(32) sample_offset;
}
}
}
}
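The interpretation of the coded offset under this flag can be sketched as follows. This is a minimal Python sketch of the semantics stated above; the function name is hypothetical:

```python
def composition_offset(coded_value: int, sample_duration: int,
                       frames_flag: bool) -> int:
    """Recover the composition offset in timescale units.

    When the sample-composition-time-offsets-frames flag is set, the coded
    value is a multiple of the sample duration and must be multiplied back;
    otherwise it is already expressed in timescale units."""
    return coded_value * sample_duration if frames_flag else coded_value
```

For the 30 fps example above, the same offset of 29 frames is recovered either from the 1-byte coded value 29 with the flag set, or from the value 29000 with the flag not set.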
In an embodiment where the ‘trun’ box used to encapsulate the media fragments is a compact ‘trun’ box, the following flags value is defined (in addition to the existing ones) for the compact track run box. It is to be noted that this value (respectively this name) is just an example: any reserved or dedicated value (respectively name) not conflicting with an existing flags value (respectively name) can be used:
0x001000 sample-composition-time-offsets-frames: when set, this indicates that the composition offset is coded as a multiple of the sample duration (whatever the number of bits used), and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units. For example, the packaging module 313 in Figure 3 can be informed that the encoding is done with a constant frame rate. In such a case, it sets the flags value and provides the composition time offsets as a multiple of a sample duration, thus reducing the necessary number of bits.
According to embodiments where the ‘trun’ box used to encapsulate the media fragments is relying on patterns, an additional flags value is defined for the track run pattern box as follows, in addition to the other existing flags values:
0x100000 SAMPLE_CT_FRAME: when this bit is set, it indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units.
Again, it is to be noted that this flags value, respectively the name, is just an example, any reserved or dedicated value, respectively name, not conflicting with existing flag value, respectively name can be used.
As an alternative to the flag value indicating that the composition time offset is coded as a multiple of the sample duration, this can be inferred in specific cases where the flags in the box hierarchy describing the fragments (e.g. ‘moof’ or ‘traf’) indicate a default_duration or that the sample_duration is not present in the ‘trun’ box.
According to other embodiments, the media data to encapsulate come with additional or auxiliary data. For example, it may be a depth item of information accompanying a video stream. In another example, it may be auxiliary data describing encryption parameters per sample as used by MPEG Common Encryption (CENC). Fragmenting the media data and their auxiliary items of information may use the ‘trun’ plus the ‘saiz’ boxes to encapsulate these data in the same media file or media segment (as mandated for example in ISOBMFF). The current syntax for the ‘saiz’ box is as follows:
aligned(8) class SampleAuxiliaryInformationSizesBox extends FullBox('saiz', version = 0, flags)
{
if (flags & 1) {
unsigned int(32) aux_info_type;
unsigned int(32) aux_info_type_parameter;
}
unsigned int(8) default_sample_info_size;
unsigned int(32) sample_count;
if (default_sample_info_size == 0) {
unsigned int(8) sample_info_size[ sample_count ];
}
}
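The resolution of per-sample auxiliary information sizes from these fields can be sketched as follows (an illustrative helper, not part of any standard API; the function name is chosen for this description):

```python
def aux_info_sizes(default_sample_info_size, sample_count, sample_info_size=None):
    """Resolve the auxiliary information size of each sample from the
    'saiz' version 0 fields shown above."""
    if default_sample_info_size != 0:
        # All samples share the default size; the per-sample table is omitted.
        return [default_sample_info_size] * sample_count
    # Otherwise one 8-bit size is present per sample.
    return list(sample_info_size[:sample_count])
```

This makes the inefficiency discussed below visible: a single sample whose size differs from the default forces default_sample_info_size to 0 and expands the full per-sample table.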
In the example use case of media data encryption, the MPEG Common Encryption scheme uses auxiliary data describing encryption parameters per sample. This information typically includes the Initialization Vector (IV) for the whole sample, or the IV and a list of clear and encrypted byte ranges in the sample (subsample encryption). In some configurations, such as cbcs with a constant IV, this information is empty and consequently omitted. In other configurations, this information shall be signaled through the sample auxiliary information mechanism, using the 'saiz' and 'saio' boxes (in the main movie or in movie fragments). For subsample encryption, the size of the auxiliary data can change in the following cases:
- a different number of slices in each frame, leading to a different number of subsamples in configurations where the slice header shall remain unencrypted. Leaving the slice header unencrypted is useful when slice header rewriting is needed, for example when mixing tiles. It is also useful when the application needs to identify which part is encrypted by inspecting the slice header, for example in selective encryption use cases where only a spatial part such as a slice or a tile is encrypted;
- the injection, at specific frames, of large Supplemental Enhancement Information (SEI) data (for example more than 65k bytes), forcing the creation of a new subsample entry with no encrypted bytes, although this is not so common;
- mixing protected and non-protected samples: the protected samples have an associated 'saiz' entry different from 0, while the unprotected samples have an associated 'saiz' entry equal to 0. This may correspond to an area encrypted for privacy reasons or to an area where one has to pay to see the content of a particular area of interest;
- mixing different configurations of the encryption parameters, such as different per-sample Initialization Vector sizes;
- in schemes supporting partial encryption of Video Coding Layer data (such as sensitive encryption), varying number of protected byte ranges across samples; and
- in schemes supporting multiple key encryption (such as sensitive encryption), varying number of keys used per sample.
The variations are documented in an associated sample group description entry of type 'seig', and the mapping of each sample to the group is done using the SampleToGroupBox, with a compact version proposed in DAM1 of 14496-12. However, a compact representation for the description of the size has not been studied.
The variations can take various forms: repeated patterns, single-slot variations, or bursts of the same value. However, the resulting sizes usually cover a well-defined set of values, representing all the possible encryption/encoding configurations.
As can be seen from the above definition, a single variation in the auxiliary sample data size (default_sample_info_size) results in expanding the entire table, which is not very efficient. Therefore, according to particular embodiments, a new version of the 'saiz' box is defined, enabling simple run-length encoding, which addresses most use cases, and pattern description for cases where patterns can be used.
aligned(8) class SampleAuxiliaryInformationSizesBox extends FullBox('saiz', version, flags) {
if (flags & 1) {
unsigned int(32) aux_info_type;
unsigned int(32) aux_info_type_parameter;
}
if (version==0) {
unsigned int(8) default_sample_info_size;
unsigned int(32) sample_count;
if (default_sample_info_size == 0) {
unsigned int(8) sample_info_size[ sample_count ];
}
} else if (version==1) {
unsigned int(32) entry_count;
for (i=0; i<entry_count; i++) {
unsigned int(8) sample_count_in_entry;
unsigned int(8) si_rle_size;
}
} else if (version==2) {
unsigned int(32) pattern_count;
for (i=0; i < pattern_count; i++) { //pattern definition
unsigned int(8) pattern_length[i];
unsigned int(8) sample_pat_count[i];
}
for (j=0; j < pattern_count; j++) {
for (k=0; k < pattern_length[j]; k++) {
unsigned int(8) si_pat_size[j][k];
}
}
}
}
with the following semantics:
entry_count gives the number of entries in the box when version 1 is used; sample_count_in_entry gives the number of consecutive samples for which the si_rle_size applies. Samples are listed in decoding order. The same remarks as for sample_count apply;
si_rle_size gives the size in bytes of the sample auxiliary information for the samples in the current entry;
pattern_count indicates the number of successive patterns in the pattern array that follows it. The sum of the included sample_pat_count[i] values indicates the number of mapped samples;
pattern_length[i] corresponds to a pattern within the second array of si_pat_size[j] values. Each instance of pattern_length[i] shall be greater than 0;
sample_pat_count[i] specifies the number of samples that use the i-th pattern; sample_pat_count[i] shall be greater than zero, and sample_pat_count[i] shall be greater than or equal to pattern_length[i];
si_pat_size[j][k] is an integer that gives the size of the sample auxiliary info data for the samples in the pattern.
When sample_pat_count[i] is equal to pattern_length[i], the pattern is not repeated.
When sample_pat_count[i] is greater than pattern_length[i], the si_pat_size[i][] values of the i-th pattern are used repeatedly to map the sample_pat_count[i] values. It is not necessarily the case that sample_pat_count[i] is a multiple of pattern_length[i]; the cycling may terminate in the middle of the pattern.
The total of the sample_pat_count[i] values for all values of i in the range of 1 to pattern_count, inclusive, shall be equal to the total sample count of the track (if the box is present in the sample table) or of the track fragment.
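Under these semantics, a reader expanding the version 2 description into per-sample sizes may proceed as sketched below (illustrative Python, using zero-based pattern indices; the function name is an assumption for this description):

```python
def expand_pattern_sizes(pattern_length, sample_pat_count, si_pat_size):
    """Expand the version 2 'saiz' pattern description into one auxiliary
    information size per sample.

    si_pat_size[j][k] holds the k-th size of the j-th pattern; each pattern j
    covers sample_pat_count[j] samples, cycling over its values and possibly
    terminating in the middle of the pattern."""
    sizes = []
    for j, count in enumerate(sample_pat_count):
        for k in range(count):
            sizes.append(si_pat_size[j][k % pattern_length[j]])
    return sizes
```

For example, a pattern of sizes (8, 4) applied to 5 samples followed by a pattern of a single size 0 applied to 3 samples expands to 8, 4, 8, 4, 8, 0, 0, 0: the first pattern indeed terminates mid-cycle.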
An alternative compact representation of the 'saiz' box avoids the redefinition of patterns when they reappear after a different pattern. For example, given the sequence of patterns ABC DE DE ABC XY ABC, the pattern "ABC" reappears after the "DE" pattern. To avoid redefining it, the pattern is referred to through a pattern index as follows:
aligned(8) class SampleAuxiliaryInformationSizesBox extends FullBox('saiz', version, flags)
{
if (flags & 1) {
unsigned int(32) aux_info_type;
unsigned int(32) aux_info_type_parameter;
}
if (version==0) {
unsigned int(8) default_sample_info_size;
unsigned int(32) sample_count;
if (default_sample_info_size == 0) {
unsigned int(8) sample_info_size[ sample_count ];
}
} else if (version==1) {
unsigned int(32) entry_count;
for (i=0; i<entry_count; i++) {
unsigned int(8) sample_count_in_entry;
unsigned int(8) si_rle_size;
}
} else if (version==2) {
unsigned int(32) entry_count;
for (i=0; i < entry_count; i++) {
unsigned int(8) pattern_idx[i];
unsigned int(8) sample_pat_count[i];
}
unsigned int(8) pattern_count;
for (j=0; j < pattern_count; j++) {
unsigned int(8) pattern_length[j];
for (k=0; k < pattern_length[j]; k++) {
unsigned int(8) si_pat_size[j][k];
}
}
}
}
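The decoding of this pattern-index variant can be sketched as follows (illustrative Python with zero-based pattern indices; the function name is an assumption). With the patterns ABC, DE, and XY each defined once, the entry list simply references them by index:

```python
def expand_indexed_sizes(pattern_idx, sample_pat_count, si_pat_size):
    """Expand the pattern-index variant of the 'saiz' box: each entry
    references a pattern by index, so a pattern such as ABC can be reused
    after a different pattern without being redefined."""
    sizes = []
    for idx, count in zip(pattern_idx, sample_pat_count):
        pat = si_pat_size[idx]
        for k in range(count):
            sizes.append(pat[k % len(pat)])
    return sizes

# The ABC DE DE ABC XY ABC example: three patterns, six entry uses expressed
# as five entries (the two consecutive DE uses merge into one entry of count 4).
patterns = [[1, 2, 3], [4, 5], [6, 7]]          # ABC, DE, XY
entries = ([0, 1, 0, 2, 0], [3, 4, 3, 2, 3])    # (pattern_idx, sample_pat_count)
```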
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data and auxiliary information associated with the contiguous samples;
generating metadata describing the obtained fragment, the metadata defining an auxiliary information size of the auxiliary information associated with the contiguous samples;
wherein the metadata comprise a parameter determined as a function of a number of times a pattern is used; and,
encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
The optimized version of the‘saiz’ box can be combined with any kind of ‘trun’ box: the standard‘trun’ box, the compact‘trun’ box, or the‘trun’ box relying on patterns.
The reordering indication can be combined with the further optimized compact 'trun' box according to embodiments of the invention. It can be combined with a compact 'trun' box containing one of the proposed optimizations or all the proposed optimizations for better efficiency. The encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to embodiments of the invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the mdat. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
The new unit to describe the composition_time_offset may be used with reordering information, whatever the type of 'trun' box in use: standard, compact, or relying on patterns. The encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to this invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the mdat. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received. The reordering indication can be combined with the further optimized 'trun' box relying on patterns according to embodiments of the invention. It can be combined with a 'trun' box relying on patterns containing one of the described optimizations or all the described optimizations for better efficiency. The encapsulated file or segment may further contain a compact 'saiz' or 'saiz' box according to embodiments of the invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the mdat. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
The compact‘saiz’ box may be used with any version of the‘trun’ box: standard‘trun’, compact‘trun’ box, or‘trun’ box relying on patterns. The compact‘saiz’ box may also be used when fragments are reordered as described with reference to Figures 4 to 7. In the latter case, the auxiliary data are advantageously placed at the beginning of the mdat. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
Examples of use of reordered trun boxes
As illustrated in Figure 8, a media presentation description may contain DASH index segments (or indexed segments) describing, in terms of byte ranges, encapsulated ISOBMFF movie fragments, wherein each subsegment comprises a mapping of levels (L0, L1, L2) to byte ranges.
As illustrated, DASH index file 805 (which may be described in DASH as an index segment or as an indexed segment) provides a mapping (e.g. 'sidx' box 810) of time to byte ranges of encapsulated ISOBMFF segments 800. For each subsegment, a mapping (e.g. 'ssix' box 815) of levels (denoted L0, L1, and L2, referenced 820) to byte ranges is provided, the levels being declared in a level assignment box denoted 'leva' (not represented).
The level assignment may be based on the sample group (the assignment_type of the 'leva' box is set to the value 0 and its grouping_type is set, for example, to 'tele') describing sub-temporal layers and their dependencies. The so-indexed sub-segments provide a list of byte ranges to get the samples of a given sub-temporal layer. When described in DASH, for streaming, each level may be described as a SubRepresentation element (as illustrated in XML schema 825).
As set forth above, since the levels L0, L1, L2 each map to multiple byte ranges, access to a specific level is not optimal, because it requires multiple requests from the streaming clients.
Figure 9 illustrates a first example of reordering and mapping samples having levels associated therewith.
According to the illustrated example, the run of reordered samples 900 comprises first samples 905 that correspond to the most important samples for the decoding process. For example, they may correspond to random access points (RAP) or to reference pictures for other (less important) samples. These samples form a contiguous byte range (referenced 905).
The run of reordered samples 900 may contain other sets of samples such as the sets of samples 910 and 915, corresponding to samples having lower levels. Each of these sets corresponds to a contiguous byte range in the media data box of the file. A level is associated with each of them.
As illustrated, index 950 is a sub-segment index box according to ISOBMFF where each set of the corresponding samples 905, 910, and 915 is respectively mapped to a level. According to the illustrated example, each set of samples 905, 910, and 915 is associated with level 0, referenced 955, level 1, referenced 960, or level 2, referenced 965. This means that the media file may contain, associated with one or more movie fragments (indicated by the subsegment_count field of the 'ssix' box), an index such as index 950. This index provides the byte range to a given level of samples. In this example, the range_count field of the 'ssix' box 950 is set to 3, corresponding to the number of levels for the samples in the subsegment. This is useful to describe the one or more movie fragments in a streaming manifest, for example in a DASH media presentation description, as a set of alternative encoded versions of the one or more movie fragments.
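The computation of such per-level byte ranges from a reordered run can be sketched as follows (an illustrative helper, assuming the samples have already been reordered so that each level forms a contiguous byte range, as for samples 905 to 915):

```python
def level_byte_ranges(sample_sizes, sample_levels):
    """Compute one contiguous (level, offset, length) range per level for a
    run of samples grouped by level, as an 'ssix'-like index would describe.

    sample_sizes and sample_levels are given in the reordered (stored) order;
    adjacent samples of the same level are merged into a single range."""
    ranges = []
    offset = 0
    for size, level in zip(sample_sizes, sample_levels):
        if ranges and ranges[-1][0] == level:
            lvl, off, length = ranges[-1]
            ranges[-1] = (lvl, off, length + size)
        else:
            ranges.append((level, offset, size))
        offset += size
    return ranges
```

With one contiguous range per level, a streaming client can fetch a whole level in a single byte-range request, which is the benefit discussed above.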
An additional box denoted 'leva' (not represented) may also be available in the media file to indicate on which basis the levels are defined: for example on the basis of tracks, sub-samples, or sample groups. This depends on the kind of levels. For example, the levels may be mapped into a sample group of the 'rap' type (random access points), then taking assignment_type=0. In the example of Figure 9, the samples 905 would correspond to RAP (Random Access Point) samples while the remaining samples (i.e. the concatenation of samples 910 and 915) would correspond to a single byte range of non-RAP samples, i.e. to the samples not mapped into the 'rap' sample group. However, this only allows a 2-level mapping: samples in the 'rap' sample group versus samples out of the 'rap' sample group. The levels may also be mapped into the 'tele' sample group indicating samples with dependencies and samples without any dependency for a given temporal level, layer or sub-layer. The levels may be mapped into a 'trif' sample group to describe spatial accesses.
For example, a first level may correspond to a region of interest and another level to the rest of the picture. The samples associated with the level corresponding to the region of interest are considered as having a higher priority than the samples associated with the level corresponding to the rest of the picture. For spatial access, the levels may, alternatively to 'trif' sample groups, be mapped into HEVC tile sub-tracks. It may also be of interest to map the levels in the 'leva' box onto layers that may be present in the video bit-stream. For example, in Layered HEVC, the layer information is encapsulated as a sample group called 'linf' for Layer information. The assignment_type of the 'leva' box may be set to zero and its grouping_type to the 'linf' sample group. Alternatively, a dedicated assignment_type may be used for the 'leva' box to indicate that levels map to layers and to provide as an additional parameter the four-character code of a box providing layer information. Using such an assignment type makes it possible to map levels to layers independently of the codec in use, i.e. in a generic way. For example, assuming the reordering of a track containing two layers, each with two temporal sub-layers, the levels of 'leva' may be mapped as follows: {LID0, TID0}, {LID0, TID1}, {LID1, TID0}, {LID1, TID1}, where LID represents a layer identifier and TID represents a temporal sub-layer identifier.
Depending on the coding dependencies, for example assuming that {LID0, TID1} and {LID1, TID0} do not depend on each other, the level assignment remains flexible, resulting in either:
level1 = linf{LID0, TID0}
level2 = linf{LID0, TID1}
level3 = linf{LID1, TID0}
level4 = linf{LID1, TID1}
or
level1 = linf{LID0, TID0}
level2 = linf{LID1, TID0}
level3 = linf{LID0, TID1}
level4 = linf{LID1, TID1}
The level assignment may benefit from additional signaling regarding the dependencies between levels, for example to clarify when a given level depends on another level, because it is not necessarily the case that a level N depends on level N-1. When the assignment_type is set to a dedicated value for layers, sub-layers, or layers and sub-layers, an additional field is provided to indicate a list of dependent levels. This provides a complete map of levels with their dependencies.
In a particular embodiment, a new assignment_type value is defined for the 'leva' box to indicate that the levels correspond to a reordered track run box (compact or not). For example, the value assignment_type=5 (or any value not already used for an assignment_type) may be chosen for such a configuration. Then, the range_count in the 'ssix' box shall not be greater than the sample count declared in the corresponding track run box (compact or not). The association between a 'trun' box and the ranges may be done by ssix::subsample_number, which provides the index of the 'trun' box. Having the 'trun' reordered reduces the number of ranges to signal. For example, compared to the known DASH formats, each level for a given subsegment could be obtained by streaming clients from a single request.
The definition of the LevelAssignmentBox, which provides a mapping from features, such as scalability layers, to levels (it being noted that a feature can be specified through a track, a sub-track within a track, a sample grouping of a track, or a reordered track run of a track), may then be modified, in the 'leva' box (section 8.8.13.2 of ISO/IEC 14496-12), as follows:
aligned(8) class LevelAssignmentBox extends FullBox('leva', 0, 0)
{
unsigned int(8) level_count;
for (j=1; j <= level_count; j++) {
unsigned int(32) track_ID;
unsigned int(1) padding_flag;
unsigned int(7) assignment_type;
if (assignment_type == 0) {
unsigned int(32) grouping_type;
}
else if (assignment_type == 1) {
unsigned int(32) grouping_type;
unsigned int(32) grouping_type_parameter;
}
else if (assignment_type == 2) {}
// no further syntax elements needed
else if (assignment_type == 3) {}
// no further syntax elements needed
else if (assignment_type == 4) {
unsigned int(32) sub_track_ID;
}
else if (assignment_type == 5) {
// no further syntax elements needed
}
// other assignment_type values are reserved
}
}

According to this example, the semantics of the 'leva' box is updated with a new assignment_type value, for example the value 5. The assignment_type indicates the mechanism used to specify the assignment to a level. The assignment_type values greater than 5 are reserved, while the semantics for the other values are specified as follows. A sequence of assignment_type values is restricted to be a set of zero or more of type 2 or 3, followed by zero or more of exactly one type.
According to embodiments, the meaning of the assignmenMype values is as follows:
• 0: sample groups are used to specify levels, i.e., samples mapped to different sample group description indexes of a particular sample grouping lie in different levels within the identified track; other tracks are not affected and shall have all their data in precisely one level;
• 1 : similar to the assignmenMype value 0 except that the assignment is done by a parameterized sample group;
• 2, 3: the level assignment is done by track (see the SubsegmentIndexBox for the difference in processing of these levels);
• 4: the respective level contains the samples for a sub-track. The sub-tracks are specified through the SubTrackBox; other tracks are not affected and shall have all their data in precisely one level; and
• 5: the respective level contains contiguous samples from a reordered track run. The reordering is specified through 'trun' or 'ctrn' boxes having the interleave_index_size greater than 0 (for an indication based on a 2-bit flags value) or the sample_interleave bit set (for an indication based on a one-bit flags value).
The new assignment type to reordered samples in a 'trun' box leads to levels with distinct byte ranges. As such, some levels may have dependencies on other levels. In cases where a self-contained or independent byte range is convenient (for example for single-request addressing), another specific value of assignment_type may be reserved. For example, when assignment_type is set to six (or any value not already used for an assignment_type), the respective levels contain contiguous samples from a reordered track run and each level is self-contained.
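The overlapping, self-contained byte ranges implied by this assignment_type value of six can be sketched as follows (illustrative Python; the function name is an assumption for this description):

```python
def nested_level_ranges(level_lengths):
    """For the self-contained assignment type proposed above, the byte range
    of level N covers levels 1..N, so the ranges all start at offset 0 and
    overlap, each one being independently decodable.

    level_lengths[i] is the byte length contributed by the (i+1)-th level in
    the reordered run; the result is one (offset, length) range per level."""
    ranges = []
    total = 0
    for length in level_lengths:
        total += length
        ranges.append((0, total))
    return ranges
```

A client can then fetch any level with a single request, without needing dependency declarations in the manifest.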
Considering for example the samples and levels from Figure 10, a first level with this assignment_type would correspond to the set of samples 1005, a second level would correspond to the sets of samples 1005 and 1010, and a third level would correspond to the whole set of samples. The 'ssix' box is then modified to allow overlapping byte ranges for a given level when the assignment_type is set to 6. This avoids declaring dependencies in the streaming manifest and allows efficient access to a given level (a single request for a given time interval). Since, for this specific assignment type, there may be levels that are self-contained and levels that are not, the 'leva' box provides in the declaration of levels an indication of whether a given level is self-contained or not. The 'leva' box may be modified as follows:
aligned(8) class LevelAssignmentBox extends FullBox('leva', 0, 0)
{
unsigned int(8) level_count;
for (j=1; j <= level_count; j++) {
unsigned int(32) track_ID;
unsigned int(1) padding_flag;
unsigned int(7) assignment_type;
if (assignment_type == 0) {
unsigned int(32) grouping_type;
}
else if (assignment_type == 1) {
unsigned int(32) grouping_type;
unsigned int(32) grouping_type_parameter;
}
else if (assignment_type == 2) {}
//no further syntax elements needed
else if (assignment_type == 3) {}
//no further syntax elements needed
else if (assignment_type == 4) {
unsigned int(32) sub_track_ID;
}
else if (assignment_type == 5) {
//no further syntax elements needed
}
else if (assignment_type == 6) {
unsigned int (8) self_contained_level;
}
// other assignment_type values are reserved
}
}

This information may be useful when exposing levels in DASH to easily set the Representation dependencies or SubRepresentation dependencies.
Mapping reordered track runs into DASH MPD: SubRepresentation
Based on the generation of the 'sidx', 'ssix', and 'leva' boxes by the encapsulation module, a DASH packager is able to describe the different levels as DASH SubRepresentation elements, one per level. The SubRepresentation elements for lower priority levels may have their dependencyLevel attribute set to the SubRepresentation elements for higher priority levels. This description may be convenient when the MPD provides one SubRepresentation per level of reordering (for example when the 'leva' box has an assignment_type set to 5). The dependencies between the levels of reordering are reflected by the dependencyLevel attribute of the Representation.
Alternatively, the different priority levels may be described in ContentComponent elements and each SubRepresentation has its contentComponent attribute referring to one or more of these ContentComponent elements. This may be useful for SubRepresentation elements providing nested levels (for example with the 'leva' assignment_type set to 6). For example, assuming that samples in a track run are reordered according to three levels, a first SubRepresentation provides access to level 1, a second provides access to levels 1 and 2, and the third provides access to all the levels.
The SubRepresentation's bandwidth is computed from the sample durations and sample sizes of the corresponding byte ranges. The SubRepresentation's framerate can be computed if one SubRepresentation is declared as one or more ContentComponent elements or as a SubRepresentation containing nested or self-contained levels (for example when the 'leva' box has an assignment_type set to 6). These fields allow a streaming client to dynamically adapt the transmission by selecting one or a set of SubRepresentation elements. The SubRepresentation's maxPlayoutRate attribute may also be set to the value indicating the speed-up of the normal playout rate brought by the associated SubRepresentation.
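The bandwidth computation described above can be sketched as follows (an illustrative approximation; the function and parameter names are assumptions for this description):

```python
def subrep_bandwidth(sample_sizes_bytes, sample_durations, timescale):
    """Approximate a SubRepresentation's bandwidth (bits per second) from the
    sizes and durations of the samples in its byte ranges.

    sample_durations are expressed in timescale units, as in a 'trun' box."""
    total_bits = 8 * sum(sample_sizes_bytes)
    total_seconds = sum(sample_durations) / timescale
    return total_bits / total_seconds
```

For example, two 1000-byte samples of 3000 units each at a timescale of 30000 span 0.2 seconds, giving a bandwidth of 80000 bits per second; the framerate follows similarly as the sample count divided by the total duration.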
Mapping reordered track runs into DASH MPD: Representation
Figure 10 illustrates the mapping of reordered track runs into a DASH Representation. For the sake of illustration, the segment 1000 comprises samples reordered according to three priority levels. Such reordering leads to three byte ranges referenced 1005, 1010, and 1015. Each Representation provides an alternative encoded version of this segment in terms of number of samples. By doing so, the segments are self-contained and a complete sequence of segments can be obtained from a single Representation. Moreover, the bandwidth and framerate are exact and can be used by the player to select one or another Representation.
Still for the sake of illustration, the Representation with id=1 provides a description of only the random access samples, the Representation with id=2 provides a description of a version of the video with a reduced frame rate, and the last Representation, with id=3, provides a description of the whole segment.
Each segment represented in Figure 10 may be addressed with a URL template, which is convenient for live streaming. Each segment of these alternative Representations may have different availability times. Seamless switching along time is possible between these alternative Representations from one segment to another. With such an approach, the generation of the 'sidx', 'ssix', or 'leva' boxes during the encapsulation becomes optional but can be convenient for a DASH packager when available. Only some storage overhead at the server side is generated by the multiple DASH segments from a single ISOBMFF fragment, but the processing may even be accelerated, for example by making the segments of Representation 1 available earlier than those of Representations 2 or 3. This may be the case when the reordering is not performed by the encapsulation server but within a proxy preparing the content for streaming. This proxy may obtain the 'trun' box with samples in decoding order and rewrite the fragment in an appropriate order based on the levels, on the fly, while generating the corresponding DASH segments.
Therefore, according to an aspect of the invention there is provided a method for encapsulating timed media data, the method comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data, the samples of the set of contiguous samples being ordered according to a first ordering;
generating metadata describing the obtained fragment; and
encapsulating samples of the set of contiguous samples and the generated metadata, samples of the set of contiguous samples being encapsulated according to a second ordering, the second ordering depending on a level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples upon decapsulation, wherein the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.

Figure 11 is a schematic block diagram of a computing device 1100 for the implementation of one or more embodiments of the invention, in particular all or some of the steps described by reference to Figures 3, 4, 6, and 7. The computing device 1100 may be a device such as a micro-computer, a workstation or a light portable device. The computing device 1100 comprises a communication bus connected to:
- a central processing unit (CPU) 1101, such as a microprocessor;
- a random access memory (RAM) 1102 for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method for reading and writing the manifests and/or for encoding the video and/or for reading or generating data under a given file format; the memory capacity thereof can be expanded by an optional RAM connected to an expansion port, for example;
- a read only memory (ROM) 1103 for storing computer programs for implementing embodiments of the invention;
- a network interface 1104 that is, in turn, typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 1104 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 1101;
- a user interface (UI) 1105 for receiving inputs from a user or displaying information to a user;
- a hard disk (HD) 1106;
- an I/O module 1107 for receiving/sending data from/to external devices such as a video source or display.
The executable code may be stored either in the read only memory 1103, on the hard disk 1106 or on a removable digital medium such as a disk. According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 1104, in order to be stored in one of the storage means of the communication device 1100, such as the hard disk 1106, before being executed.
The central processing unit 1101 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 1101 is capable of executing instructions from main RAM memory 1102 relating to a software application after those instructions have been loaded from the program ROM 1103 or the hard disk (HD) 1106, for example. Such a software application, when executed by the CPU 1101, causes the steps of the flowcharts shown in the previous figures to be performed.
In this embodiment, the apparatus is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a person skilled in the art which lie within the scope of the present invention.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims

1. A method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, wherein a metadata item comprises a flag indicating whether a data offset is coded on a predetermined size or not, the data offset referring to the timed media data; and,
encapsulating the timed media data and the generated metadata.
2. A method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable size, wherein the metadata comprises an indication information indicating whether a sample count field is present or not; and,
encapsulating the timed media data and the generated metadata.
3. A method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, a metadata item of the structured metadata items comprising a configurable parameter having a configurable coding size, wherein the metadata comprises a flag indicating the coding size of the configurable parameter; and,
encapsulating the timed media data and the generated metadata.
4. A method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data;
generating metadata describing the obtained fragment, the metadata comprising structured metadata items, wherein a metadata item of the structured metadata items comprises an indication information indicating whether a composition time offset parameter is coded as a multiple of a sample duration or of a time scale; and,
encapsulating the timed media data and the generated metadata.
5. The method of any one of claims 1 to 4, wherein the samples of the set of contiguous samples of the timed media data are ordered according to a first ordering, wherein the samples of the set of contiguous samples are encapsulated according to a second ordering, the second ordering depending on a priority level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation, and wherein the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
6. The method of claim 5, wherein the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
7. The method of claim 6, wherein each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
8. The method of any one of claims 5 to 7, wherein the samples of the set of contiguous samples are encapsulated using the generated metadata.
9. The method of any one of claims 1 to 8, further comprising obtaining a priority map associated with the samples of the set of contiguous samples, the reordering information being determined as a function of the obtained priority map.
10. A method for encapsulating timed media data, the method comprising:
obtaining a fragment of the timed media data, the fragment comprising a set of contiguous samples of the timed media data, the samples of the set of contiguous samples being ordered according to a first ordering;
generating metadata describing the obtained fragment; and
encapsulating samples of the set of contiguous samples and the generated metadata, samples of the set of contiguous samples being encapsulated according to a second ordering, the second ordering depending on a level associated with each of the samples of the set of contiguous samples for processing the encapsulated samples, upon decapsulation,
wherein the generated metadata comprise reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
11. The method of any one of claims 1 to 10, wherein the format of the encapsulated timed media data is of the ISOBMFF type or of the CMAF type.
12. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing each of the steps of the method according to any one of claims 1 to 11 when loaded into and executed by the programmable apparatus.
13. A non-transitory computer-readable storage medium storing instructions of a computer program for implementing each of the steps of the method according to any one of claims 1 to 11.
14. A device for transmitting or receiving encapsulated media data, the device comprising a processing unit configured for carrying out each of the steps of the method according to any one of claims 1 to 11.
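The mechanisms recited in the claims above can be illustrated with a minimal sketch: a flag field selects the coding size of the data offset (claims 1 and 3) and whether an explicit sample count is written (claim 2), and a reordering list records the original position of each priority-ordered sample (claims 5 to 10). The record layout, the flag values, and the priority-ascending ordering below are assumptions made for illustration only; they are not the actual syntax defined by this application or by ISOBMFF.

```python
import struct

FLAG_DATA_OFFSET_16 = 0x0001        # data offset coded on 16 bits, not 32 (claims 1, 3)
FLAG_SAMPLE_COUNT_PRESENT = 0x0002  # explicit sample count field present (claim 2)

def encode_compact_run(flags, data_offset, sample_sizes):
    """Serialize a compact run: flags select the coding size of the data
    offset and whether an explicit sample count field is written."""
    out = struct.pack(">H", flags)
    fmt = ">H" if flags & FLAG_DATA_OFFSET_16 else ">I"
    out += struct.pack(fmt, data_offset)
    if flags & FLAG_SAMPLE_COUNT_PRESENT:
        out += struct.pack(">H", len(sample_sizes))
    for size in sample_sizes:
        out += struct.pack(">I", size)
    return out

def decode_compact_run(buf, implicit_count=None):
    """Parse back; when the sample count field is absent, the count must be
    known from other metadata (passed here as implicit_count)."""
    (flags,) = struct.unpack_from(">H", buf, 0)
    pos = 2
    if flags & FLAG_DATA_OFFSET_16:
        (data_offset,) = struct.unpack_from(">H", buf, pos); pos += 2
    else:
        (data_offset,) = struct.unpack_from(">I", buf, pos); pos += 4
    if flags & FLAG_SAMPLE_COUNT_PRESENT:
        (count,) = struct.unpack_from(">H", buf, pos); pos += 2
    else:
        count = implicit_count
    sizes = [struct.unpack_from(">I", buf, pos + 4 * i)[0] for i in range(count)]
    return flags, data_offset, sizes

def encapsulation_order(priorities):
    """A second ordering for encapsulation, here simply by ascending
    priority value. The returned list is the reordering information:
    entry k is the original position of the k-th encapsulated sample."""
    return sorted(range(len(priorities)), key=lambda i: priorities[i])

def restore_first_ordering(received_samples, reordering_info):
    """Decapsulation side: put samples back into the first ordering."""
    original = [None] * len(received_samples)
    for k, original_index in enumerate(reordering_info):
        original[original_index] = received_samples[k]
    return original
```

When neither flag is set, the run costs 2 bytes of flags plus a 4-byte offset, and the sample count must be derived from other boxes; setting FLAG_DATA_OFFSET_16 saves 2 bytes per run, which is the kind of per-fragment overhead reduction the claims target.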
PCT/EP2019/075372 2018-09-20 2019-09-20 Method, device, and computer program for improving transmission of encoded media data WO2020058494A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1815364.3A GB2583885A (en) 2018-09-20 2018-09-20 Method, device, and computer program for improving transmission of encoded media data
GB1815364.3 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020058494A1 true WO2020058494A1 (en) 2020-03-26

Family

ID=64024186

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/075372 WO2020058494A1 (en) 2018-09-20 2019-09-20 Method, device, and computer program for improving transmission of encoded media data

Country Status (2)

Country Link
GB (1) GB2583885A (en)
WO (1) WO2020058494A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120023250A1 (en) * 2010-07-20 2012-01-26 Qualcomm Incorporated Arranging sub-track fragments for streaming video data
US10044369B1 (en) * 2018-03-16 2018-08-07 Centri Technology, Inc. Interleaved codes for dynamic sizeable headers

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Information technology - JPEG 2000 image coding system - Part 12: ISO base media file format", ISO/IEC 15444-12:2015, IEC, 3, RUE DE VAREMBÉ, PO BOX 131, CH-1211 GENEVA 20, SWITZERLAND, 25 November 2015 (2015-11-25), pages 1 - 233, XP082009945 *
MURIEL DESCHANEL: "CMAF DIS January 2017 - w16632-CMAF Study of DIS- Feb 6 - clean", 12 May 2017 (2017-05-12), XP017853884, Retrieved from the Internet <URL:https://www.dvb.org/resources/restricted/members/documents/TM-IPI/TM-IPI3268_CMAF-DIS-January-2017.zip w16632-CMAF Study of DIS- Feb 6 - clean.doc> [retrieved on 20170512] *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022143615A1 (en) * 2020-12-28 2022-07-07 Beijing Bytedance Network Technology Co., Ltd. Cross random access point sample group
WO2022148651A1 (en) * 2021-01-06 2022-07-14 Canon Kabushiki Kaisha Method, device, and computer program for optimizing encapsulation of images
CN113438691A (en) * 2021-05-27 2021-09-24 翱捷科技股份有限公司 TAS frame processing method and device
CN113438691B (en) * 2021-05-27 2024-01-05 翱捷科技股份有限公司 TAS frame processing method and device
CN114371674A (en) * 2021-12-30 2022-04-19 中国矿业大学 Method and device for sending analog data frame, storage medium and electronic device
CN114371674B (en) * 2021-12-30 2024-04-05 中国矿业大学 Method and device for sending analog data frame, storage medium and electronic device
EP4266690A1 (en) * 2022-04-19 2023-10-25 Nokia Technologies Oy An apparatus, a method and a computer program for video coding and decoding
EP4274245A1 (en) * 2022-05-05 2023-11-08 Lemon Inc. Signaling of preselection information in media files based on a movie-level track group information box
CN117061189A (en) * 2023-08-26 2023-11-14 上海六坊信息科技有限公司 Data packet transmission method and system based on data encryption
CN117061189B (en) * 2023-08-26 2024-01-30 上海六坊信息科技有限公司 Data packet transmission method and system based on data encryption

Also Published As

Publication number Publication date
GB201815364D0 (en) 2018-11-07
GB2583885A (en) 2020-11-18

Similar Documents

Publication Publication Date Title
WO2020058494A1 (en) Method, device, and computer program for improving transmission of encoded media data
KR102406887B1 (en) Method, device, and computer program for generating timed media data
CN110447234B (en) Method, apparatus and storage medium for processing media data and generating bit stream
JP6572222B2 (en) Media file generation method, generation device, and program
US11805302B2 (en) Method, device, and computer program for transmitting portions of encapsulated media content
US11638066B2 (en) Method, device and computer program for encapsulating media data into a media file
US12081846B2 (en) Method, device, and computer program for improving encapsulation of media content
US20220167025A1 (en) Method, device, and computer program for optimizing transmission of portions of encapsulated media content
US20230370659A1 (en) Method, device, and computer program for optimizing indexing of portions of encapsulated media content data
CN113574903B (en) Method and apparatus for late binding in media content
US11575951B2 (en) Method, device, and computer program for signaling available portions of encapsulated media content
WO2022148650A1 (en) Method, device, and computer program for encapsulating timed media content data in a single track of encapsulated media content data
EP4068781A1 (en) File format with identified media data box mapping with track fragment box
GB2620582A (en) Method, device, and computer program for improving indexing of portions of encapsulated media data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19773422

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19773422

Country of ref document: EP

Kind code of ref document: A1