GB2583885A - Method, device, and computer program for improving transmission of encoded media data


Info

Publication number
GB2583885A
GB2583885A
Authority
GB
United Kingdom
Prior art keywords
sample
samples
media data
encapsulated
box
Prior art date
Legal status
Withdrawn
Application number
GB1815364.3A
Other versions
GB201815364D0 (en)
Inventor
Denoual Franck
Maze Frédéric
Ouedraogo Naël
Le Feuvre Jean
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1815364.3A priority Critical patent/GB2583885A/en
Publication of GB201815364D0 publication Critical patent/GB201815364D0/en
Priority to PCT/EP2019/075372 priority patent/WO2020058494A1/en
Publication of GB2583885A publication Critical patent/GB2583885A/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440227Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by decomposing into layers, e.g. base layer and one or more enhancement layers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/50Queue scheduling
    • H04L47/62Queue scheduling characterised by scheduling criteria
    • H04L47/625Queue scheduling characterised by scheduling criteria for service slots or service orders
    • H04L47/6275Queue scheduling characterised by scheduling criteria for service slots or service orders based on priority
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/61Network streaming of media packets for supporting one-way streaming services, e.g. Internet radio
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/70Media network packetisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22Parsing or analysis of headers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234327Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into layers, e.g. base layer and one or more enhancement layers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234345Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440245Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2212/00Encapsulation of packets

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Encapsulating encoded media data from a server to a client by obtaining samples of the encoded media data, the samples of the encoded media data being ordered according to a first ordering; and encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples upon decapsulation. Reordering information is also encapsulated, for reordering the encapsulated samples according to the first ordering, upon decapsulation. Preferably the media data is timed media data, and the obtained samples correspond to a plurality of contiguous timed media data samples. A priority map may also be obtained, where the reordering information is a function of the priority map. The priority levels may be determined as a function of dependencies between the obtained samples of the media data. Preferably the encapsulated media data is of the ISOBMFF or CMAF type. There are further aspects of the invention for the processing of the encapsulated data, a signal, and a media storage device.

Description

METHOD, DEVICE, AND COMPUTER PROGRAM FOR IMPROVING TRANSMISSION OF ENCODED MEDIA DATA
FIELD OF THE INVENTION
The present invention relates to methods and devices for improving transmission of encoded media data and to methods and devices for encapsulating and parsing media data.
BACKGROUND OF THE INVENTION
The invention relates to encapsulating, parsing and streaming media content, e.g. according to the ISO Base Media File Format as defined by the MPEG standardization organization, to provide a flexible and extensible format that facilitates interchange, management, editing, and presentation of groups of media content, and to improve its delivery, for example over an IP network such as the Internet, using an adaptive HTTP streaming protocol.
The International Standard Organization Base Media File Format (ISO BMFF, ISO/IEC 14496-12) is a well-known flexible and extensible format that describes encoded timed media data bitstreams either for local storage or for transmission via a network or via another bitstream delivery mechanism. This file format has several extensions, e.g. Part-15, ISO/IEC 14496-15, that describes encapsulation tools for various NAL (Network Abstraction Layer) unit based video encoding formats. Examples of such encoding formats are AVC (Advanced Video Coding), SVC (Scalable Video Coding), HEVC (High Efficiency Video Coding) or L-HEVC (Layered HEVC). Another example of file format extension is the Image File Format, ISO/IEC 23008-12, that describes encapsulation tools for still images or sequences of still images such as HEVC Still Image. This file format is object-oriented. It is composed of building blocks called boxes (data structures characterized by a four character code) that are sequentially or hierarchically organized and that define descriptive parameters of the encoded timed media data bitstream such as timing and structure parameters. In the file format, the overall presentation over time is called a movie. The movie is described by a movie box (with four character code 'moov') at the top level of the media or presentation file. This movie box represents an initialization information container containing a set of various boxes describing the presentation. It is logically divided into tracks represented by track boxes (with four character code 'trak'). Each track (uniquely identified by a track identifier (track_ID)) represents a timed sequence of media data pertaining to the presentation (frames of video, for example). Within each track, each timed unit of data is called a sample; this might be a frame of video, audio or timed metadata. Samples are implicitly numbered in sequence. The actual sample data are stored in boxes called Media Data Boxes (with four character code 'mdat') at the same level as the movie box. The movie may also be fragmented, i.e. organized temporally as a movie box containing information for the whole presentation followed by a list of pairs of movie fragment and Media Data boxes. Within a movie fragment (box with four character code 'moof') there is a set of track fragments (box with four character code 'traf'), zero or more per movie fragment. The track fragments in turn contain zero or more track run boxes (with four character code 'trun'), each of which documents a contiguous run of samples for that track fragment.
The MPEG Common Media Application Format (MPEG CMAF, ISO/IEC 23000-19) derives from ISOBMFF and provides an optimized file format for streaming delivery. CMAF specifies CMAF addressable media objects derived from encoded CMAF fragments, which can be referenced as resources by a manifest. A CMAF fragment is an encoded ISOBMFF media segment, i.e. one or more Movie Fragment Boxes ('moof', 'traf', etc.) with their associated media data 'mdat' and other possible associated boxes. CMAF also defines the CMAF chunk, which is a single pair of 'moof' and 'mdat' boxes, and CMAF segments, which are addressable media resources containing one or more CMAF fragments.
Media data encapsulated with ISOBMFF or CMAF can be used for adaptive streaming with HTTP. For example, MPEG DASH (for "Dynamic Adaptive Streaming over HTTP") and Smooth Streaming are HTTP adaptive streaming protocols that allow segment or fragment based delivery of media files. The MPEG DASH standard (see "ISO/IEC 23009-1, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats") makes it possible to create an association between a compact description of the content(s) of a media presentation and the HTTP addresses. Usually, this association is described in a file called a manifest file or description file. In the context of DASH, this manifest file is also called the MPD file (for Media Presentation Description). When a client device gets the MPD file, the description of each encoded and deliverable version of media content can easily be determined by the client. By reading or parsing the manifest file, the client is aware of the kind of media content components proposed in the media presentation and is aware of the HTTP addresses for downloading the associated media content components. Therefore, it can decide which media content components to download (via HTTP requests) and to play (decoding and playing after reception of the media data segments). DASH defines several types of segments, mainly initialization segments, media segments and index segments. An initialization segment contains setup information and metadata describing the media content, typically at least the 'ftyp' and 'moov' boxes of an ISOBMFF media file. A media segment contains the media data. It can be, for example, one or more 'moof' plus 'mdat' boxes of an ISOBMFF file or a byte range in the 'mdat' box of an ISOBMFF file. It can be, for example, a CMAF segment or an ISOBMFF segment. A Media Segment may be further subdivided into Subsegments (also corresponding to one or more complete 'moof' plus 'mdat' boxes). The DASH manifest may provide segment URLs or a base URL to the file with byte ranges to segments for a streaming client to address these segments through HTTP requests. The byte range information may be provided by index segments or by specific ISOBMFF boxes like the Segment Index Box 'sidx' or the SubSegment Index Box 'ssix'.
In a classic adaptive streaming over HTTP session, it may happen that a client aborts the transfer of a media segment that cannot be delivered on time. This is especially true when working with low buffer levels. Clients usually handle this situation as follows:
- if enough time remains until the due display time of the next segment, a lower quality version of that segment is fetched. This may arise when the download cancellation was performed early enough. The player can only hope to have enough time to fetch the alternate quality;
- if not enough time remains, no alternative version of the segment is fetched.
In both cases, if the segment is not fully downloaded, the player either loses the entire segment or tries to decode what was received. This results in a display freeze, whose duration depends on the amount of lost data.
The present invention has been devised to address one or more of the foregoing concerns and more generally to improve transmission of encoded media data.
SUMMARY OF THE INVENTION
According to a first aspect of the invention there is provided a method for encapsulating encoded media data, the method comprising: obtaining samples of the encoded media data, the samples of the encoded media data being ordered according to a first ordering; and encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples upon decapsulation; and reordering information associated with the encapsulated samples for reordering the encapsulated samples according to the first ordering, upon decapsulation. Accordingly, the method of the invention makes it possible to reduce the description cost of fragmented media data, in particular of fragmented media data conforming to ISOBMFF, and to provide a flexible organisation (reordering) of the media data (samples) with limited signalling overhead. Fragmenting the data and ordering the samples according to a priority level associated with each sample enable transmission of the most important samples first, which leads to reducing freezes of video media display when temporal sublayers are split over fragments and transmission errors occur. In addition, the method of the invention makes it possible for the number of different byte ranges with different FEC (forward error correction) settings to be lowered, hence simplifying and improving the FEC part.
In an embodiment, the media data are timed media data and the obtained samples of the encoded media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
In an embodiment, the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
In an embodiment, each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
In an embodiment, the obtained samples are encapsulated using the metadata associated with the samples.
In an embodiment, the method further comprises obtaining a priority map associated with the obtained samples, the reordering information being determined as a function of the obtained priority map.
In an embodiment, obtaining samples of the encoded media data comprises obtaining samples of the media data and encoding the obtained samples of the media data.
In an embodiment, the priority levels are obtained from the encoding of the obtained samples of the media data.
In an embodiment, the priority levels are determined as a function of dependencies between the obtained samples of the media data.
According to a second aspect of the invention there is provided a method for transmitting encoded media data from a server to a client, the media data being requested by the client, the method being carried out by the server and comprising encapsulating the encoded media data according to the method described above and transmitting, to the client, the encapsulated encoded media data.
The second aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to a third aspect of the invention there is provided a method for processing encapsulated media data, the encapsulated media data comprising encoded samples and metadata, the metadata comprising reordering information, the method comprising: obtaining samples of the encapsulated media data, the obtained samples of the encapsulated media data being ordered according to a second ordering, and reordering information; and reordering the obtained samples in a first ordering according to the obtained reordering information, the first ordering making it possible for the obtained samples to be decoded.
The third aspect of the present invention has advantages similar to the first above-mentioned aspect.
In an embodiment, the media data are timed media data and the obtained samples of the encapsulated media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
In an embodiment, the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
In an embodiment, reordering the obtained samples comprises computing offsets as a function of the parameter values and of coding lengths of the encoded samples.
In an embodiment, the method further comprises decoding the reordered samples.
In an embodiment, the method is carried out in a client, the samples of the encapsulated media data and the reordering information being received from a server.
In an embodiment, the format of the encapsulated media data is of the ISOBMFF type or of the CMAF type.
According to a fourth aspect of the invention there is provided a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
The fourth aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to a fifth aspect of the invention there is provided a media storage device storing a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
The fifth aspect of the present invention has advantages similar to the first above-mentioned aspect.
According to a sixth aspect of the invention there is provided a device for transmitting or receiving encapsulated media data, the device comprising a processing unit configured for carrying out each of the steps of the method described above.
The sixth aspect of the present invention has advantages similar to the first above-mentioned aspect.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
Embodiments of the invention are described below, by way of examples only, and with reference to the following drawings in which: Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages; Figure 2 describes the protocol stack according to embodiments of the invention; Figure 3 illustrates a typical client server system for media streaming according to embodiments of the invention; Figure 4 illustrates an example of processing carried out in a media server and in a media client, according to embodiments; Figure 5a illustrates an example of dependencies of video frames that are to be taken into account for coding or decoding a frame; Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps; Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream; Figure 6b is an example of a data structure used for reordering samples; Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream in an encoded video stream; and Figure 8 schematically illustrates a processing device configured to implement at least one embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
A video bit-stream is usually organized into Groups Of Pictures (GOP). This is the case for example with MPEG video compression standards like AVC/H.264 or HEVC/H.265. For example, a classical I⁰P⁰B¹B²B³ layout at 25 frames per second, with Bᴺ having no dependencies on Bᴺ⁺¹, N being an indication of a level, e.g. temporal layer or scalability layer or set of samples with a given priority level, is often encapsulated in a media file or media segments in the decoding order, for example as follows for a one second video: I1⁰ P25⁰ B5¹ B3² B2³ B4³ B9¹ B7² B6³ B8³ B13¹ B11² B10³ B12³ B17¹ B15² B14³ B16³ B21¹ B19² B18³ B20³ B23² B22³ B24³, where the indicia indicate the composition or presentation order.
To improve robustness or the user experience, the inventors have observed that some applications benefit from a different sample organization in the data part of the media file ('mdat' box). An expected layout may be based on priority level values, as follows: I1⁰ P25⁰ B5¹ B9¹ B13¹ B17¹ B21¹ B3² B7² B11² B15² B19² B23² B2³ B4³ B6³ B8³ B10³ B12³ B14³ B16³ B18³ B20³ B22³ B24³. To achieve such an organization of the samples using the standard 'trun' box, one 'trun' may be used each time the sample continuity is broken, i.e. each time a sample in the expected layout (ordered according to the priority level values) is not the sample following the previous sample in decoding order, as follows: TRUN I1⁰ P25⁰ B5¹ TRUN B3² TRUN B2³ B4³ TRUN B9¹ TRUN B7² TRUN B6³ B8³ TRUN B13¹ TRUN B11² TRUN B10³ B12³ TRUN B17¹ TRUN B15² TRUN B14³ B16³ TRUN B21¹ TRUN B19² TRUN B18³ B20³ TRUN B23² TRUN B22³ B24³. However, it is apparent from such a use of the standard 'trun' box that it induces a non-negligible extra description cost (in this case 17 x 20 = 340 bytes, close to 3 kbits/s). This extra description cost also appears for media resources where the access units have smaller and smaller sizes. This is the case for example when a video stream is spatially split into sub-picture tracks or tiles.
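The run-splitting and the 17-box count above can be checked with a short sketch (Python; the lists and helper names are illustrative and not part of any file format specification):

    # Samples of the one-second example, in decoding order; each entry is
    # (frame name, priority level), e.g. ("B5", 1) stands for B5 at level 1.
    decoding_order = [
        ("I1", 0), ("P25", 0),
        ("B5", 1), ("B3", 2), ("B2", 3), ("B4", 3),
        ("B9", 1), ("B7", 2), ("B6", 3), ("B8", 3),
        ("B13", 1), ("B11", 2), ("B10", 3), ("B12", 3),
        ("B17", 1), ("B15", 2), ("B14", 3), ("B16", 3),
        ("B21", 1), ("B19", 2), ("B18", 3), ("B20", 3),
        ("B23", 2), ("B22", 3), ("B24", 3),
    ]

    # Expected 'mdat' layout: stable sort by priority level, level 0 first,
    # keeping the decoding order inside each level.
    mdat_layout = sorted(decoding_order, key=lambda sample: sample[1])
    mdat_position = {name: i for i, (name, _) in enumerate(mdat_layout)}

    # One standard 'trun' per contiguous run: a new run starts whenever the next
    # sample in decoding order is not stored right after the previous one in 'mdat'.
    runs = 1
    for (prev, _), (cur, _) in zip(decoding_order, decoding_order[1:]):
        if mdat_position[cur] != mdat_position[prev] + 1:
            runs += 1
    print(runs)  # 17, matching the 17 'trun' boxes of the example above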
To cope with such a problem, the 'trun' box is improved, in particular by adding a sample processing order that avoids such a repetition of the 'trun' box.
It is to be recalled that different types of 'trun' boxes are defined. ISO/IEC 14496-12 defines the Track Extends Box to define default values used by the movie fragments in a media file or in a set of media segments. The track fragment header box 'tfhd' also sets up information and default values for the runs of samples contained in one movie fragment. A run of samples is described in a 'trun' box with one or more parameters such as a number of samples, an optional data offset, optional dependency information related to the first sample in the run, and, for each sample in the run, optional sample_duration, sample_size, sample_flags (for dependency/priority), and composition time information. While this may be relevant for sample duration, it is not adapted for sample size, especially when samples correspond to video frames. The MPEG group defining ISO/IEC 14496-12 is considering further improvements of the 'trun' box. A compact version of the 'trun' box ('ctrn') is under definition. This compact 'trun' box changes the standard 'trun' box as follows: (a) it provides some configurable field sizes (where the standard 'trun' systematically uses 32 bits per parameter); (b) it changes the loop structure to being a struct that contains a set of arrays, rather than an array of a struct with varying-size fields; and (c) it allows for first-sample variation of all fields, not just the sample_flags parameter (as, for example, the first sample size may be much larger).
A track fragment may contain zero or more standard 'trun' or compact 'ctrn' boxes.
In addition to the compact 'trun' box 'ctrn', new versions of the 'trun' box are also under consideration in MPEG, in particular 'trun' boxes relying on patterns. These versions provide the following features:
- a capability for indicating the sample flags, sample duration, sample size, and sample composition time offset, each with a configurable number of bytes, for the first sample of a track run. The use of this feature can be controlled with a flags value of the TrackRunBox (i.e. the tr_flags); and
- one or more track run patterns of per-sample metadata provided in a new TrackRunPatternBox (identified by the four character code 'trup') in the MovieExtendsBox or MovieFragmentBox. The TrackRunPatternBox enables cyclic assignment of repetitive track run patterns to samples of track runs. One or more track run patterns are specified in the 'trup' box. For each sample in a track run pattern, the sample_duration, sample_flags, sample_composition_time_offset and the number of bits to encode the sample_size are conditionally provided depending on the box flags.
According to embodiments, a sample processing order is indicated in an encapsulated media data file or in a companion file (e.g. a companion file referencing an encapsulated media data file) to give information about data significance of encapsulated data of the encapsulated media data file, the encapsulated data typically comprising media data and descriptive metadata, so that these encapsulated data may be handled appropriately.
The sample processing order may be used at the server end to organise samples of a fragment of encoded media data, according to their priority, for example for transmitting the most important samples first. At the client end, the sample processing order is used to parse a received encapsulated stream and to provide a decodable stream.
The encapsulated media data may be directed to different kinds of media resources or media components such as an image sequence, one or more video tracks with or without associated audio tracks, auxiliary or metadata tracks.
According to embodiments, the sample processing order associated with a file comprising encapsulated media data is defined in the 'trun' box.
Figure 1 illustrates the general architecture of a system comprising a server and a client exchanging HTTP messages. As illustrated, the client denoted 100 sends an HTTP message denoted 140 to the server denoted 110, through a connection denoted 130 established over a network denoted 120.
According to HTTP, the client sends an HTTP request to the server that replies with an HTTP response. Both HTTP request and HTTP response are HTTP messages. For the sake of illustration, HTTP messages can be directed to the exchange of media description information, the exchange of media configuration or description, or the exchange of actual media data. The client may thus be a sender and a receiver of HTTP messages. Likewise, the server may be a sender and a receiver of HTTP messages.
No distinction is made hereafter between HTTP requests and HTTP responses. However, it is generally expected that HTTP requests are sent on a reliable basis while some HTTP responses may be sent on an unreliable basis. Indeed, a common use-case for the unreliable transmission of HTTP messages corresponds to the case according to which the server sends back to the client a media stream in an unreliable way. However, in some cases, the HTTP client could also send an HTTP request in an unreliable way, for example for sending a media stream to the server. At some point, the HTTP client and the HTTP server can also negotiate that they will run in a reliable mode. In such a case, both HTTP requests and responses are sent in a reliable way.
Figure 2 illustrates an example of protocol stacks of a client 200, for example client 100 of Figure 1, and of a server 250, for example server 110 of Figure 1.
The same protocol stack exists on both client 200 and server 250, making it possible to exchange data through a communication network.
At the client's end (200), the protocol stack receives, from application 205, a message to be sent through the network, for example message 140. At the server's end (250), the message is received from the network and, as illustrated, the received message is processed at transport level 275 and then transmitted up to application 255 through the protocol stack that comprises several layers.
At the client's end, the protocol stack contains the application, denoted 205, at the top level. For the sake of illustration, this can be a web application, e.g. a client part running in a web browser. In a particular embodiment, the application is a media streaming application, for example using DASH protocol, to stream media data encapsulated according to ISO Base Media File Format. Underneath is an HTTP layer denoted 210, which implements the HTTP protocol semantics, providing an API (application programming interface) for the application to send and receive messages.
Underneath is a transport adaptation layer (TA layer or TAL). The TAL may be divided into two sublayers: a stream sublayer denoted 215 (TAL-stream, TA Stream sublayer, or TAS sublayer) and a packet sublayer denoted 220 (TAL-packet, TA Packet sublayer, or TAP sublayer), depending on whether the transport layer manipulates streams and packets or only packets. These sublayers enable transport of HTTP messages on top of the UDP layer denoted 225.
At the server's end, the protocol stack contains the same layers. For the sake of illustration, the top level application, denoted 255, may be the server part running in a web server. The HTTP layer denoted 260, the TAS sublayer denoted 265, the TAP sublayer denoted 270, and the UDP layer denoted 275 are the counterparts of the layers 210, 215, 220, and 225, respectively.
From a physical point of view, an item of information to be exchanged between the client and the server is obtained at a given level at the client's end. It is transmitted through all the lower layers down to the network, is physically sent through the network to the server, and is transmitted through all the lower layers at the server's end up to the same level as the initial level at the client's end. For example, an item of information obtained at the HTTP layer from the application layer is encapsulated in an HTTP message. This HTTP message is then transmitted to TA stream sublayer 215, which transmits it to TA Packet sublayer 220, and so on down to the physical network.
At the server's end, the HTTP message is received from the physical network and transmitted to TA Packet sublayer 270, through TA Stream sublayer 265, up to HTTP layer 260, which decodes it to retrieve the item of information so as to provide it to application 255.
From a logical point of view, a message is generated at any level, transmitted through the network, and received by the server at the same level. From this point of view, all the lower layers are an abstraction that makes it possible to transmit a message from a client to a server. This logical point of view is adopted below.
According to embodiments, the transport adaptation layer (TAL) is a transport protocol built on top of UDP and targeted at transporting HTTP messages.
At a higher level, TAS sublayer provides streams that are bi-directional logical channels. When transporting HTTP messages, a stream is used to transport a request from the client to the server and the corresponding response from the server back to the client. As such, a TA stream is used for each pair of request and response. In addition, one TA stream associated with a request and response exchange is dedicated to carrying the request body and the response body.
All the header fields of the HTTP requests and responses are carried by a specific TA stream. These header fields may be encoded using HPACK when the version of HTTP in use is HTTP/2 (HPACK is a compression format for efficiently representing HTTP header fields, to be used in HTTP/2).
To transfer data belonging to a TA stream, data may be split into TA frames.
One or more TA frames may be encapsulated into a TA packet, which may itself be encapsulated into a UDP packet to be transferred between the client and the server. There are several types of TA frames: the STREAM frames carry data corresponding to TA streams, the ACK frames carry control information about received TA packets, and other frames are used for controlling the TA connection. There are also several types of TA packets, one of those being used to carry TA frames.
Figure 3 illustrates an example of a client-server system wherein embodiments of the invention may be implemented. It is to be noted that the implementation of the invention is not limited to such a system as it may concern the generation of media files that may be distributed in any way, not only by streaming over a communication network but also for local storage and rendering by a media player. As illustrated, the system comprises, at the server's end, media encoders 300, in particular a video encoder, a media packager 310 to encapsulate data, and a media server 320. According to the illustrated example, media packager 310 comprises a NALU (NAL Unit) parser 311, a memory 312, and an ISOBMFF writer 313. It is to be noted that the media packager 310 may use a file format other than ISOBMFF. The media server 320 can generate a manifest file (also known as a media presentation description (MPD) file) 321 and media segments 322.
At the client's end, the system further comprises media client 350 having ISOBMFF parser 352, media decoders 353, in particular a video decoder, a display 354, and an HTTP client 351 that supports adaptive HTTP streaming, in particular parsing of the streaming manifest, denoted 359, to control the streaming of media segments 390. According to the illustrated example, media client 350 further contains transformation module 355, which is a module capable of performing operations on encoded bit-streams (e.g. concatenation) and/or decoded pictures (e.g. post-filtering, cropping, etc.). Typically, media client 350 requests manifest file 321 in order to get the description of the different media representations available on media server 320 that compose a media presentation. In response to receiving the manifest file, media client 350 requests the media segments (denoted 322) it is interested in. These requests are made via HTTP module 351. The received media segments are then parsed by ISOBMFF parser 352, decoded by video decoder 353, and optionally transformed or post-processed in transformation unit 355, to be played on display 354.
A video sequence is typically encoded by a video encoder of media encoders 300, for example a video encoder of the H.264/AVC or H.265/HEVC type. The resulting bit-stream is encapsulated into one or several files by media packager 310 and the generated files are made available to clients by media server 320.
According to embodiments of the invention, the system further comprises an ordering unit 330 that may be part of the media packager or not. The ordering unit aims at defining the order of the samples so as to optimize the transmission of a fragment.
Such an order may be defined automatically, for example based on a priority level associated with each sample, that may correspond to the decoding order.
It is to be noted that the media server is optional in the sense that embodiments of the invention mainly deal with the description of encapsulated media files in order to provide information about data significance of encapsulated media data of the encapsulated media file, so that the encapsulated media data may be handled appropriately when they are transmitted and/or when they are received. As for the media server, the transmission part (HTTP module and manifest parser) is optional in the sense that embodiments of the invention also apply for a media client consisting of a simple media player to which the encapsulated media file with its description is provided for rendering. The media file can be provided by full download, by progressive download, by adaptive streaming or just by reading the media file on a disk or from a memory. According to embodiments, ordering of the samples can be done by a media packager such as media packager module 310 in Figure 3 and more specifically by ISOBMFF writer module 313 in cooperation with ordering unit 330, comprising software code, when executed by a microprocessor such as CPU 804 of the server apparatus illustrated in Figure 8.
Typically, the encapsulation module is in charge of reading the high-level syntax of an encoded timed media data bit-stream, e.g. composed of compressed video, audio or metadata, to extract and identify the different elementary units of the bit-stream (e.g. NALUs from a video bit-stream) and of organizing the encoded data in an ISOBMFF file or ISOBMFF segments 322 containing the encoded video bit-stream as one or more tracks, wherein the samples are ordered properly, with descriptive metadata according to the ISOBMFF box hierarchy. Another example of encapsulation format can be the Partial File Format as defined in ISO/IEC 23001-14.
Signaling sample reordering using the 'trun' box
As described above, there exist several types of the 'trun' box, among which the standard version, the compact version, and the 'trun' box using patterns. While the signaling of the sample reordering depends on the type of the 'trun' box used, the reordering of the samples itself, by the server or the client, is the same for all the types of 'trun' box.
Figure 4 illustrates an example of processing carried out in media server 400 and in media client 450, according to embodiments.
As illustrated, a video stream is encoded in a video encoder (step 405), which may be similar to media encoder 300 in Figure 3. The encoded video stream is provided to a packager, which may be similar to ISOBMFF writer 313 in Figure 3, to be encapsulated into a media file or into media segments (step 410). As illustrated, the encapsulating step comprises a reordering step (step 415) during which the samples are reordered according to the needs of an application. The encapsulated stream, wherein the samples are ordered according to the second sample order, may be stored in a repository or in a server for later or live transmission (step 420), with the descriptive metadata allowing reorganization of the samples according to the first sample order. The transmission may use reliable protocols like HTTP or unreliable protocols like QUIC or RTP. The transmission may be segment-based or chunk-based depending on the latency requirements.
The encoder and the packager may be implemented in the same device or in different devices. They may operate in real-time or with a low delay.
According to particular embodiments, the packager re-encapsulates an already encapsulated file to change the encapsulation order of the samples, so as to fit with application needs or re-encapsulates the samples just before their transmission over a communication network.
Encapsulating step 410, comprising reordering step 415, aims at placing the encoded video stream into a data part of the file and at generating description metadata providing information on the track(s) as well as a description of the samples. The video stream may be encapsulated with other media resources or metadata, following the same principle: sample data is put in the data part (e.g. 'mdat' box) of the media file or media segment and descriptive metadata (e.g. 'moov', 'trak' or 'moof', 'traf') are generated to describe how the sample data are actually organized within the data part.
Reordering step 415 reorders samples that are received in a first order, for example the order defined by the encoder, and reorganizes these samples according to a second order that is more convenient. For the sake of illustration, such convenience may be directed to the storage (wherein all the Intra are stored together, for example), the transmission (all the reference frames or all the base layers are transmitted first), or the processing (e.g. encryption or forward error correction) of these samples. For example, the second order may be defined according to a priority level or an importance order. Still for example, encapsulating step 410 may receive ordering information or a priority map from a user controlling the encapsulation process through a user interface, for example via the ordering unit 330 in Figure 3. Such a priority map or ordering information may also be obtained from a video analytics module running on server 400 and analyzing the video stream. It may determine the relative importance of the video frames by carrying out a deep analysis of the video stream (e.g. by using NALU parser 311 in Figure 3) or by inspecting the high level syntax coming with the encoded video stream. It may determine the relative importance between the video frames because it is aware of the encoding parameters and configurations.
When the encapsulating step is directed to re-encapsulating media files previously encapsulated according to a first sample order, the packager may use a priority map accompanying this media file or even a priority map embedded within the media file, for example as a dedicated box. Typically, a priority map provides relative importance information on the video samples or on the corresponding byte ranges of these samples. Alternatively, when present in the sample description, the packager may use information in the sample_flags parameter to obtain information on dependencies (e.g. sample_is_depended_on) and degradation priority. For example, when sample_is_depended_on is equal to 1 or when sample_is_non_sync_sample is equal to 0, the sample is considered as having high priority and may be stored at the beginning of the media data for the fragment. Alternatively, it may use information from the SampleDependencyTypeBox or DegradationPriorityBox. For example, when sample_has_redundancy is equal to 1, the sample is considered to have low priority and may be stored rather at the end of the media data for the fragment. Once reordered according to priorities, the sample flags may be removed from the sample description to compact the fragment description.
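For illustration, a minimal sketch in Python of such a flag-based priority derivation; the exact mapping of flags to levels is an assumption of this sketch, not something mandated by ISOBMFF:

    def priority_from_flags(sample_is_non_sync_sample, sample_is_depended_on, sample_has_redundancy):
        # Lower values mean higher priority (stored earlier in the fragment).
        if sample_is_non_sync_sample == 0:
            return 0   # sync sample (e.g. IDR frame): highest priority
        if sample_is_depended_on == 1:
            return 1   # other samples depend on it: high priority
        if sample_has_redundancy == 1:
            return 3   # redundant coding: lowest priority, stored last
        return 2       # non-reference, non-redundant sample

    # Example: a non-sync sample that other samples depend on gets level 1.
    print(priority_from_flags(1, 1, 0))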
An example of reordering steps is described in reference to Figure 6a. According to this example, the packager inserts ordering information within the descriptive metadata describing the samples, in terms of byte position (data_offset) and length (sample_size), duration, composition time (for example a 'trun' box). According to embodiments, ordering information comprises an index of the samples in the data part, according to the second sample order. An example of reordering is described by reference to Figure 5b.
Conversely, client 450 reads or receives an encapsulated stream (step 455), wherein the samples are ordered according to the second sample order. The encapsulated stream is parsed (step 460), the parsing step comprising a reordering step (step 465) to reorganize the samples according to the first sample order so that the de-encapsulated stream can be decoded (step 470).
Figure 5a illustrates an example of dependencies of video frames, that are to be taken into account for coding or decoding a frame.
For the sake of illustration, each video frame is represented with a letter and one or more digits, where the letter represents the frame coding type and the digits represent the composition time of the video frame. This frame organization is the classical B-hierarchical scheme from MPEG video compression codecs like HEVC. It is to be noted that it may be used for different types of I/P/B frames and prediction patterns.
The arrows between two video frames indicate that the frame at the start of the arrow is used to predict the frame at the end of the arrow.
For example, frame B3 depends on frames B2 and B4, frame B2 depends on frame I0 and frame B4, and frame B4 depends on frames I0 and P8. Accordingly, frame B3 can be decoded only after frames I0, P8, B2 and B4, it being noted that frame B4 can be decoded only after frames I0 and P8 have been decoded and frame P8 can be decoded only after frame I0.
Figure 5a also illustrates the layers or priority levels of each frame.
Figure 5b illustrates an example of reordering samples of a video stream during encoding and encapsulating steps. According to this example, the samples correspond to the frames illustrated in Figure 5a. For the sake of clarity, the samples are represented using the reference of the frames but it is to be understood that the samples in the streams are a sequence of bits, without explicit references to the frames.
As is apparent from video stream 500, the samples are ordered according to the position of the frames in the video stream. For example, the sample corresponding to frame I0 is located before the sample corresponding to frame B1, which is located before the sample corresponding to frame B2, because frame I0 should be displayed before frame B1, which should be displayed before frame B2, and so on.
In order to optimize coding and decoding, the order of the encoded frames preferably depends on the dependencies of the frames. For example, the sample corresponding to frame B4 should be received before the sample corresponding to frame B2, although frame B4 is displayed after frame B2, because frame B4 is needed to decode frame B2.
Therefore, the samples corresponding to the encoded frames are preferably ordered as a function of the dependencies of the frames, as illustrated with reference 505 that represents the encoded video streams.
Usually, the decoding order corresponds to the sample organization in encapsulated files or segments (for example in CMAF or ISOBMFF). This sample order provides a compliant bit-stream for video decoders. Changing the sample order without any indication in the descriptive metadata may lead to non-compliant bit-streams after parsing and may crash video decoders.
As described above, there exist cases for which the order of the encoded samples is advantageously modified, for example to make it possible to send the most important samples first. An example of such sample reordering is illustrated with reference 510 in Figure 5b. In this example, the samples are reordered according to the layers or priority levels (the frames corresponding to the first layer or to the first priority level are transmitted first, then the frames corresponding to the second layer or to the second priority level are transmitted and so on).
When encoded samples are reordered, the parser must be aware of the modified encoded stream so as to make sure that the output of the parser is a bit-stream compliant with the decoder.
According to embodiments, the indication of the order change, also referred to as interleaving (or reordering) information, is included in the descriptive metadata of the encapsulated file or segment. As illustrated with reference 515, such an indication may comprise a list of indexes of the positions of the samples in the encapsulated stream ('mdat' box). For example, as illustrated with reference 520, the parser may determine that the 10th sample of the encoded stream corresponds to the 3rd sample in the encapsulated stream (that is to say to location '2' if locations are indexed from 0). Therefore, by using the list of indexes, the parser may reconstruct the encoded stream from the encapsulated stream.
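As a minimal sketch (Python; variable names and the toy GOP are illustrative), a parser may apply such a list of indexes, assumed here to give, for each sample in decoding order, its position in the encapsulated stream:

    def to_decoding_order(mdat_samples, index_list):
        # index_list[i] is the position, in the encapsulated ('mdat') order,
        # of the i-th sample of the encoded (decoding-order) stream.
        return [mdat_samples[position] for position in index_list]

    # A toy GOP similar to the one of Figure 5a: 'mdat' holds the samples
    # grouped by priority level.
    encapsulated = ["I0", "P8", "B4", "B2", "B6", "B1", "B3", "B5", "B7"]
    index_list = [0, 1, 2, 3, 5, 6, 4, 7, 8]
    print(to_decoding_order(encapsulated, index_list))
    # ['I0', 'P8', 'B4', 'B2', 'B1', 'B3', 'B6', 'B5', 'B7'] (decoding order)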
According to other embodiments, the indexes of the list of indexes correspond to the location (the order) of the samples of the encapsulated stream in the encoded stream.
In these other embodiments, the sample description follows the order of the encapsulated stream 510, which requires the parser to reorder the sample data before transmission to the decoder. This reordering uses the list of indexes providing the location (order) of the samples in the encoded stream. The packager may include an indication at file level (e.g. in the 'ftyp' box) so that the parsers are able to identify which order (encapsulation or encoding) indexes are provided in the list of indexes 515. The identification may be for example a brand or a compatible brand. Such a brand indicates reordered samples and the actual order used by the packager. For example, a brand 'reor' (this four character code is just an example, any reserved code not conflicting with other four character codes already in use can be used) indicates that samples are reordered. For example, a brand 'reo1' indicates presence of reordering as in 515 (from decoding order to encapsulation order) while, still for example, a brand 'reo2' indicates presence of reordering from encapsulation order 510 to decoding order 505. This indication that encapsulation has been done with reordering may alternatively be included in the box indicating presence of movie fragments (e.g. the 'mvex' or 'mehd' box). For example, it may be an additional field (for example reordering_type) in new versions of the 'mehd' box:
    aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
        if (version==3) {
            unsigned int(64) fragment_duration;
            unsigned int(8) reordering_type;
        } else if (version==2) {
            unsigned int(32) fragment_duration;
            unsigned int(8) reordering_type;
        } else if (version==1) {
            unsigned int(64) fragment_duration;
        } else { // version==0
            unsigned int(32) fragment_duration;
        }
    }
A reordering_type value set to 0 indicates that there is no reordering. When it is set to 1, there is reordering with a list of indexes providing mapping from decoding order 505 to encapsulation order 510 (as 515) and when it is set to 2, there is reordering with a list of indexes providing mapping from encapsulation order 510 to decoding order 505. Other values are reserved.
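For illustration only, a sketch in Python of how a reader might turn the list of indexes into a single decoding-order permutation for both reordering_type values (function and variable names are assumptions of this sketch):

    def decoding_permutation(index_list, reordering_type):
        # Returns, for each sample in decoding order, its position in the
        # encapsulated stream, whatever direction the stored list uses.
        if reordering_type == 0:
            return list(range(len(index_list)))        # no reordering
        if reordering_type == 1:
            return index_list                          # already decoding -> encapsulation
        if reordering_type == 2:
            inverse = [0] * len(index_list)            # invert encapsulation -> decoding
            for encapsulated_pos, decoding_pos in enumerate(index_list):
                inverse[decoding_pos] = encapsulated_pos
            return inverse
        raise ValueError("reserved reordering_type value")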
The packager may decide to reorder only some fragments. In other words, an encapsulated media file or segment may contain fragments with reordering and fragments without reordering.
According to embodiments, the list of indexes is stored in a 'trun' box (e.g. a standard 'trun' box, a compact 'trun' box, or a 'trun' box relying on patterns) when the media file is fragmented.
To extract a sample from the encapsulated stream, the offset of the first byte of this sample in the 'mdat' box is computed from the corresponding index (by multiplying the value of the index by the size of a sample when the sample size is constant across all samples of the run). When it is not constant across the samples, the size of the sample is provided in the 'trun' box through the sample_size field. The sample duration and sample composition time offset may also be provided.
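A minimal sketch of this offset computation, assuming hypothetical variable names, could be:

# Offset of the first byte of the sample stored at 'index' in the 'mdat' box.
def sample_offset_in_mdat(index, sample_sizes, constant_size=None):
    if constant_size is not None:          # constant sample size across the run
        return index * constant_size
    return sum(sample_sizes[:index])       # otherwise, cumulate the preceding sizes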
Figure 6a illustrates an example of steps for reordering samples of an encoded stream in an encapsulated stream. These steps may be carried out in a server, for example in a writer or a packager.
As illustrated, a first step is directed to initializing the packaging module (step 600). For example, the packaging may be set to a particular reordering configuration by a user or through parameter settings during this step. An item of information provided during this step is whether the encapsulation is to be done within fragments or not. When the encapsulation is to be done within fragments, the fragment duration is obtained.
These items of information are useful for the packaging module to prepare the box structure, comprising or not the 'moof' and 'mdat' boxes. When samples are reordered, the packaging module sets the dedicated flags value in the 'trun' box to the appropriate value, as described hereafter.
During initialization, the packaging module receives the number of priority levels to consider and may allocate one sample list per level, except for the first level (for which samples will be directly written in the 'mdat' box). Such an allocation may be made, for example, in memory 312 (in Figure 3) of the server. The packaging module then initializes two indexes denoted index_first and index_second, respectively providing the sample index in the first order and the sample index in the second order. According to the given example, both indexes are initialized to zero. A mapping table (as illustrated in Figure 6b) for reordering the samples is also allocated, with its size corresponding to the number of samples expected per fragment (e.g. the sample_count related to the fragment duration) multiplied by one plus the number of levels. As illustrated in Figure 6b, mapping table 690 contains one table per level (denoted 692, 693, 694) and a mapping list (denoted 691) indexed by the sample index in the first order and providing access to the appropriate table per level (reference 692 or 693 or 694), given the sample level (e.g. the i-th sample with level l is stored in mapping_table[l][i]). Each cell of a table per level can store the sample description (e.g. reference 695) and the sample data (e.g. reference 696). It is to be noted that, to allocate less memory, mapping table 690 may store the sample descriptions in mapping list 691 in addition to the level indication (instead of in one of the 695 cells).
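A possible in-memory layout for such a mapping table, given only as a sketch with hypothetical names, is shown below:

# Minimal sketch of mapping table 690: a mapping list (691) plus one table per
# priority level (692-694), each cell holding a description (695) and data (696).
def allocate_mapping_table(num_levels, samples_per_fragment):
    return {
        "mapping_list": [None] * samples_per_fragment,
        "levels": [
            [{"description": None, "data": None} for _ in range(samples_per_fragment)]
            for _ in range(num_levels)
        ],
    }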
Next, the samples are processed one after another. To that end, the packaging module reads the next sample to be processed (step 605), corresponding, at the beginning, to the first sample of the encoded stream.
The priority level of the read sample is then obtained and index index_first is incremented by one (step 610). The corresponding list (692, 693 or 694) in mapping_table[sample_level] is then selected at step 615. The sample description is computed and stored in the mapping table (one of 695) at step 620. For example, the sample description may be, depending on the values of the 'trun' box's flags ("tr_flags"), the position of the first byte for the sample, the duration of the sample, the size of the sample, or the value of index index_second. A composition time offset may also be provided. The index index_second is initially set to zero. The bytes corresponding to the sample data are also stored in the mapping table (one of 696) at step 625. Description and data are stored at mapping_table[sample_level][index_first]. Then index index_first is incremented by one (630).
This process iterates, by reading another sample from the encoded stream, as long as the end of the fragment is not reached (i.e. the result of test 640 is false). When the end of the fragment is reached (i.e. the result of test 640 is true), the packager flushes the data buffered in the mapping table 690 (step 640). This mapping table is used for reordering samples. At this stage, index index_second for the stored samples is not known and is still set to zero. The packaging module starts flushing the data buffered during steps 620 and 625 (step 645) from a lower level (high priority or most important samples) to a higher level (low priority or less important samples) as follows: list table 691 is read and only the samples pertaining to the current level are candidates for flushing (step 650). One after the other, their sample data (read from one of 696) are written into the data part of the media file or segment, i.e. the 'mdat' box, and their corresponding sample description (read from one of 695) is written into the descriptive metadata part (step 655), for example in the 'trun' box or in any box describing the run of samples.
According to embodiments, the sample description contains, in addition to usual parameters (e.g. sample duration, sample size, etc.), the reordering information (or interleaving index) that is set to the current value of index index_second maintained by the packaging module (actually the position of the last written sample in the data part plus one). The interleaving index shall be unique within the 'trun' box. Each time a sample is flushed (step 655), the index index_second is incremented by one. The packager iterates over the levels (test 660).
At the end, the 'trun' box contains reordering information for the samples of the fragment. When all buffered samples for all levels have been flushed, the packaging module finalizes the encapsulation of the fragment that is stored or transmitted, depending on the application.
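The buffering and flushing of steps 605 to 660 may be sketched as follows, purely for illustration and assuming that the 'trun' keeps the sample descriptions in decoding order while each description records the interleaving index assigned at flush time (all names are hypothetical):

# Minimal sketch of the packager side: buffer per level, then flush the most
# important level first while assigning interleave indexes.
def flush_fragment(samples, num_levels):
    descriptions = [{"size": len(s["data"])} for s in samples]   # decoding order
    mdat = bytearray()
    index_second = 0
    for level in range(num_levels):                              # lower level = higher priority
        for decode_idx, sample in enumerate(samples):
            if sample["level"] != level:
                continue
            descriptions[decode_idx]["interleave_index"] = index_second
            mdat += sample["data"]
            index_second += 1
    return mdat, descriptions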
Examples of the 'trun' box are provided after the description of the parsing process (Figure 7).
Figure 7 illustrates an example of steps for reordering samples of an encapsulated stream into an encoded video stream. These steps are carried out in a client, at the reader end, for example in a parser.
As illustrated, a first step aims at receiving one encapsulated fragment (step 700). According to embodiments, the parser reads the descriptive metadata part (step 705), for example from the 'moof' box and its sub-boxes (e.g. 'traf' and 'trun' boxes). Using this information, the parser determines whether reordering has to be performed or not (step 710). This may be done, for example, by checking the "tr_flags". If no reordering has to be done, the run of samples can be extracted from the 'mdat' box (step 715), sample after sample, from the first byte offset to the last byte, on a standard basis. The last byte position is computed by cumulating sample sizes from the first sample to the last sample of the fragment as indicated by "sample_count" in the 'trun' box.
On the contrary, when samples are to be reordered, the sample description is read (step 720) so as to read each sample index in the expected output order (step 725). Accordingly, reordering information is read by reading the sample reordering information inserted in the 'trun' box (for example in a standard 'trun' box, in a compact 'trun' box, or in a 'trun' box relying on patterns, as explained below). Indeed, as described by reference to Figure 4 (in particular from reference 410) and according to particular embodiments, the encapsulation step assigns an interleaving index k, in the range [0, trun's sample_count - 1], to each sample. The sample sizes read are available through a "sample_size[]" variable, where the i-th entry gives the size of the i-th sample in decoding order in the trun (first sample stored at i = 0). The trun has its data offset set on a standard basis (e.g. beginning of the 'moof', or the default data_offset potentially present in the track fragment header box, or the data_offset provided in the trun) and the sample_data_offset, relative to this data offset, for the current sample with interleave index k is computed (step 730) as follows:

sample_data_offset = 0;
for (i=0; i<k; i++)
   sample_data_offset += sample_size[idx_to_sample_num[i]];

where idx_to_sample_num provides, for a sample index in the second order (encapsulated order), the index of this same sample in the first order (e.g. decoding order). These indexes are computed sample after sample by the parser during steps 725 and 730.
From the so-computed sample data offset, the parser can extract the number of bytes corresponding to sample size and provide it to the decoder.
As illustrated, the parser iterates over all the samples (step 735) until the end of the fragment. The process continues until there is no more fragment to process.
When one fragment is processed, the extracted video stream can be processed by the video decoder.
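For illustration, a sketch of the parser side (steps 720 to 735) could look as follows, with hypothetical names, directly applying the sample_data_offset computation given above:

# Minimal sketch: interleave_indexes[i] is the interleave index of the i-th
# sample in decoding order; sample_sizes is also given in decoding order.
def extract_in_decoding_order(mdat, sample_sizes, interleave_indexes, data_offset=0):
    idx_to_sample_num = [0] * len(interleave_indexes)
    for decode_idx, k in enumerate(interleave_indexes):
        idx_to_sample_num[k] = decode_idx
    samples = []
    for decode_idx, k in enumerate(interleave_indexes):
        sample_data_offset = sum(sample_sizes[idx_to_sample_num[i]] for i in range(k))
        start = data_offset + sample_data_offset
        samples.append(mdat[start:start + sample_sizes[decode_idx]])
    return samples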
Signaling sample reordering using a standard 'trun' box

In particular embodiments, the reordered fragments are described within the standard 'trun' box. To indicate reordered samples, the standard 'trun' box is modified as follows. First, a new flags value for the 'trun' box is defined. For example, the value 0x100000 is reserved with the name SAMPLE_INTERLEAVE_BIT to indicate that sample data are stored in a different order than the decoding order. It is to be noted that the flags value here is provided as an example; any reserved value and name may be used provided that it does not conflict with other reserved flags values for the 'trun' box. In addition to the new flags value, the trun contains an additional parameter. This can be handled as a new version of the 'trun' box, as follows (in bold):

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   unsigned int(32) sample_count;
   // the following are optional fields
   signed int(32) data_offset;
   unsigned int(32) first_sample_flags;
   // all fields in the following array are optional
   // as indicated by bits set in the tr_flags
   // in particular the indication for reordering
   {
      unsigned int(32) sample_duration;
      unsigned int(32) sample_size;
      unsigned int(32) sample_flags;
      if (version == 0) { unsigned int(32) sample_composition_time_offset; }
      else { signed int(32) sample_composition_time_offset; }
      // presence depending on tr_flags value
      if (version == 2) { unsigned int(32) sample_interleave_index; }
   }[ sample_count ]
}

where sample_interleave_index indicates the order of sample interleaving in the 'trun' box. A value of 0 indicates that the sample data start at the trun data offset. A value of K>0 indicates that the sample data start at the trun data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same 'trun' box.
The semantics of the other parameters of the 'trun' box remains unchanged.
Signaling sample reordering using a compact, or optimized, 'trun' box.
In particular embodiments, the reordered fragments are described with the compact 'trun' box. To indicate reordered samples, the compact 'trun' box is modified as follows (in bold):

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   // all index fields take value 0,1,2,3 indicating 0,1,2,4 bytes
   unsigned int(2) duration_size_index;
   unsigned int(2) sample_size_index;
   unsigned int(2) flags_size_index;
   unsigned int(2) composition_size_index;
   // sample interleaving takes value 0,1,2,3 indicating 0,1,2,4 bytes
   // a value of 0 means no interleaving
   unsigned int(2) interleave_index_size;
   unsigned int(30) sample_count;
   // the following are optional fields
   if (data_offset_present)
      signed int(32) data_offset;
   if (first_sample_info_present) {
      unsigned int(32) first_sample_size;
      unsigned int(32) first_sample_flags;
   }
   // all the following arrays are effectively optional
   // as the field sizes can be zero
   unsigned int(f(duration_size_index)) sample_duration[ sample_count ];
   unsigned int(f(sample_size_index)) sample_size[ sample_count - (first_sample_info_present ? 1 : 0) ];
   unsigned int(f(flags_size_index)) sample_flags[ sample_count - (first_sample_info_present ? 1 : 0) ];
   if (version == 0) {
      unsigned int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   } else {
      signed int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   }
   if (interleave_index_size) {
      unsigned int(f(interleave_index_size)) sample_interleave_index[ sample_count ];
   }
}

where sample_interleave_index indicates the order of sample interleaving in the trun. A value of 0 indicates that the sample data start at the trun data offset. A value of K>0 indicates that the sample data start at the trun data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same trun. The semantics of the other parameters of the compact 'trun' box remain unchanged.
Accordingly, the sample interleave index provides, for each sample in the run of samples, the index of the position of this sample in the media data part of the file or the segment (e.g. in the 'mdat' box). It is to be noted that, for the sake of compaction, the 32-bit sample count can be moved into a 30-bit field. In embodiment variants, the sample count is still encoded on 32 bits and 8 more bits can be allocated to store the interleave_index_size in the proposed box. In the compact 'trun' box, when the packager sets the value of interleave_index_size to 0, the parser may interpret it as meaning there is no need to apply a reordering step. When the value set by the packager is different from 0, the parser has to read it from the descriptive metadata and to apply a reordering step to make sure to produce a compliant bit-stream for the video decoder.
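The f() notation used in the compact 'trun' box maps a 2-bit size index to a field width; a sketch of this mapping (with a hypothetical bit reader) could be:

# Minimal sketch: 2-bit size index 0,1,2,3 selects a width of 0, 8, 16 or 32 bits.
def f(size_index):
    return (0, 8, 16, 32)[size_index]

def read_sized_field(reader, size_index):
    nbits = f(size_index)
    return reader.read_bits(nbits) if nbits else None   # width 0 means the field is absent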
Signaling sample reordering using a 'trun' box relying on patterns

Since most encoders provide a regular temporal GOP structure, the inventors realized that when the sample description uses a 'trun' box relying on patterns, the interleaving information could be added at no cost in such a 'trun' box. This is achieved by modifying the proposed TrackRunPatternStruct and adding a new flag to the TrackRunPatternBox. For example, the box containing the declaration of the patterns defines the following flag value: 0x100000 SAMPLE_INTERLEAVE_BIT: when set, it indicates that sample data are stored in a different order than the decoding order. It is to be noted that the flag value here is provided as an example; any reserved value and name may be used provided that it does not conflict with other reserved flags values for the 'trun' box.
The definition of a given pattern according to embodiments of the invention is modified as follows to support the reordering of fragments (in bold):

aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags, numBitsSampleCount) {
   for (i = 0; i < numSamples; i++) {
      if (boxFlags & SAMPLE_DURATION_PRESENT)
         unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
      if (boxFlags & SAMPLE_FLAGS_PRESENT)
         unsigned int(32) sample_flags[patIdx][i];
      if (boxFlags & SAMPLE_CT_OFFSETS_PRESENT) {
         if (version == 0)
            signed int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
         else
            unsigned int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
      }
      if (boxFlags & SAMPLE_INTERLEAVE_BIT)
         unsigned int(numBitsSampleCount) sample_interleave_index[patIdx][i];
   }
   if (boxFlags & SAMPLE_SIZE_PRESENT) {
      for (i = 0; i < numSamples; i++) {
         unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
         numBitsSampleSize[patIdx][i] = (num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
      }
      if (numSamples % 2)
         bit(4) reserved = 0;
   }
}

where sample_interleave_index indicates the order of sample interleaving in the 'trun' box. A value of 0 indicates that the sample data start at the 'trun' data offset. A value of K>0 indicates that the sample data start at the 'trun' data offset plus the sum of the sizes of all samples with an interleaving index strictly less than K. There shall not be two samples with the same interleaving index in the same 'trun' box.
This design has the advantage of not using any bit in the 'trun' box relying on patterns to indicate the sample interleaving or reordering. For dynamic use cases (GOP structure varying), either a new template could be updated or the compact 'trun' box could be used.
The semantics of the TrackRunPatternStruct are unchanged.
Optimization of the 'trun' box

Given the repetition of fragments in movie files and the fact that they are getting shorter (typically half a second of video in CMAF), the description cost of fragments may become significant. Initial proposals to optimize the 'trun' box are not exhaustive. The inventors have observed that there are fields that may be compressed even more, whatever the type of 'trun' box (e.g., standard 'trun' box, compact 'trun' box or 'trun' box relying on patterns).
According to embodiments, the following fields may be optimized: data_offset: the number of bits for the description of the data_offset for a run of samples. According to embodiments, it is coded on a number of bits lower than the 32 bits used in the standard 'trun' box. Indeed, the inventors have observed that both the pattern 'trun' box and the compact 'trun' box designs still use 32 bits for the data offset, which is very large in DASH/CMAF cases since the base offset is the start of the 'moof' box. In most cases, 16 bits is more than enough. Therefore, according to embodiments, the possibility is given to use a smaller field, signaled through a new 'trun' flags value. For example, the value 0x100000 may be defined and reserved for a flags value denoted "DATA_OFFSET_16" that indicates, when set, that the data offset field is coded on a short number of bits (e.g., 16 bits). This flags value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags of the TrackFragmentHeaderBox is not set. Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising: obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data; generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), wherein a metadata item (e.g., the trun box) comprises a flag (e.g., DATA_OFFSET_16) indicating whether a data offset is coded on a predetermined size or not; and, encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.

sample_count: the number of samples for a run of samples. According to embodiments, it is coded on a smaller number of bits since the fragments, and thus the runs of samples, are becoming smaller (fewer samples) for low latency purposes. This is for example the case of CMAF fragments that can correspond to 0.5 second of video, thus "only" 15 samples for a video at 30 Hertz or 30 samples for a video at 60 Hertz. Using 32 bits is not so efficient since in most cases a lower number of bits such as 8 or 16 bits would be sufficient. According to embodiments, the sample count field is of variable or configurable size. Moreover, the sample count from one fragment to another may remain the same. This may be the case when the GOP structure used by the video encoder is constant over time. In such a case, a default sample count can be defined and overloaded when necessary. Since this item of information may be useful for the whole file, it is set by the encapsulation or packaging module in a box of the initialization segment. It can be for example inserted in a new version of the TrackExtendsBox 'trex' or in any box dedicated to the storage of default values used by the movie fragments:

aligned(8) class TrackExtendsBox extends FullBox('trex', version, 0) {
   unsigned int(32) track_ID;
   unsigned int(32) default_sample_description_index;
   unsigned int(32) default_sample_duration;
   unsigned int(32) default_sample_size;
   unsigned int(32) default_sample_flags;
   if (version == 1) {
      // complete list of default values
      unsigned int(32) default_sample_count;
      unsigned int(32) default_data_offset;
      unsigned int(32) default_first_sample_size;
      unsigned int(32) default_first_sample_flags;
      unsigned int(32) default_composition_time_offset;
   }
}

where the default parameters have the same semantics as in the 'trun' box. It is to be noted that when there is no handling of the first sample in the trun, the default_first_sample_size and default_first_sample_flags parameters could be omitted. To be able to distinguish between the two modes (first sample inside the loop on samples or first sample outside the loop), two versions of the new 'trex' box may be used. In such a case, version = 1 defines the following list of default values: default_sample_count; default_data_offset; default_composition_time_offset. When version >= 2, the additional parameters to handle the first sample are provided: default_first_sample_size; default_first_sample_flags.
This new 'trex' box may be used with the compact 'trun' box or the 'trun' box relying on patterns.
The presence or absence of sample_count information in the description of a run of samples may be indicated by a dedicated flags value of the 'trun' box.
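Purely as an illustration of how a reader could combine such defaults with the run description (field names are hypothetical):

# Minimal sketch: fall back to the defaults of the extended 'trex' box when the
# run of samples omits the corresponding field.
def resolve_run_fields(trun, trex_defaults):
    sample_count = trun.get("sample_count", trex_defaults["default_sample_count"])
    data_offset = trun.get("data_offset", trex_defaults["default_data_offset"])
    return sample_count, data_offset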
Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising: obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data; generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the trun pattern box) comprising a configurable parameter having a configurable size, wherein the metadata comprises an indication information (e.g., SAMPLE_COUNT_PRESENT) indicating whether a sample count field is present or not; and, encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
Specific embodiments depending on the kind of the 'trun' box (standard, compact, or relying on patterns) are described below.
The following embodiments can be implemented using different versions of the 'trun' box or any equivalent box to describe a run of samples.
In a variant, the box structures describing the fragments, especially the run of samples, do not comprise flags values indicating presence or absence of some parameters. Instead, an exhaustive list of default values is defined for any parameter describing the run of samples in a fragment. It can be done at the file level (e.g. 'trex' box or equivalent) when applying to the whole file. We call this variant the "exhaustive default values mode".
In this variant, depending on the number of bits in use for the fields describing runs of samples, the default value may be overloaded for a given fragment or even only for a given sample in the run of samples. Having a variable number of bits from 0 to 32, combined with the exhaustive list of default values, spares parsers the tests on flag values to determine whether a parameter describing the run of samples is present or not in the 'trun' box. For example, assuming the parser has just read the sample_duration value for a given sample, the parser has to determine whether the next parameter is present or not (since their presence is conditioned by the flags value in the 'trun' and/or in the 'tfhd' boxes). Before reading the next parameter, the parser has to check presence or absence of the sample size in the 'trun' box. This requires checking whether the 'trun' box has a predetermined value such as 0x000200 (indicating sample-size-present) set in its flags. If not, the parser has to further check whether the track fragment header box contains a default value for the sample size. Then, depending on the results of these tests, the parser may have to interpret a parameter in the 'trun' as the sample size parameter for the current sample. This exhaustive default values mode with a variable number of bits between 0 and 32 for each parameter avoids carrying out these tests. By default, when parsing a sample, the default values are set. Then, the parser, informed at the beginning of the trun parsing of the number of bits in use for each parameter, is able to determine how many bits to parse for a given parameter. This makes the description of runs of samples simpler to parse and even more efficient. This variant may be used with the compact 'trun' box or with 'trun' boxes relying on patterns as described later in this invention.
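A sketch of this parsing logic, given only as an illustration with hypothetical names, shows how per-sample flag tests disappear:

# Minimal sketch: the parser knows the bit width of every per-sample field up
# front; a zero width means "use the default value", so no flags are tested.
def parse_sample(reader, field_widths, defaults):
    sample = {}
    for name in ("sample_duration", "sample_size", "sample_flags",
                 "sample_composition_time_offset"):
        nbits = field_widths[name]               # 0, 8, 16 or 32 bits
        sample[name] = reader.read_bits(nbits) if nbits else defaults[name]
    return sample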
Optimization of the 'trun' box using a compact 'trun' box

In embodiments, the compact 'trun' box is used to describe the runs of samples within media fragments and is further improved by the optimizations discussed above. The structure of the compact 'trun' box is then modified as follows:

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   // all index fields take value 0,1,2,3 indicating 0,1,2,4 bytes
   unsigned int(2) duration_size_index;
   unsigned int(2) sample_size_index;
   unsigned int(2) flags_size_index;
   unsigned int(2) composition_size_index;
   unsigned int(2) sample_count_size_index;
   if (tr_flags & SAMPLE_COUNT_PRESENT)
      // this flags value is optional since the number of bits returned by f(sample_count_size_index) may be zero
      unsigned int(f(sample_count_size_index)) sample_count;
   // the following are optional fields
   if (data_offset_present) {
      if (tr_flags & DATA_OFFSET_16)
         signed int(16) data_offset;
      else
         signed int(32) data_offset;
   }
   if (first_sample_info_present) {
      unsigned int(32) first_sample_size;
      unsigned int(32) first_sample_flags;
   }
   // all the following arrays are effectively optional
   // as the field sizes can be zero
   unsigned int(f(duration_size_index)) sample_duration[ sample_count ];
   unsigned int(f(sample_size_index)) sample_size[ sample_count - (first_sample_info_present ? 1 : 0) ];
   unsigned int(f(flags_size_index)) sample_flags[ sample_count - (first_sample_info_present ? 1 : 0) ];
   if (version == 0) {
      unsigned int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   } else {
      signed int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   }
}

The data_offset may also be provided with a variable number of bits as an alternative to the flags value DATA_OFFSET_16.
In embodiments where the compact 'trun' box is used to describe the fragments, the description of the samples within the fragments, i.e. the runs of samples, is further optimized as described below.
The description of the first sample uses 32 bits to encode the size and the sample_flags values for the first sample in the run of samples. This could be further optimized by using a variable number of bits for these items of information. The packaging module determines the required number of bits and sets the actual value in use to describe the first sample inside the compact 'trun' box. The compact 'trun' box is then modified as follows (with the semantics unchanged):

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   // all index fields take value 0,1,2,3 indicating 0,1,2,4 bytes
   unsigned int(2) duration_size_index;
   unsigned int(2) sample_size_index;
   unsigned int(2) flags_size_index;
   unsigned int(2) composition_size_index;
   unsigned int(2) sample_count_size_index;
   unsigned int(2) first_sample_size_index;   // 0 if !first-sample-info-present
   unsigned int(2) first_flags_size_index;    // 0 if !first-sample-info-present
   unsigned int(2) reserved = 0;
   unsigned int(f(sample_count_size_index)) sample_count;
   // the following are optional fields
   if (data_offset_present) {
      if (tr_flags & DATA_OFFSET_16)
         signed int(16) data_offset;
      else
         signed int(32) data_offset;
   }
   if (first_sample_info_present) {
      unsigned int(f(first_sample_size_index)) first_sample_size;
      unsigned int(f(first_flags_size_index)) first_sample_flags;
   }
}

In a variant, rather than considering that the compact 'trun' box relies on flags values for presence or absence indication of some parameters, the compact 'trun' box uses the "exhaustive default values" mode. The 'trex' box exhaustively defines default values for each field or parameter describing a run of samples. The 2-bit size_index fields are exhaustive, i.e. all the fields in the compact 'trun' box come with an indication of the number of bits used to encode them. This makes parsing simpler by avoiding some tests carried out on a sample basis. The CompactTrackRunBox is modified as follows:

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   // all index fields take value 0,1,2,3 indicating 0,1,2,4 bytes
   unsigned int(2) duration_size_index;
   unsigned int(2) sample_size_index;
   unsigned int(2) flags_size_index;
   unsigned int(2) composition_size_index;
   unsigned int(2) sample_count_size_index;
   unsigned int(2) first_sample_size_index;
   unsigned int(2) first_flags_size_index;
   unsigned int(2) data_offset_size_index;
   unsigned int(f(sample_count_size_index)) sample_count;
   signed int(f(data_offset_size_index)) data_offset;
   unsigned int(f(first_sample_size_index)) first_sample_size;
   unsigned int(f(first_flags_size_index)) first_sample_flags;
   // all the following arrays are effectively optional
   // as the field sizes can be zero
   unsigned int(f(duration_size_index)) sample_duration[ sample_count ];
   unsigned int(f(sample_size_index)) sample_size[ sample_count - (first_sample_info_present ? 1 : 0) ];
   unsigned int(f(flags_size_index)) sample_flags[ sample_count - (first_sample_info_present ? 1 : 0) ];
   if (version == 0) {
      unsigned int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   } else {
      signed int(f(composition_size_index)) sample_composition_time_offset[ sample_count ];
   }
}

Each optimization may be applied independently of the others, but the maximum compression can generally be obtained by combining them all together.
In alternative embodiments, the compact 'trun' box uses the "exhaustive default values" mode and the sample description is provided in a loop on the samples (and not as a list of arrays as above):

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   // all index fields take value 0,1,2,3 indicating 0,1,2,4 bytes
   unsigned int(2) duration_size_index;
   unsigned int(2) sample_size_index;
   unsigned int(2) flags_size_index;
   unsigned int(2) composition_size_index;
   unsigned int(2) sample_count_size_index;
   unsigned int(2) first_sample_size_index;
   unsigned int(2) first_flags_size_index;
   unsigned int(2) data_offset_size_index;
   unsigned int(f(sample_count_size_index)) sample_count;
   signed int(f(data_offset_size_index)) data_offset;
   unsigned int(f(first_sample_size_index)) first_sample_size;
   unsigned int(f(first_flags_size_index)) first_sample_flags;
   for (sample = 0; sample < sample_count; sample++) {
      // all the following parameters are effectively optional
      // as the field sizes can be zero
      unsigned int(f(duration_size_index)) sample_duration;
      unsigned int(f(sample_size_index)) sample_size;
      unsigned int(f(flags_size_index)) sample_flags;
      if (version == 0) {
         unsigned int(f(composition_size_index)) sample_composition_time_offset;
      } else {
         signed int(f(composition_size_index)) sample_composition_time_offset;
      }
   }
}

with unchanged semantics.

In particular embodiments, the compact 'trun' box uses the "exhaustive default values" mode with a variable number of bits to encode the parameters, but these numbers of bits, instead of being defined in dedicated 2-bit codes, are specified through the flags value of the compact 'trun' box. This is possible because, with the "exhaustive default values" mode, flags are no longer needed to indicate presence or absence of the parameters. This saves 16 bits per trun box. The compact 'trun' box then rewrites (with unchanged semantics, only the bit length computation changes) as follows:

aligned(8) class CompactTrackRunBox extends FullBox('ctrn', version, tr_flags) {
   unsigned int(nb_bits(tr_flags)) sample_count;
   signed int(nb_bits(tr_flags >> 2)) data_offset;   // >> is the binary shift operator
   // information on first sample:
   unsigned int(nb_bits(tr_flags >> 4)) sample_duration;
   unsigned int(nb_bits(tr_flags >> 6)) first_sample_size;
   unsigned int(nb_bits(tr_flags >> 8)) first_sample_flags;
   if (version == 0) {
      unsigned int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
   } else {
      signed int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
   }
   // remaining samples:
   for (sample = 1; sample < sample_count; sample++) {
      // all the following parameters are effectively optional as the field sizes can be zero
      unsigned int(nb_bits(tr_flags >> 4)) sample_duration;
      unsigned int(nb_bits(tr_flags >> 6)) sample_size;
      unsigned int(nb_bits(tr_flags >> 8)) sample_flags;
      if (version == 0) {
         unsigned int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
      } else {
         signed int(nb_bits(tr_flags >> 10)) sample_composition_time_offset;
      }
   }
}

where the following tr_flags values are defined for the compact 'trun' box: for a given parameter, a 2-bit flags value is defined. The following function returns the actual number of bits in use for a given 2-bit flags value.
unsigned int(8) nb_bits(2bits_flags_value) {
   switch (2bits_flags_value & 0b00000011) {   // & is the binary AND operator (mask on the two least significant bits)
      case 0: return 0;
      case 1: return 8;
      case 2: return 16;
      case 3: return 32;
   }
}

The 2-bit flags values may be defined as follows:
sample_count_2bit_flags_value is one value in [0x00, 0x01, 0x02, 0x03];
data_offset_2bit_flags_value is one value in [0x00, 0x04, 0x08, 0x0C];
first_sample_size_2bit_flags_value is one value in [0x00, 0x10, 0x20, 0x30];
first_sample_flags_2bit_flags_value is one value in [0x00, 0x40, 0x80, 0xC0];
sample_duration_2bit_flags_value is one value in [0x00, 0x100, 0x200, 0x300];
sample_size_2bit_flags_value is one value in [0x00, 0x400, 0x800, 0xC00];
sample_flags_2bit_flags_value is one value in [0x00, 0x1000, 0x2000, 0x3000];
composition_time_2bit_flags_value is one value in the reserved value range [0x00, 0x4000, 0x8000, 0xC000].

From the above list, the 16-bit word formed by the tr_flags value provides the number of bits used to represent each parameter of the compact 'trun' box. The order of the declaration and the values above may be changed, but the corresponding computation in the compact 'trun' shall be updated accordingly, typically the bit shifting operation in the call to the nb_bits function. This mechanism is extensible and may be used to add the reordering or interleaving parameter in the same way.
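A sketch of how such a 16-bit flags word could be decoded into per-parameter widths, following the list of 2-bit values above (names are illustrative, and the bit positions assume the declaration order given in that list):

# Minimal sketch: each parameter owns two consecutive bits of tr_flags.
def nb_bits(two_bit_code):
    return (0, 8, 16, 32)[two_bit_code & 0b11]

def field_widths(tr_flags):
    return {
        "sample_count":            nb_bits(tr_flags),
        "data_offset":             nb_bits(tr_flags >> 2),
        "first_sample_size":       nb_bits(tr_flags >> 4),
        "first_sample_flags":      nb_bits(tr_flags >> 6),
        "sample_duration":         nb_bits(tr_flags >> 8),
        "sample_size":             nb_bits(tr_flags >> 10),
        "sample_flags":            nb_bits(tr_flags >> 12),
        "composition_time_offset": nb_bits(tr_flags >> 14),
    }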
Optimization of the 'trun' box using a 'trun' box relying on patterns

The sample_flags information may be compacted in the 'trun' box relying on patterns. This is useful as it enables storing sample_flags on 16 bits, getting rid of the sample_degradation_priority field that is not used by most (if not all) sequences.
According to embodiments, a new flag is introduced in the TrackRunPatternBox to adapt the number of bits used to represent the sample flags information:

aligned(8) class TrackRunPatternBox extends FullBox('trup', version, flags) {
   // length of subsequent syntax elements
   unsigned int(2) nbm1_sample_count;
   unsigned int(2) nbm1_sample_duration;
   unsigned int(2) nbm1_pattern_index;
   unsigned int(2) nbm1_ct_offset;
   unsigned int(2) nbm1_sample_flags;
   unsigned int(6) reserved;
   numBitsSampleCount = (nbm1_sample_count + 1) * 8;
   numBitsSampleDuration = (nbm1_sample_duration + 1) * 8;
   numBitsPatternIdx = (nbm1_pattern_index + 1) * 8;
   numBitsCTOffset = (nbm1_ct_offset + 1) * 8;
   numBitsSampleFlags = (nbm1_sample_flags + 1) * 8;
}

The track run pattern struct is adapted accordingly:

aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags, numBitsSampleFlags) {
   for (i = 0; i < numSamples; i++) {
      if (boxFlags & SAMPLE_DURATION_PRESENT)
         unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
      if (boxFlags & SAMPLE_FLAGS_PRESENT)
         unsigned int(numBitsSampleFlags) sample_flags[patIdx][i];
   }
}

Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising: obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data; generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), a metadata item of the structured metadata items (e.g., the 'trun' pattern box) describing samples using patterns and comprising a configurable parameter (e.g., sample_flags) having a configurable size, wherein the configurable parameter provides characteristics (or properties) associated with a sample of the set of continuous samples; and, encapsulating the timed media data and the generated metadata.
According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
Concerning the flags of the 'trun' box relying on patterns, it seems that some of the flags values introduced for the track run box relying on patterns apply to the TrackRunPatternBox rather than to the 'trun' box relying on patterns itself. Accordingly, one may consider either splitting the set of flags or describing the flags allowed in these two boxes. According to a preferred embodiment, the set of flag values is split and new flags are defined for the TrackRunPatternBox. It is to be noted that the values and names are just examples: any reserved values and names not conflicting with other flag values can be used.
Flags value / Name of the flags value / Semantics of the flags value:
0x000001 SAMPLE_DURATION_PRESENT: each sample has its own duration, otherwise the default is used.
0x000020 SAMPLE_SIZE_PRESENT: each sample has its own size, otherwise the default is used.
0x000040 SAMPLE_FLAGS_PRESENT: each sample has its own flags, otherwise the default is used.
0x000080 SAMPLE_CT_OFFSETS_PRESENT: each sample has a composition time offset (e.g. as used for I/P/B video in MPEG).
When the fragment description is based on a 'trun' box relying on patterns, the sample count is always present. In the case where a 'trun' box describes a single GOP, the sample count may be the same as the sample count value of the pattern. Therefore, according to particular embodiments, a new flag value is defined for the 'trun' box in pattern mode: 0x100000 SAMPLE_COUNT_PRESENT: if set, indicates that the sample count field is present; if not set, the sample count for this trun is the same as the sample count in the referred pattern.
The version of the 'trun' box relying on patterns is then updated accordingly:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   if (version == 0 || version == 1) {
      // syntax of trun is unchanged
   } else if (version >= 2) {
      if (numPatterns > 1) {
         unsigned int(numBitsPatternIdx) pat_idx;
         patIdx = pat_idx;
      } else {
         patIdx = 0;
      }
      if (tr_flags & SAMPLE_COUNT_PRESENT)
         unsigned int(numBitsSampleCount) sample_count_minus1;
      if (tr_flags & DATA_OFFSET_PRESENT)
         signed int(32) data_offset;
      initSampleFlag = ((tr_flags & FIRST_SAMPLE_PRESENT) > 0);
      if (initSampleFlag == 1) {
         unsigned int(numBitsFirstSampleFlags) sample_flags[0];
         unsigned int(numBitsFirstSampleDuration) sample_duration[0];
         unsigned int(numBitsSampleSize) sample_size[0];
         if (version == 2)
            unsigned int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
         else // version 3
            signed int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
      }
      if (numBitsSampleSize > 0) {
         for (i = initSampleFlag, inPatternIdx = 0, totalBits = 0; i <= sample_count_minus1; i++) {
            unsigned int(numBitsSampleSize[patIdx][inPatternIdx]) sample_size[i];
            totalBits += numBitsSampleSize[patIdx][inPatternIdx];
            refIdx[i] = inPatternIdx;
            inPatternIdx = ((inPatternIdx + 1) % (pattern_len_minus1[patIdx] + 1));
         }
         // byte alignment
         numBitsInLastByte = totalBits % 8;
         if (numBitsInLastByte)
            bit(8-numBitsInLastByte) reserved = 0;
      }
   }
}

Another consideration regarding the GOP pattern is that when a video sequence uses a fixed GOP pattern, it is common for the first sample of the GOP (typically an IDR frame) to have a much larger frame size than the other predicted or bi-directional frames. In the meantime, the other properties (sample flags, CT offset, duration) are usually the same from one sample to another. The current design of the 'trun' box relying on patterns makes provision for specific handling of the first sample in the 'trun'. However, the pattern structure enables a per-sample number of bits to encode the size, which can be used to handle the first sample of the 'trun' or of the GOP if there are multiple GOPs in the trun (i.e. the pattern is repeated).
According to embodiments, the pattern 'trun' is simplified by removing all the first-sample items of information (and related flags). This is simply done by looping on all the samples in the run instead of starting at the second one (i.e. instead of having specific signaling for the first sample):

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   if (version == 0 || version == 1) {
      // syntax unchanged
   } else if (version >= 2) {
      if (numPatterns > 1) {
         unsigned int(numBitsPatternIdx) pat_idx;
         patIdx = pat_idx;
      } else {
         patIdx = 0;
      }
      unsigned int(numBitsSampleCount) sample_count_minus1;
      if (tr_flags & DATA_OFFSET_PRESENT)
         signed int(32) data_offset;
      if (numBitsSampleSize > 0) {
         for (i = 0, inPatternIdx = 0, totalBits = 0; i <= sample_count_minus1; i++) {
            unsigned int(numBitsSampleSize[patIdx][inPatternIdx]) sample_size[i];
            totalBits += numBitsSampleSize[patIdx][inPatternIdx];
            refIdx[i] = inPatternIdx;
            inPatternIdx = ((inPatternIdx + 1) % (pattern_len_minus1[patIdx] + 1));
         }
         // byte alignment
         numBitsInLastByte = totalBits % 8;
         if (numBitsInLastByte)
            bit(8-numBitsInLastByte) reserved = 0;
      }
   }
}

According to embodiments, the 'trun' box relying on patterns is optimized by using a variable bit length for coding the data offset, for example indicated by a specific flag value in the 'trun' box. For example, the value 0x100000 and the name "DATA_OFFSET_16" are reserved to indicate, when set, that the data offset is coded on 16 bits. This flag value shall not be set if the data_offset_present flags value of the TrackRunBox is not set and the base-data-offset-present flags of the TrackFragmentHeaderBox is not set. The 'trun' box comprising such an optimization then rewrites:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   if (version == 0 || version == 1) {
      // syntax unchanged
   } else if (version >= 2) {
      if (numPatterns > 1) {
         unsigned int(numBitsPatternIdx) pat_idx;
         patIdx = pat_idx;
      } else {
         patIdx = 0;
      }
      unsigned int(numBitsSampleCount) sample_count_minus1;
      if (tr_flags & DATA_OFFSET_PRESENT) {
         if (tr_flags & DATA_OFFSET_16)
            signed int(16) data_offset;
         else
            signed int(32) data_offset;
      }
      initSampleFlag = ((tr_flags & FIRST_SAMPLE_PRESENT) > 0);
      if (initSampleFlag == 1) {
         unsigned int(numBitsFirstSampleFlags) sample_flags[0];
         unsigned int(numBitsFirstSampleDuration) sample_duration[0];
         unsigned int(numBitsSampleSize) sample_size[0];
         if (version == 2)
            unsigned int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
         else // version 3
            signed int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
      }
      if (numBitsSampleSize > 0) {
         for (i = initSampleFlag, inPatternIdx = 0, totalBits = 0; i <= sample_count_minus1; i++) {
            unsigned int(numBitsSampleSize[patIdx][inPatternIdx]) sample_size[i];
            totalBits += numBitsSampleSize[patIdx][inPatternIdx];
            refIdx[i] = inPatternIdx;
            inPatternIdx = ((inPatternIdx + 1) % (pattern_len_minus1[patIdx] + 1));
         }
         // byte alignment
         numBitsInLastByte = totalBits % 8;
         if (numBitsInLastByte)
            bit(8-numBitsInLastByte) reserved = 0;
      }
   }
}

In other embodiments handling the first sample of a run of samples in the 'trun' box relying on patterns, the pattern currently using sample_flags for all samples (first and others) is modified by using a FIRST_SAMPLE_FLAGS_PRESENT flag in the pattern definition, so as to use a full 32 bits for the first sample of the pattern:

aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags) {
   for (i = 0; i < numSamples; i++) {
      if (boxFlags & SAMPLE_DURATION_PRESENT)
         unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
      if (i == 0) {
         if (boxFlags & FIRST_SAMPLE_FLAGS_PRESENT)
            unsigned int(32) sample_flags[patIdx][i];
      } else {
         if (boxFlags & SAMPLE_FLAGS_PRESENT)
            unsigned int(32) sample_flags[patIdx][i];
      }
      if (boxFlags & SAMPLE_CT_OFFSETS_PRESENT) {
         if (version == 0)
            signed int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
         else
            unsigned int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
      }
   }
   if (boxFlags & SAMPLE_SIZE_PRESENT) {
      for (i = 0; i < numSamples; i++) {
         unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
         numBitsSampleSize[patIdx][i] = (num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
      }
      if (numSamples % 2)
         bit(4) reserved = 0;
   }
}

It is to be noted that these optimizations on data_offset, first sample handling, flag values for the sample count presence, the specific flag value for the track run pattern box and the variable bit length for sample flags may be combined to further improve the efficiency of the 'trun' box relying on patterns.
The "exhaustive default values mode" variant can be used for trun' boxes relying on patterns with the exhaustive list of default values, it can be defined in the pattern description. The pattern itself may use some of these default values and the TrackRunPatternBox is also modified to allow a null number of bits to support absence of one parameter without checking any flags value: aligned(8) class TrackRunPatternBox extends FullBox('trupc version, flags) { // length of subsequent syntax elements (exhaustive list) unsigned int(2) nbmtsample_count; unsigned int(2)nbm1_sample_duration; unsigned int(2)nbm1_pattem_index; unsigned int(2) nbratct offset; unsigned int(2) nbmtsample_size; unsigned int(2) nbmtsample flags; unsigned int(2) nbmtdata_offset; //These two ones may be omitted if the trun relying on patterns does not include specific processing of the first sample in the run unsigned int(2) nbml_first sample size; unsigned int(2) nbmtfirst sample_flags; I/O, 8; 16 or 32 bits are used: numBitsSampleCount = (nbmtsample_count & 2) * 16 + (nbml_sample_count & 1) * 8; numBitsSampleDuration = (nbml_sample duration&2) * 16 + ( (nbml_sample_duration&l)" 8; numBitsPatternIdx = (nbml_pattem index + 1)* 8; // from 0 to 24 bits numBitsCTOffset = (nbml_ct offset&2) "16 + (nbml_ct offset&l)* 8; numBitsSampleSize= (nbml_sample_size&2) "16 + (nbml_ sample size&l)" 8 numBitsSampleFlags = (nbml_sample_flags&2) " 16 + (nbml_ sample fiags&1)" 8 numBitsData0ffset = (nbm-l_data offset&2) "16 + (nbml_ data_offset&l) * 8; //These two ones may be omitted if the trun relying on patterns does not include specific processing of the first sample in the run numBitsFirstSampleSize = (nbm1_ first sample_size&2) * 16 + (nbm1_ first sample size &1)" 8; numBitsSampleFirstSampleFlags= (nbm1_ first sample fiags&2)" 16 + (nbm1_ first sample_flags &1) * 8; numPatterns = 0; for (I = 0,7i++)// until the end of the box unsigned int(8) patternien_minusilit TrackRunPattemStruct(version, i pattem_len_minus1p7+1, flags)// flags may be no more needed trackRunPattern[i]; numPatterns++; 25) The different variables in the pattern definition above provide the actual number of bits to describe and to parse samples in a run of samples.
With the above TrackRunPatternBox, the TrackRunPatternStruct can be modified as follows, allowing a parser to avoid tests on presence or absence of parameters in the sample description:
aligned(8) class TrackRunPatternStruct(version, patIdx, numSamples, boxFlags) {
   for (i = 0; i < numSamples; i++) {
      unsigned int(numBitsSampleDuration) sample_duration[patIdx][i];
      unsigned int(numBitsSampleFlags) sample_flags[patIdx][i];
      if (version == 0)
         signed int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
      else
         unsigned int(numBitsCTOffset) sample_composition_time_offset[patIdx][i];
   }
   if (numBitsSampleSize) {
      for (i = 0; i < numSamples; i++) {
         unsigned int(4) num_sample_size_nibbles_minus2[patIdx][i];
         numBitsSampleSize[patIdx][i] = (num_sample_size_nibbles_minus2[patIdx][i] + 2) * 4;
      }
      if (numSamples % 2)
         bit(4) reserved = 0;
   }
}

and the trun box relying on these pattern definitions rewrites as follows:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
   if (version == 0 || version == 1) {
      // syntax unchanged
   } else if (version >= 2) {
      if (numPatterns > 1) {
         unsigned int(numBitsPatternIdx) pat_idx;
         patIdx = pat_idx;
      } else {
         patIdx = 0;
      }
      unsigned int(numBitsSampleCount) sample_count_minus1;
      signed int(numBitsDataOffset) data_offset;
      // no more test on presence of first sample
      unsigned int(numBitsFirstSampleFlags) sample_flags[0];
      unsigned int(numBitsFirstSampleDuration) sample_duration[0];
      unsigned int(numBitsSampleSize) sample_size[0];
      if (version == 2)
         unsigned int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
      else // version 3
         signed int(numBitsFirstSampleCtOffset) sample_composition_time_offset[0];
      if (numBitsSampleSize > 0) {
         for (i = initSampleFlag, inPatternIdx = 0, totalBits = 0; i <= sample_count_minus1; i++) {
            unsigned int(numBitsSampleSize[patIdx][inPatternIdx]) sample_size[i];
            totalBits += numBitsSampleSize[patIdx][inPatternIdx];
            refIdx[i] = inPatternIdx;
            inPatternIdx = ((inPatternIdx + 1) % (pattern_len_minus1[patIdx] + 1));
         }
         // byte alignment
         numBitsInLastByte = totalBits % 8;
         if (numBitsInLastByte)
            bit(8-numBitsInLastByte) reserved = 0;
      }
   }
}

As for the compact 'trun' box, when the flags of the TrackRunPatternBox are not used to control the presence or absence of some parameters, the number of bits in use to encode the parameters may be provided as a list of 2-bit flags values.
For most video formats, the file format may carry within the metadata for sample description the composition time offsets (in the 'ctts' box) to indicate a sample presentation time. The sample presentation time may correspond to the composition time or may correspond to the composition time adjusted by one or more edit lists (described in the 'elst' box). The composition time offset for a given sample is coded as the difference between the sample presentation time of this sample and the current sample delta (sum of the durations of the previous samples). This offset is usually coded on 32 bits (for example in the standard 'trun' or in 'ctts' boxes), or on a smaller number of bits (8 to 32 bits) in the compact trun box. The sample composition time (CT) offset is expressed in the media timescale, which for video usually is a multiple of the framerate or a large number (e.g. 90k or 1M), resulting in large composition offsets, which can be quite heavy in terms of signalling. For some simple framerates (integer numbers), this is not an issue as a small timescale can be picked, but this does not apply to non-integer framerates or to some distribution systems enforcing a very high timescale. In a typical GOP (Group Of Pictures in a video stream), some frames have a CT offset different than 0, some have a CT offset of 0 (for which samples this applies depends on the GOP structure and the 'ctts' version, i.e. positive or negative offsets).
For example, an IBBP pattern repeated in a GOP may have the following decoding and composition times and offsets (samples listed in decoding order: I1 P4 B2 B3 P7 B5 B6 ...):

decoding time (DT):              0  10  20  30  40  50  60
composition time (CT):          10  40  20  30  70  50  60
decode delta (DT(n+1) - DT(n)): 10  10  10  10  10  10  10
CT offset:                      10  30   0   0  30   0   0
CT offset / decode delta:        1   3   0   0   3   0   0

Storing the CT offset per sample (instead of run-length encoding) would allow gaining some space, but would require a different signaling (typically through sample flags) per sample, which in turn is not very efficient. In most cases however, the CT offset is not just any number; it is a delay in a number of frames, and can be expressed as N * sample_duration. Since the sample duration is known in video with a constant frame rate, storing the number of frames instead of the actual offset will achieve higher compactness. For example, for a 30 fps video with a timescale of 30000 in a one-second GOP (e.g. sample duration = 1000), the CT offset of the first P following the IDR can go up to 29 frames. Hence 29*1000 = 29000, requiring 2 bytes to store the CT offset but only one byte with our approach (overall gain for the GOP is 30 times 1 byte). For a 3-second GOP (90 frames), the offset could reach 89*1000 = 89000, requiring 3 bytes to store the CT offset, but still only one byte with our approach (overall gain for the GOP = 90 times 2 bytes). In some corner cases, the CT offset might need to be expressed as a multiple of the sample duration (e.g. 29.97 fps at a drop boundary). In order to keep the possibility to use both signalings (timescale difference or frame difference), we then propose to define within the metadata describing the samples an indication about how the CT field should be interpreted.

Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the timed media data being requested by a client, the method being carried out by a server and comprising: obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data; generating metadata describing the obtained fragment, the metadata comprising structured metadata items (e.g., boxes), wherein a metadata item of the structured metadata items comprises an indication information (e.g., SAMPLE_CT_FRAME) indicating whether a composition time offset parameter is coded as a function of a sample duration or not; and, encapsulating the timed media data and the generated metadata. According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
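The recomputation of a composition time offset coded as a number of frames may be sketched as follows (hypothetical names, figures taken from the example above):

# Minimal sketch: when offsets are signaled in frames, multiply by the sample duration.
def decoded_ct_offset(coded_value, offsets_in_frames, sample_duration):
    return coded_value * sample_duration if offsets_in_frames else coded_value

# 30 fps video, timescale 30000, sample duration 1000: a 29-frame offset codes
# as 29 (one byte) instead of 29000 (two bytes).
assert decoded_ct_offset(29, True, 1000) == 29000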
Specific embodiments depending on the type of the 'trun' box (e.g., standard, compact, or relying on patterns) are described below.
In an embodiment where the trun box used to encapsulate the media fragments is a standard 'trun' box, the indication about the composition_time_offset could be present in the sample description, for example as a specific flags value in the CompositionOffsetBox ('ctts'): 0x000001 sample-composition-time-offsets-frames: when set, this indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units. When the flags value is set, the fields of the box use half the bits compared to when this flags value is not set (16 bits instead of 32), to take benefit of the shorter code for the sample offset.
The 'ctts' box would then be modified as follows (in bold):

aligned(8) class CompositionOffsetBox extends FullBox('ctts', version, flags) {
   unsigned int(32) entry_count;
   if (version == 0) {
      if (flags & 0x000001) {
         for (int i=0; i < entry_count; i++) {
            unsigned int(16) sample_count;
            unsigned int(16) sample_offset;
         }
      } else {
         for (int i=0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            unsigned int(32) sample_offset;
         }
      }
   } else if (version == 1) {
      if (flags & 0x000001) {
         for (int i=0; i < entry_count; i++) {
            unsigned int(16) sample_count;
            signed int(16) sample_offset;
         }
      } else {
         for (int i=0; i < entry_count; i++) {
            unsigned int(32) sample_count;
            signed int(32) sample_offset;
         }
      }
   }
}

In an embodiment where the trun box used to encapsulate the media fragments is a compact 'trun' box, the following flags value is defined (in addition to existing ones) for the compact track run box. It is to be noted that this value, respectively this name, is just an example; any reserved or dedicated value, respectively name, not conflicting with existing flags values, respectively names, can be used: 0x001000 sample-composition-time-offsets-frames: when set, this indicates that the composition offset is coded as a multiple of the sample duration (whatever the number of bits used), and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units. For example, the packaging module 313 in Figure 3 can be informed that the encoding is done with a constant frame rate. In such a case, it sets the flags value and provides the composition time offsets as multiples of a sample duration, thus reducing the necessary number of bits.
According to embodiments where the 'trun' box used to encapsulate the media fragments relies on patterns, an additional flags value is defined for the track run pattern box as follows, in addition to the other existing flags values:
0x100000 SAMPLE_CT_FRAME: when this bit is set, it indicates that the composition offset is coded as a multiple of the sample duration, and shall be recomputed by multiplying the coded value by the sample duration. If not set, the composition offset is coded in timescale units.
Again, it is to be noted that this flags value, respectively this name, is just an example: any reserved or dedicated value, respectively name, not conflicting with an existing flags value, respectively name, can be used.
As an alternative to the flags value indicating that the composition time offset is coded as a multiple of the sample duration, this can be inferred in specific cases where the flags in the box hierarchy describing the fragments (e.g., 'moof' or 'traf') indicate a default duration or indicate that the sample duration is not present in the 'trun' box.
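A minimal sketch of this inference, assuming illustrative parameter names that are not defined by the standard, could look as follows:

    # Hypothetical inference: offsets are taken as multiples of the sample duration
    # only when a default duration is available and no per-sample duration is coded.
    def ct_offset_is_in_frames(trun_flags, tfhd_has_default_duration, trex_has_default_duration):
        SAMPLE_DURATION_PRESENT = 0x000100   # 'trun' flag: per-sample durations present
        per_sample_duration = bool(trun_flags & SAMPLE_DURATION_PRESENT)
        return (not per_sample_duration) and (tfhd_has_default_duration or trex_has_default_duration)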
According to other embodiments, the media data to encapsulate come with additional or auxiliary data. For example, it may be depth information accompanying a video stream. In another example, it is auxiliary data describing encryption parameters per sample, as used by MPEG Common Encryption (CENC). Fragmenting the media data and their auxiliary items of information may use the 'trun' plus the 'saiz' boxes to encapsulate these data in the same media file or media segment (as mandated for example in ISOBMFF). The current syntax of the 'saiz' box is as follows:

    aligned(8) class SampleAuxiliaryInformationSizesBox
        extends FullBox('saiz', version = 0, flags) {
        if (flags & 1) {
            unsigned int(32) aux_info_type;
            unsigned int(32) aux_info_type_parameter;
        }
        unsigned int(8) default_sample_info_size;
        unsigned int(32) sample_count;
        if (default_sample_info_size == 0) {
            unsigned int(8) sample_info_size[ sample_count ];
        }
    }

In the example use case of media data encryption, the MPEG Common Encryption scheme uses auxiliary data describing encryption parameters per sample. This information typically includes the Initialization Vector (IV) for the whole sample, or the IV and a list of clear and encrypted byte ranges in the sample (subsample encryption). In some configurations, such as 'cbcs' with a constant IV, this information is empty and consequently omitted. In other configurations, this information shall be signaled through the sample auxiliary information mechanism, using the 'saiz' and 'saio' boxes (in the main movie or in movie fragments). For subsample encryption, the size of the auxiliary data can change in the following cases (an illustrative size computation is sketched after this list):
- a different number of slices in each frame, leading to a different number of subsamples in configurations where the slice header shall be unencrypted. Leaving the slice header unencrypted is useful when slice header rewriting is needed, for example when mixing files, or when the application needs to identify which part is encrypted by inspecting the slice header, for example in selective encryption use cases where only a spatial part such as a slice or a tile is encrypted;
- injection at specific frames of large Supplemental Enhancement Information (SEI) data (for example more than 65k bytes), forcing the creation of a new subsample entry with no encrypted bytes, although this is not so common;
- mixing both protected and non-protected samples: the protected samples will have an associated 'saiz' entry different from 0, while the unprotected samples will have an associated 'saiz' entry equal to 0. This may correspond to an area encrypted for privacy reasons or to an area of particular interest that must be paid for to be viewed;
- mixing different configurations of the encryption parameters, such as different per-sample Initialization Vector sizes;
- in schemes supporting partial encryption of Video Coding Layer data (such as sensitive encryption), a varying number of protected byte ranges across samples; and
- in schemes supporting multiple key encryption (such as sensitive encryption), a varying number of keys used per sample.
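As an illustration of why the size varies, the following sketch (a simplification of the CENC auxiliary information layout; the helper name is an assumption) computes the per-sample auxiliary information size from the IV size and the number of subsamples, for example one subsample per slice:

    # Illustrative, simplified size computation for CENC per-sample auxiliary information.
    def cenc_sample_aux_info_size(per_sample_iv_size, subsample_count):
        if subsample_count == 0:
            return per_sample_iv_size                       # IV only
        # IV + 16-bit subsample count + one clear/protected byte-range pair (6 bytes) per subsample
        return per_sample_iv_size + 2 + 6 * subsample_count

    # Frames carrying 1, 1, 4 and 1 slices give different sizes, so a single
    # default_sample_info_size cannot describe them:
    print([cenc_sample_aux_info_size(8, n) for n in (1, 1, 4, 1)])   # [16, 16, 34, 16]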
The variations are documented in an associated sample group description entry of type 'seig', and the mapping of each sample to the group is done using the SampleToGroupBox, with a compact version proposed in DAM1 of ISO/IEC 14496-12. However, a compact representation for the description of the size has not been studied.
The variations can take various forms: repeated patterns, single-slot variations, or bursts of the same value. However, the resulting sizes usually cover a well-defined set of values, representing all the possible encryption/encoding configurations.
As can be seen from the above definition, a single variation in the auxiliary sample data size (default_sample_info_size) results in expanding the entire table, which is not very efficient.
Therefore, according to particular embodiments, a new version of the 'saiz' box is defined, enabling simple run-length encoding addressing most use cases, and pattern description for cases where a pattern can be used.
    aligned(8) class SampleAuxiliaryInformationSizesBox
        extends FullBox('saiz', version, flags) {
        if (flags & 1) {
            unsigned int(32) aux_info_type;
            unsigned int(32) aux_info_type_parameter;
        }
        if (version == 0) {
            unsigned int(8) default_sample_info_size;
            unsigned int(32) sample_count;
            if (default_sample_info_size == 0) {
                unsigned int(8) sample_info_size[ sample_count ];
            }
        } else if (version == 1) {
            unsigned int(32) entry_count;
            for (i = 0; i < entry_count; i++) {
                unsigned int(8) sample_count_in_entry;
                unsigned int(8) si_rle_size;
            }
        } else if (version == 2) {
            unsigned int(32) pattern_count;
            // pattern definition
            for (i = 0; i < pattern_count; i++) {
                unsigned int(8) pattern_length[i];
                unsigned int(8) sample_pat_count[i];
            }
            for (j = 0; j < pattern_count; j++) {
                for (k = 0; k < pattern_length[j]; k++) {
                    unsigned int(8) si_pat_size[j][k];
                }
            }
        }
    }

with the following semantics:
- entry_count gives the number of entries in the box when version 1 is used;
- sample_count_in_entry gives the number of consecutive samples to which the si_rle_size applies. Samples are listed in decoding order. The same remarks as for sample_count apply;
- si_rle_size gives the size in bytes of the sample auxiliary information for the samples in the current entry;
- pattern_count indicates the number of successive patterns in the pattern array that follows it. The sum of the included sample_pat_count[i] values indicates the number of mapped samples;
- pattern_length[i] corresponds to a pattern within the second array of si_pat_size[j][k] values. Each instance of pattern_length[i] shall be greater than 0;
- sample_pat_count[i] specifies the number of samples that use the i-th pattern; sample_pat_count[i] shall be greater than zero, and sample_pat_count[i] shall be greater than or equal to pattern_length[i];
- si_pat_size[j][k] is an integer that gives the size of the sample auxiliary information data for the samples in the pattern.
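A minimal sketch of the run-length form (version 1) is given below; the helper names are hypothetical and serve only to show how consecutive equal sizes collapse into (sample_count_in_entry, si_rle_size) entries:

    # Hypothetical run-length encoding/decoding of sample auxiliary information sizes.
    from itertools import groupby

    def rle_encode_sizes(sample_info_sizes):
        # One (count, size) entry per run of identical consecutive sizes.
        return [(len(list(group)), size) for size, group in groupby(sample_info_sizes)]

    def rle_decode_sizes(entries):
        return [size for count, size in entries for _ in range(count)]

    sizes = [16, 16, 16, 34, 16, 16, 0, 0, 0, 16]
    entries = rle_encode_sizes(sizes)
    print(entries)                              # [(3, 16), (1, 34), (2, 16), (3, 0), (1, 16)]
    assert rle_decode_sizes(entries) == sizes   # ten sizes described by five entries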
When sample_pat_count[i] is equal to pattern_length[i], the pattern is not repeated.
When sample_pat_count[i] is greater than pattern_length[i], the si_pat_size[i][] values of the i-th pattern are used repeatedly to map the sample_pat_count[i] values. It is not necessarily the case that sample_pat_count[i] is a multiple of pattern_length[i]; the cycling may terminate in the middle of the pattern. The total of the sample_pat_count[i] values for all values of i in the range of 1 to pattern_count, inclusive, shall be equal to the total sample count of the track (if the box is present in the sample table) or of the track fragment.
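The cycling rule above can be illustrated by the following hypothetical reader-side sketch, which expands the version 2 pattern description back into per-sample sizes (helper names are assumptions):

    # Hypothetical expansion of the version 2 pattern description into per-sample sizes.
    def expand_patterns(pattern_lengths, sample_pat_counts, si_pat_sizes):
        sizes = []
        for length, count, pattern in zip(pattern_lengths, sample_pat_counts, si_pat_sizes):
            assert length > 0 and count >= length
            for n in range(count):
                # The pattern repeats and may stop in the middle of a cycle.
                sizes.append(pattern[n % length])
        return sizes

    # One pattern of length 2 used for 5 samples, then a pattern of length 1 used twice.
    print(expand_patterns([2, 1], [5, 2], [[16, 34], [0]]))   # [16, 34, 16, 34, 16, 0, 0]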
An alternative compact representation of the 'saiz' box avoids the redefinition of patterns when they reappear after a different pattern. For example, assuming the pattern sequence ABC DE DE ABC XY ABC, the pattern "ABC" reappears after the "DE" pattern. To avoid redefining it, the pattern is referred to through a pattern index as follows:

    aligned(8) class SampleAuxiliaryInformationSizesBox
        extends FullBox('saiz', version, flags) {
        if (flags & 1) {
            unsigned int(32) aux_info_type;
            unsigned int(32) aux_info_type_parameter;
        }
        if (version == 0) {
            unsigned int(8) default_sample_info_size;
            unsigned int(32) sample_count;
            if (default_sample_info_size == 0) {
                unsigned int(8) sample_info_size[ sample_count ];
            }
        } else if (version == 1) {
            unsigned int(32) entry_count;
            for (i = 0; i < entry_count; i++) {
                unsigned int(8) sample_count_in_entry;
                unsigned int(8) si_rle_size;
            }
        } else if (version == 2) {
            unsigned int(32) entry_count;
            for (i = 0; i < entry_count; i++) {
                unsigned int(8) pattern_idx[i];
                unsigned int(8) sample_pat_count[i];
            }
            unsigned int(8) pattern_count;
            for (j = 0; j < pattern_count; j++) {
                unsigned int(8) pattern_length[j];
                for (k = 0; k < pattern_length[j]; k++) {
                    unsigned int(8) si_pat_size[j][k];
                }
            }
        }
    }

Therefore, according to an aspect of the invention, there is provided a method for encapsulating timed media data, the media data being requested by a client, the method being carried out by a server and comprising: obtaining a fragment of the timed media data, the fragment comprising a set of continuous sample(s) of the timed media data and comprising auxiliary information associated to the continuous samples; generating metadata describing the obtained fragment, the metadata defining an auxiliary information size of the auxiliary information associated to the continuous samples, wherein the metadata sub-item comprises a parameter determined as a function of a number of times a pattern is used; and encapsulating the timed media data and the generated metadata. According to another aspect, there is provided a method for transmitting and a method for processing the encapsulated timed media data.
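Returning to the ABC DE DE ABC XY ABC example above, the following hypothetical sketch shows how the indexed representation is expanded, with the pattern "ABC" defined once and referenced three times (the size values are illustrative):

    # Hypothetical expansion of the pattern-index representation.
    patterns = [[10, 20, 30],    # pattern "ABC"
                [40, 50],        # pattern "DE"
                [60, 70]]        # pattern "XY"
    entries = [(0, 3), (1, 4), (0, 3), (2, 2), (0, 3)]   # (pattern_idx, sample_pat_count)

    def expand_indexed_patterns(entries, patterns):
        sizes = []
        for idx, count in entries:
            pattern = patterns[idx]
            sizes.extend(pattern[n % len(pattern)] for n in range(count))
        return sizes

    print(expand_indexed_patterns(entries, patterns))
    # [10, 20, 30, 40, 50, 40, 50, 10, 20, 30, 60, 70, 10, 20, 30]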
The optimized version of the 'saiz' box can be combined with any kind of 'trun' box: the standard 'trun' box, the compact 'trun' box, or the 'trun' box relying on patterns.
The reordering indication can be combined with the further optimized compact 'trun' box according to embodiments of the invention. It can be combined with a compact 'trun' box containing one of the proposed optimizations, or all the proposed optimizations for better efficiency. The encapsulated file or segment may further contain a compact 'saiz' box or a 'saiz' box according to embodiments of the invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the 'mdat'. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
The new unit used to describe the composition_time_offset may be used with reordering information, whatever the type of 'trun' box in use: standard, compact, or relying on patterns. The encapsulated file or segment may further contain a compact 'saiz' box or a 'saiz' box according to this invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the 'mdat'. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
The reordering indication can be combined with the further optimized 'trun' box relying on patterns according to embodiments of the invention. It can be combined with a 'trun' box relying on patterns containing one of the described optimizations, or all the described optimizations for better efficiency. The encapsulated file or segment may further contain a compact 'saiz' box or a 'saiz' box according to embodiments of the invention. In the latter case, the auxiliary data are advantageously placed at the beginning of the 'mdat'. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
The compact 'saiz' box may be used with any version of the 'trun' box: the standard 'trun' box, the compact 'trun' box, or the 'trun' box relying on patterns. The compact 'saiz' box may also be used when fragments are reordered as described with reference to Figures 4 to 7. In the latter case, the auxiliary data are advantageously placed at the beginning of the 'mdat'. For example, in the case of encrypted content, the encryption information is always available whatever the number of video samples that is sent or received.
Figure 8 is a schematic block diagram of a computing device 800 for implementation of one or more embodiments of the invention, in particular all or some of the steps described by reference to Figures 3, 4, 6, and 7. The computing device 800 may be a device such as a micro-computer, a workstation, or a light portable device. The computing device 800 comprises a communication bus connected to:
- a central processing unit (CPU) 801, such as a microprocessor;
- a random access memory (RAM) 802 for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method for reading and writing the manifests and/or for encoding the video and/or for reading or generating data under a given file format, the memory capacity thereof being expandable by an optional RAM connected to an expansion port, for example;
- a read only memory (ROM) 803 for storing computer programs for implementing embodiments of the invention;
- a network interface 804 that is, in turn, typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 804 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 801;
- a user interface (UI) 805 for receiving inputs from a user or displaying information to a user;
- a hard disk (HD) 806; and
- an I/O module 807 for receiving/sending data from/to external devices such as a video source or display.
The executable code may be stored either in the read only memory 803, on the hard disk 806, or on a removable digital medium such as a disk, for example. According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 804, in order to be stored in one of the storage means of the computing device 800, such as the hard disk 806, before being executed.
The central processing unit 801 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 801 is capable of executing instructions from the main RAM memory 802 relating to a software application after those instructions have been loaded from the program ROM 803 or the hard disk (HD) 806, for example. Such a software application, when executed by the CPU 801, causes the steps of the flowcharts shown in the previous figures to be performed.
In this embodiment, the apparatus is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a person skilled in the art which lie within the scope of the present invention.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (22)

  1. A method for encapsulating encoded media data, the method comprising: obtaining samples of the encoded media data, the samples of the encoded media data being ordered according to a first ordering; and encapsulating samples of the obtained samples, ordered according to a second ordering, the second ordering depending on a priority level associated with each of the obtained samples for processing the encapsulated samples, upon decapsulation; and reordering information associated with the encapsulated samples for re-ordering the encapsulated samples according to the first ordering, upon decapsulation.
  2. The method of claim 1, wherein the media data are timed media data and wherein the obtained samples of the encoded media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
  3. The method of claim 1 or claim 2, wherein the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
  4. The method of claim 3, wherein each parameter value of the list is a position index, each position index being determined as a function of an offset and of the coding length of the obtained samples.
  5. The method of any one of claims 1 to 4, wherein the obtained samples are encapsulated using the metadata associated with the samples.
  6. The method of claim 5, further comprising obtaining a priority map associated with the obtained samples, the reordering information being determined as a function of the obtained priority map.
  7. The method of any one of claims 1 to 6, wherein obtaining samples of the encoded media data comprises obtaining samples of the media data and encoding the obtained samples of the media data.
  8. The method of claim 7, wherein the priority levels are obtained from the encoding of the obtained samples of the media data.
  9. The method of claim 8, wherein the priority levels are determined as a function of dependencies between the obtained samples of the media data.
  10. A method for transmitting encoded media data from a server to a client, the media data being requested by the client, the method being carried out by the server and comprising: encapsulating the encoded media data according to the method of any one of claims 1 to 9; and transmitting, to the client, the encapsulated encoded media data.
  11. A method for processing encapsulated media data, the encapsulated media data comprising encoded samples and metadata, the metadata comprising reordering information, the method comprising: obtaining samples of the encapsulated media data, the obtained samples of the encapsulated media data being ordered according to a second ordering, and reordering information; and reordering the obtained samples in a first ordering according to the obtained reordering information, the first ordering making it possible for the obtained samples to be decoded.
  12. The method of claim 11, wherein the media data are timed media data and wherein the obtained samples of the encapsulated media data correspond to a plurality of contiguous timed media data samples, the reordering information being encoded within metadata associated with the plurality of contiguous timed media data samples.
  13. The method of claim 11 or claim 12, wherein the reordering information comprises a list of parameter values, each parameter value of the list being associated with a position of one sample in a stream of samples.
  14. The method of claim 13, wherein reordering the obtained samples comprises computing offsets as a function of the parameter values and of coding lengths of the encoded samples.
  15. The method of any one of claims 11 to 14, further comprising decoding the reordered samples.
  16. The method of any one of claims 11 to 15, carried out in a client, the samples of the encapsulated media data and the reordering information being received from a server.
  17. The method of any one of claims 1 to 16, wherein the format of the encapsulated media data is of the ISOBMFF type or of the CMAF type.
  18. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing each of the steps of the method according to any one of claims 1 to 17 when loaded into and executed by the programmable apparatus.
  19. A non-transitory computer-readable storage medium storing instructions of a computer program for implementing each of the steps of the method according to any one of claims 1 to 17.
  20. A signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
  21. A media storage device storing a signal carrying an information dataset for media data, the information dataset comprising encapsulated encoded media data samples and reordering information, the reordering information comprising a description of an order of samples for decoding the encoded samples.
  22. A device for transmitting or receiving encapsulated media data, the device comprising a processing unit configured for carrying out each of the steps of the method according to any one of claims 1 to 17.
GB1815364.3A 2018-09-20 2018-09-20 Method, device, and computer program for improving transmission of encoded media data Withdrawn GB2583885A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1815364.3A GB2583885A (en) 2018-09-20 2018-09-20 Method, device, and computer program for improving transmission of encoded media data
PCT/EP2019/075372 WO2020058494A1 (en) 2018-09-20 2019-09-20 Method, device, and computer program for improving transmission of encoded media data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1815364.3A GB2583885A (en) 2018-09-20 2018-09-20 Method, device, and computer program for improving transmission of encoded media data

Publications (2)

Publication Number Publication Date
GB201815364D0 GB201815364D0 (en) 2018-11-07
GB2583885A true GB2583885A (en) 2020-11-18

Family

ID=64024186

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1815364.3A Withdrawn GB2583885A (en) 2018-09-20 2018-09-20 Method, device, and computer program for improving transmission of encoded media data

Country Status (2)

Country Link
GB (1) GB2583885A (en)
WO (1) WO2020058494A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022143615A1 (en) * 2020-12-28 2022-07-07 Beijing Bytedance Network Technology Co., Ltd. Cross random access point sample group
GB2602643B (en) * 2021-01-06 2023-04-05 Canon Kk Method, device, and computer program for optimizing encapsulation of images
CN113438691B (en) * 2021-05-27 2024-01-05 翱捷科技股份有限公司 TAS frame processing method and device
CN114371674B (en) * 2021-12-30 2024-04-05 中国矿业大学 Method and device for sending analog data frame, storage medium and electronic device
EP4266690A1 (en) * 2022-04-19 2023-10-25 Nokia Technologies Oy An apparatus, a method and a computer program for video coding and decoding
US20230362415A1 (en) * 2022-05-05 2023-11-09 Lemon Inc. Signaling of Preselection Information in Media Files Based on a Movie-level Track Group Information Box
CN117061189B (en) * 2023-08-26 2024-01-30 上海六坊信息科技有限公司 Data packet transmission method and system based on data encryption

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120023250A1 (en) * 2010-07-20 2012-01-26 Qualcomm Incorporated Arranging sub-track fragments for streaming video data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10044369B1 (en) * 2018-03-16 2018-08-07 Centri Technology, Inc. Interleaved codes for dynamic sizeable headers

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120023250A1 (en) * 2010-07-20 2012-01-26 Qualcomm Incorporated Arranging sub-track fragments for streaming video data

Also Published As

Publication number Publication date
GB201815364D0 (en) 2018-11-07
WO2020058494A1 (en) 2020-03-26

Similar Documents

Publication Publication Date Title
GB2583885A (en) Method, device, and computer program for improving transmission of encoded media data
EP3603080B1 (en) Method and apparatus for encoding media data comprising generated content
US11805304B2 (en) Method, device, and computer program for generating timed media data
JP6572222B2 (en) Media file generation method, generation device, and program
CN113170239B (en) Method, apparatus and storage medium for encapsulating media data into media files
US11805302B2 (en) Method, device, and computer program for transmitting portions of encapsulated media content
CN112019857A (en) Method and apparatus for storage and signaling of compressed point clouds
US20220167025A1 (en) Method, device, and computer program for optimizing transmission of portions of encapsulated media content
US20230025332A1 (en) Method, device, and computer program for improving encapsulation of media content
CN113574903B (en) Method and apparatus for late binding in media content
US20230370659A1 (en) Method, device, and computer program for optimizing indexing of portions of encapsulated media content data
WO2022148650A1 (en) Method, device, and computer program for encapsulating timed media content data in a single track of encapsulated media content data
GB2620582A (en) Method, device, and computer program for improving indexing of portions of encapsulated media data
WO2023274877A1 (en) Method, device, and computer program for dynamically encapsulating media content data
CN118044211A (en) Method, apparatus and computer program for optimizing media content data encapsulation in low latency applications

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: CANON KABUSHIKI KAISHA

Free format text: FORMER OWNERS: CANON KABUSHIKI KAISHA;TELECOM PARIS TECH

WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)