GB2509956A - Encapsulating tiled timed media data, applicable to scalable coding


Info

Publication number
GB2509956A
GB2509956A (application GB1300953.5A)
Authority
GB
United Kingdom
Prior art keywords
track
tile
spatial
timed
tracks
Prior art date
Legal status
Granted
Application number
GB1300953.5A
Other versions
GB201300953D0 (en)
GB2509956B (en)
Inventor
Frédéric Maze
Hervé Le Floch
Franck Denoual
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Application filed by Canon Inc
Priority to GB1300953.5A (granted as GB2509956B)
Publication of GB201300953D0
Priority to PCT/EP2014/050952 (WO2014111547A1)
Priority to JP2015553102A (JP6968516B2)
Priority to CN201811096244.9A (CN109587573B)
Priority to BR112015017192-3A (BR112015017192B1)
Priority to CN201811096243.4A (CN109618235B)
Priority to KR1020157019368A (KR101748779B1)
Priority to RU2015134542A (RU2616185C2)
Priority to KR1020187013778A (KR101965374B1)
Priority to CN201480016610.9A (CN105052167B)
Priority to KR1020177016056A (KR101861979B1)
Priority to US14/761,269 (US9591383B2)
Priority to RU2017112040A (RU2654051C1)
Priority to EP14700754.6A (EP2946566B1)
Publication of GB2509956A
Application granted
Publication of GB2509956B
Priority to US15/415,738 (US10405063B2)
Priority to RU2018112727A (RU2689140C1)
Priority to US16/510,689 (US11178470B2)
Priority to JP2019182457A (JP2020022182A)
Priority to US17/501,859 (US20220038793A1)
Legal status: Active


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/434: Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N 21/234327: Processing of video elementary streams involving reformatting operations of video signals by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H04N 21/435: Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4621: Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • H04N 21/4728: End-user interface for requesting content, for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • H04N 21/6377: Control signals issued by the client directed to the server or network components, directed to server
    • H04N 21/8451: Structuring of content using Advanced Video Coding [AVC]
    • H04N 21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • H04N 21/85406: Content authoring involving a specific file format, e.g. MP4 format

Abstract

A method of encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each having plural spatial subsamples, comprises: selecting at least one spatial subsample from amongst spatial subsamples of one of the timed samples; and creating one tile track (i.e., temporal sequence of tiles: Figure 2) comprising the selected spatial subsample and one corresponding spatial subsample of each of the other timed samples. At least one composite track - adding data required and missing due to the tile track only relating to a temporal track of parts of each frame, i.e. adding the information to be decoded, required network abstraction layer (NAL) units, etc - is created, comprising an extractor (reference object) identifying the created tile tracks. Each created track is then independently encapsulated. Also claimed is a method and device for providing a timed media bit-stream, in a client device, from data thus encapsulated, comprising selecting information representing a spatial subsample, requesting a corresponding tile track, receiving plural media segment files, and replacing extractors by corresponding identified spatial subsamples. An initialization segment file may be created, allowing tiled spatial portion formation from tile and composite tracks.

Description

METHOD, DEVICE, AND COMPUTER PROGRAM FOR EFFICIENT
ENCAPSULATION OF TIMED TILED MEDIA DATA
FIELD OF THE INVENTION
The invention generally relates to the field of encapsulation of timed media data, e.g. according to Base Media File Format as defined by the MPEG standardization organization, to improve stream delivery, in particular regarding HTTP (HyperText Transfer Protocol) streaming of user-selected regions of interest in compressed video streams. More particularly, the invention concerns a method, device, and computer program for encapsulation of an elementary stream containing spatial tiles allowing efficient streaming of one or more tiles.
BACKGROUND OF THE INVENTION
Video coding is a way of transforming a series of video images into a compact digitized bit-stream so that the video images can be transmitted or stored. An encoding device is used to code the video images, with an associated decoding device being available to reconstruct the bit-stream for display and viewing. A general aim is to form the bit-stream so as to be of smaller size than the original video information. This advantageously reduces the capacity required of a transfer network, or storage device, to transmit or store the bit-stream code. To be transmitted, a video bit-stream is generally encapsulated according to a transmission protocol that typically adds headers and check bits.
Recently, the Moving Picture Experts Group (MPEG) published a new standard to unify and supersede existing streaming solutions over HTTP (HyperText Transfer Protocol). This new standard, called "Dynamic adaptive streaming over HTTP (DASH)", is intended to support a media-streaming model over HTTP based on standard web servers, in which intelligence (i.e. selection of media data to stream and dynamic adaptation of the bit-streams to user choices, network conditions, and client capabilities) relies exclusively on client choices and devices.
In this model, a media presentation is organized in data segments and in a manifest called "Media Presentation Description (MPD)" that represents the organization of timed media data to be presented. In particular, a manifest comprises resource identifiers to use for downloading data segments and provides the context to select and combine those data segments to obtain a valid media presentation.
Resource identifiers are typically HTTP URLs (Uniform Resource Locators), possibly combined with byte ranges. Based on a manifest, a client device determines at any time which media segments are to be downloaded from a media data server according to its needs, its capabilities (e.g. supported codecs, display size, frame rate, level of quality, etc.), and depending on network conditions (e.g. available bandwidth).
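This client-driven adaptation can be illustrated with a short sketch that picks the highest-bandwidth representation fitting the measured throughput. The dictionary layout and function name are assumptions for illustration, not part of the DASH specification.

```python
# Hypothetical DASH-style rate selection: choose the best representation
# whose advertised bandwidth (bits/s) fits the available bandwidth.

def select_representation(representations, available_bandwidth):
    """Return the highest-bandwidth representation not exceeding the
    available bandwidth, falling back to the lowest one."""
    by_bandwidth = sorted(representations, key=lambda r: r["bandwidth"])
    chosen = by_bandwidth[0]  # fallback: lowest quality
    for rep in by_bandwidth:
        if rep["bandwidth"] <= available_bandwidth:
            chosen = rep
    return chosen

representations = [
    {"id": "sd", "bandwidth": 1_000_000},
    {"id": "hd", "bandwidth": 5_000_000},
    {"id": "uhd", "bandwidth": 20_000_000},
]
```

A real DASH client would re-run such a decision for every segment, so the stream adapts as throughput varies.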
In addition, video resolution is continuously increasing, going from standard definition (SD) to high definition (HD), and to ultra-high definition (e.g. 4K2K or 8K4K, that is to say video comprising images of 4,096 x 2,400 pixels or 7,680 x 4,320 pixels).
However, not all receiving and video decoding devices have the resources (e.g. network access bandwidth or CPU (Central Processing Unit)) to access video in full resolution, in particular when the video is of ultra-high definition, and not all users need to access such video. In such a context, it is particularly advantageous to provide the ability to access only some Regions-of-Interest (ROIs), that is to say, to access only some spatial sub-parts of a whole video sequence.
A known mechanism to access spatial sub-parts of frames belonging to a video consists in organizing each frame of the video as an arrangement of independently decodable spatial areas generally referred to as tiles. Some video formats such as SVC (Scalable Video Coding) or HEVC (High Efficiency Video Coding) provide support for tile definition. A user-defined ROI may cover one or several contiguous tiles.
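The relationship between a ROI and the tiles it covers can be made concrete with a minimal sketch (all names assumed): given a frame divided into a uniform grid of independently decodable tiles, compute which (column, row) tiles a rectangular ROI overlaps.

```python
# Illustrative only: map a rectangular ROI onto a uniform tile grid.

def tiles_for_roi(frame_w, frame_h, cols, rows, roi):
    """roi is (x, y, w, h) in pixels; returns covered (col, row) indices."""
    tile_w, tile_h = frame_w // cols, frame_h // rows
    x, y, w, h = roi
    first_col, last_col = x // tile_w, min((x + w - 1) // tile_w, cols - 1)
    first_row, last_row = y // tile_h, min((y + h - 1) // tile_h, rows - 1)
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]
```

For example, for a 3840x2160 frame cut into a 4x4 grid (960x540 tiles), an 800x600 ROI at position (1000, 500) covers three vertically adjacent tiles in column 1.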
Accordingly, for streaming user-selected ROls according to HTTP protocol, it is important to provide encapsulation of timed media data of an encoded video bit-stream in a way that enables spatial access to one or more tiles and that enables combination of accessed tiles.
It is to be recalled that encoded video bit-streams are generally constructed as a set of contiguous temporal samples that correspond to complete frames, the temporal samples being organized as a function of the decoding order. File formats are used to encapsulate and describe such encoded bit-streams.
For the sake of illustration, the International Organization for Standardization Base Media File Format (ISO BMFF) is a well-known flexible and extensible format that describes encoded timed media data bit-streams either for local storage or transmission via a network or via another bit-stream delivery mechanism. This file format is object-oriented. It is composed of building blocks called boxes that are sequentially or hierarchically organized and that define parameters of the encoded timed media data bit-stream such as timing and structure parameters. According to this file format, the timed media data bit-stream is contained in a data structure referred to as the mdat box, which is defined in another data structure referred to as the track box. The track represents a timed sequence of samples where a sample corresponds to all the data associated with a single timestamp, that is to say, all the data associated with a single frame or all the data associated with several frames sharing the same timestamp.
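The box structure described above can be illustrated with a small parser: each ISO BMFF box starts with a 32-bit big-endian size (covering the whole box) followed by a four-character type code such as ftyp, moov or mdat. The sketch below handles only this basic layout (no 64-bit sizes, no nested traversal) and is not a complete ISO BMFF reader.

```python
import struct

def parse_top_level_boxes(data):
    """Return a list of (type, payload) pairs for the top-level boxes."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        boxes.append((box_type.decode("ascii"),
                      data[offset + 8: offset + size]))
        offset += size
    return boxes

def make_box(box_type, payload=b""):
    """Build a minimal box (size + type + payload) for demonstration."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# A tiny synthetic file: an 'ftyp' box followed by a short 'mdat' box.
sample_file = make_box(b"ftyp", b"isom") + make_box(b"mdat", b"\x00\x01")
```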
For scalable video such as video of the SVC format, the layered media data organization can be efficiently represented by using multiple dependent tracks, each track representing the video at a particular level of scalability. In order to avoid data duplication between tracks, extractors can be used. According to a standard file format, an extractor is a data structure directly included in a bit-stream that enables efficient extraction of network abstraction layer (NAL) units from other bit-streams. For instance, the bit-stream of an enhancement layer track may comprise extractors that reference NAL units from a base layer track. Later, when such an enhancement layer track is extracted from the file format, the extractors must be replaced by the data that they reference.
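Extractor resolution can be sketched as follows. The track is modelled as a list of entries that are either raw NAL-unit bytes ("nal", data) or references into another track ("extractor", track_id, sample_index); this tuple convention is an assumption for illustration, not the actual SVC file-format syntax.

```python
# Illustrative extractor resolution: replace each reference by the bytes
# it points to in the referenced track.

def resolve_extractors(track, tracks_by_id):
    """Return the flat bit-stream with every extractor substituted."""
    out = []
    for entry in track:
        if entry[0] == "nal":
            out.append(entry[1])            # own data, copied as-is
        else:
            _, ref_track, index = entry     # ("extractor", id, index)
            out.append(tracks_by_id[ref_track][index])
    return b"".join(out)

base_track = [b"\x01base0", b"\x01base1"]   # base layer NAL units
enhancement_track = [("extractor", 1, 0),   # points at base_track[0]
                     ("nal", b"\x02enh0")]  # own enhancement data
```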
Several strategies can be adopted when using ISO BMFF embedding these mechanisms to describe sub-information and to ease access to this sub-information or to efficiently organize bit-streams into multiple segments.
For example, in the article entitled "Implications of the ISO Base Media File Format on Adaptive HTTP Streaming of H.264/SVC", the authors, Kofler et al., present three different strategies for organizing a scalable video bit-stream (H.264/SVC) for HTTP streaming, considering the possibilities as well as the limitations of the ISO BMFF:
a) a single file containing a particular file header comprising a file type box "ftyp" and a movie box "moov" containing all ISO BMFF metadata (including track definitions), the single file also comprising a single mdat box containing the whole encoded bit-stream. This organization is suitable for local storage but is not adapted to HTTP streaming, where a client may only need a part of the whole bit-stream;
b) a single file containing multiple moof/mdat boxes suitable for fragmentation. This format allows for progressive download. The moof box is equivalent to the moov box at fragment level. According to this scheme, using a fragmented media file, the scalable bit-stream is split into multiple dependent tracks representing the video at different scalability levels. Extractors are used to reference NAL units from other tracks. In case a track per tile is used, all addressable tracks have to be prepared in advance and tracks cannot be selected independently. If several tiles are to be displayed, several bit-streams must be decoded and the base layer is decoded several times;
c) multiple segment files, each file being accessible by its own URL and being downloadable independently. Each segment typically consists of a segment type box (styp), which acts as a kind of file header, an optional segment index box (sidx), and one or multiple fragments. Again, each fragment consists of a moof and an mdat box. According to this scheme, using a fragmented media file, each track is stored in its own segment with the associated bit-stream related to one level of scalability. If necessary, extractors are used to reference the required bit-stream from dependent tracks.
Such a coding scheme is particularly suitable for streaming tracks independently. It is well adapted to the DASH standard but it is not suitable for tile streaming since several bit-streams are to be decoded and thus, one decoder per track is required. Moreover, there is a potential duplication of the base layer's bit-stream when selecting more than one tile.
When applied to spatial tiles, none of these strategies allows efficient access to specific tiles in the context of HTTP streaming. Indeed, with existing file format definitions, it would still be necessary to access multiple non-contiguous byte ranges in an encoded bit-stream, or it would result in bit-stream duplication, in order to display spatial tiles of several frames corresponding to a given time interval.
To solve these issues, there is provided an efficient data organization and track description scheme suitable for spatial tiles, which ensures, whatever track combination is selected by a client application, that the result of the ISO BMFF parsing always leads to a valid video elementary bit-stream for the video decoder.
SUMMARY OF THE INVENTION
Faced with these constraints, the inventors provide a method and a device for encapsulating tiled timed media data in a server and for providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files.
It is a broad object of the invention to remedy the shortcomings of the prior art as described above.
According to a first aspect of the invention there is provided a method of encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the method comprising: selecting at least one spatial subsample from amongst the plurality of spatial subsamples of one of the timed samples; for each selected spatial subsample, creating one tile track comprising the selected spatial subsample and one corresponding spatial subsample of each of the other timed samples; creating at least one composite track comprising at least one extractor identifying at least one of the created tile tracks; and independently encapsulating each created track in at least one media segment file.
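The encapsulation steps of the first aspect can be sketched with a simple in-memory model, all of it assumed for illustration: samples[t][i] holds the spatial subsample (tile i) of timed sample t as bytes, and extractors are plain tuples.

```python
# Illustrative encapsulation: one tile track per selected tile, plus one
# composite track of extractors referencing those tile tracks per sample.

def encapsulate(samples, selected_tiles):
    """Return (tile_tracks, composite_track) built from the timed samples."""
    tile_tracks = {
        tile: [sample[tile] for sample in samples]   # one track per tile
        for tile in selected_tiles
    }
    composite_track = [
        [("extractor", tile, t) for tile in selected_tiles]
        for t in range(len(samples))
    ]
    return tile_tracks, composite_track

# Two timed samples, each split into two spatial subsamples (tiles).
samples = [[b"A0", b"B0"], [b"A1", b"B1"]]
tile_tracks, composite_track = encapsulate(samples, [0, 1])
```

In the actual scheme each returned track would then be encapsulated in its own media segment file, so a client can download tile tracks independently.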
Accordingly, the invention enables the combination of different tiles and the creation of a valid file format and elementary stream at parsing time whatever the selected set of tiles downloaded by a client device.
Therefore, the method of the invention is suitable for efficient streaming of independent tiles, requires only useful data to be sent to a client device, is suitable for selection of more than one tile (i.e. an arbitrary ROI), reduces indexation overhead (compared to byte-range requests), and can be integrated in an MPEG standard.
In an embodiment, the at least one extractor further identifies at least one spatial subsample of the at least one identified tile track.
In an embodiment, the method further comprises a step of creating an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created one composite track. The initialization segment file may comprise references to the created tile tracks, a list of referenced tile tracks associated with the at least one created composite track, a tile handler associated with each of the created tile tracks indicating that the track contains spatial subsample information, and/or general presentation information associated with each of the created tile tracks.
In an embodiment, at least two tile tracks are created in the step of creating one tile track for each selected spatial subsample, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples comprised within the same timed sample.
In an embodiment, the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, tile tracks created in the step of creating one tile track for each selected spatial subsample are enhancement tile tracks comprising enhancement spatial subsamples, and the method further comprises a step of creating at least one reference track comprising reference spatial subsamples. The reference layer may comprise a plurality of reference spatial subsamples and thus, a plurality of reference tracks can be created, each reference track of the plurality of reference tracks forming a reference tile track.
In an embodiment, media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing metadata including definitions of the corresponding track. Media segment files resulting from encapsulating each of the created tracks may comprise a data structure for storing spatial subsamples. The media segment files resulting from encapsulating each of the created tracks may comprise a data structure for storing extractors.
In an embodiment, the server is compatible with the HyperText Transfer Protocol (HTTP).
In an embodiment, the media segment files resulting from encapsulating each of the created tracks are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Organization for Standardization.
A second aspect of the invention provides a method of providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, the method comprising: selecting an item of information representative of a spatial subsample; requesting the at least one tile track comprising the selected spatial subsample, the selected spatial subsample corresponding to the selected item of information; receiving the plurality of media segment files; and replacing extractors of the composite track by corresponding identified spatial subsamples to generate the timed media data bit-stream.
Accordingly, the invention enables the combination of different tiles and the creation of a valid file format and elementary stream at parsing time whatever the selected set of tiles downloaded by a client device.
Therefore, the method of the invention is suitable for efficient streaming of independent tiles, requires only useful data to be sent to a client device, is suitable for selection of more than one tile (i.e. an arbitrary ROI), reduces indexation overhead (compared to byte-range requests), and can be integrated in an MPEG standard.
In an embodiment, the method further comprises: obtaining a track reference from an extractor of the composite track; checking whether or not the track corresponding to the obtained track reference has been received; and, if the track corresponding to the obtained track reference has not been received, removing the corresponding extractor to generate the timed media data bit-stream.
In an embodiment, the method further comprises a step of verifying that the track corresponding to the obtained track reference is a track of tile type.
In an embodiment, the method further comprises: obtaining a track reference from an extractor of the composite track; checking whether or not the track corresponding to the obtained track reference has been received; and, if the track corresponding to the obtained track reference has not been received, replacing the corresponding extractor by padding to generate the timed media data bit-stream.
In an embodiment, the method further comprises a step of verifying that the track corresponding to the obtained track reference is a track of tile type.
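The two embodiments above, removing an extractor or replacing it by padding when the referenced tile track was not received, can be sketched as follows; the tuple convention for extractors is an assumed model for illustration.

```python
# Illustrative client-side reconstruction: resolve extractors against the
# tracks actually received; missing references are dropped, or substituted
# by padding when a padding value is supplied.

def rebuild_bitstream(composite_sample, received_tracks, padding=None):
    """Return the bit-stream bytes for one composite sample."""
    out = []
    for _, track_id, index in composite_sample:
        if track_id in received_tracks:
            out.append(received_tracks[track_id][index])
        elif padding is not None:
            out.append(padding)   # replace missing tile by padding
        # else: the extractor is simply removed
    return b"".join(out)

received = {0: [b"A0"]}                              # only tile track 0 arrived
sample = [("extractor", 0, 0), ("extractor", 1, 0)]  # references tiles 0 and 1
```

Either branch yields a decodable stream without waiting for tiles the client never requested.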
In an embodiment, the method further comprises a step of receiving an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created one composite track.
In an embodiment, at least two media segment files representative of at least two tile tracks are received, the composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples belonging to the same set of spatial subsamples.
In an embodiment, tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, and at least two media segment files representative of at least two tracks are received, one of the at least two received tracks being a reference track comprising reference spatial subsamples and the other of the at least two received tracks being an enhancement tile track comprising enhancement spatial subsamples. The reference layer may comprise a plurality of reference spatial subsamples and thus, a plurality of reference tracks, each forming a reference tile track, can be received.
In an embodiment, the client device is compatible with the HyperText Transfer Protocol (HTTP).
In an embodiment, the received media segment files are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Organization for Standardization.
According to a third aspect of the invention there is provided a device for encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the device comprising at least one microprocessor configured for carrying out the steps of: selecting at least one spatial subsample from amongst the plurality of spatial subsamples of one of the timed samples; for each selected spatial subsample, creating one tile track comprising the selected spatial subsample and one corresponding spatial subsample of each of the other timed samples; creating at least one composite track comprising at least one extractor identifying at least one of the created tile tracks; and independently encapsulating each created track in at least one media segment file.
Accordingly, the invention enables the combination of different tiles and the creation of a valid file format and elementary stream at parsing time whatever the selected set of tiles downloaded by a client device.
Therefore, the device of the invention is suitable for efficient streaming of independent tiles, requires only useful data to be sent to a client device, is suitable for more than one tile selection (i.e. arbitrary ROI), reduces indexing overhead (compared to byte-range requests), and can be integrated in an MPEG standard.
In an embodiment, the microprocessor is further configured for carrying out the step of creating an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track, the initialization segment file comprising references to the created tile tracks, a list of referenced tile tracks associated with the at least one created composite track, a tile handler associated with each of the created tile tracks indicating that the track contains spatial subsample information, and/or general presentation information associated with each of the created tile tracks.
In an embodiment, the microprocessor is further configured so that at least two tile tracks are created in the step of creating one tile track for each selected spatial subsample, the composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples comprised within the same timed sample.
In an embodiment, the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, the microprocessor being further configured so that tile tracks created in the step of creating one tile track for each selected spatial subsample are enhancement tile tracks comprising enhancement spatial subsamples, the microprocessor being further configured for carrying out the step of creating at least one reference track comprising reference spatial subsamples.
The reference layer may comprise a plurality of reference spatial subsamples and the microprocessor may be further configured so that a plurality of reference tracks are created, each reference track of the plurality of reference tracks forming a reference tile track.
In an embodiment, media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing metadata including definitions of the corresponding track. Media segment files resulting from encapsulating each of the created tracks may comprise a data structure for storing spatial subsamples. Media segment files resulting from encapsulating each of the created tracks may comprise a data structure for storing extractors.
In an embodiment, the server is compatible with the HyperText Transfer Protocol and the media segment files resulting from encapsulating each of the created tracks are compatible with the Base Media File Format as defined by the International Organization for Standardization.
According to a fourth aspect of the invention there is provided a video encoder comprising the device as described above.
According to a fifth aspect of the invention there is provided a device for providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the at least one composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, the device comprising at least one microprocessor configured for carrying out the steps of: selecting an item of information representative of a spatial subsample; requesting the at least one tile track comprising the selected spatial subsample, the selected spatial subsample corresponding to the selected item of information; receiving the plurality of media segment files; and replacing extractors of the at least one composite track by corresponding identified spatial subsamples to generate the timed media data bit-stream.
Accordingly, the invention enables the combination of different tiles and the creation of a valid file format and elementary stream at parsing time whatever the selected set of tiles downloaded by a client device.
Therefore, the device of the invention is suitable for efficient streaming of independent tiles, requires only useful data to be sent to a client device, is suitable for more than one tile selection (i.e. arbitrary ROI), reduces indexing overhead (compared to byte-range requests), and can be integrated in an MPEG standard.
In an embodiment, the microprocessor is further configured for carrying out the steps of: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, removing the corresponding extractor to generate the timed media data bit-stream.
In an embodiment, the microprocessor is further configured for carrying out the step of verifying that the track corresponding to the obtained track reference is a track of tile type.
In an embodiment, the microprocessor is further configured for carrying out the steps of: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, replacing the corresponding extractor by padding to generate the timed media data bit-stream.
In an embodiment, the microprocessor is further configured for carrying out the step of verifying that the track corresponding to the obtained track reference is a track of tile type.
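The two extractor-handling strategies described above (removing an extractor whose tile track was not received, or replacing it by padding) can be sketched as follows. This is a minimal illustration, not the actual ISO BMFF parsing machinery: an extractor is modelled as a simple tuple placeholder and the padding bytes are an arbitrary stand-in.

```python
def resolve_composite(composite, received_tracks, missing="remove"):
    """Rebuild a bit-stream from a simplified composite track.

    composite: list of items; an item is either raw NAL-unit bytes (data
    common to all tiles), or an ("extractor", track_id, sample_index)
    tuple referencing a sample in a tile track.
    received_tracks: dict mapping track_id -> list of per-sample bytes.
    missing: "remove" drops extractors whose track was not received,
             "pad" replaces them with placeholder padding bytes.
    """
    out = []
    for item in composite:
        if isinstance(item, bytes):          # NAL units stored in the composite track itself
            out.append(item)
        else:
            _, track_id, sample = item
            if track_id in received_tracks:  # replace extractor by the referenced tile data
                out.append(received_tracks[track_id][sample])
            elif missing == "pad":           # substitute arbitrary padding for the missing tile
                out.append(b"\x00PAD")
            # "remove": simply skip the extractor
    return b"".join(out)
```

With tile track 2 received but track 3 missing, the "remove" strategy yields only the common data plus tile 2, while "pad" additionally emits the padding placeholder in place of tile 3.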
In an embodiment, the microprocessor is further configured for carrying out the step of receiving an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track.
In an embodiment, the microprocessor is further configured so that at least two media segment files representative of at least two tile tracks are received, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples belonging to the same set of spatial subsamples.
In an embodiment, the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, and the microprocessor is further configured so that at least two media segment files representative of at least two tracks are received, one of the at least two received tracks being a reference track comprising reference spatial subsamples and the other of the at least two received tracks being an enhancement tile track comprising enhancement spatial subsamples.
In an embodiment, the client device is compatible with the HyperText Transfer Protocol (HTTP) and the received media segment files are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Organization for Standardization.
According to a sixth aspect of the invention there is provided a video decoder comprising the device as described above.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Further advantages of the present invention will become apparent to those skilled in the art upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated herein.
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1, comprising Figures 1a and 1b, illustrates an example of coding tiles in an encoded video bit-stream;
Figure 2 illustrates a temporal pipe of tiles selected by a user to be displayed;
Figure 3 illustrates an example of block diagrams of an initialization segment file using media data tracks according to a particular embodiment;
Figure 4 illustrates an example of block diagrams of media segment files comprising tile tracks and one composite track, according to a particular embodiment, corresponding to tracks declared in an initialization segment file such as the one shown in Figure 3;
Figure 5 illustrates an example of concatenating downloaded media data segments to build a valid decodable timed media data bit-stream representing a spatial part of consecutive video frames for a given temporal period;
Figure 6, comprising Figure 6a and Figure 6b, is a flow chart illustrating steps for transmitting timed media data between a server and a client device according to a particular embodiment;
Figure 7 is a flow chart illustrating the generation of a valid timed media data bit-stream from concatenated media segments received by a client device; and
Figure 8 represents a block diagram of a server or a client device in which steps of one or more embodiments may be implemented.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
According to a particular embodiment, tiled timed media data (e.g. video data) comprising timed samples (e.g. images) are transmitted as a set of several timed media data tracks, typically a base layer track and several tile tracks, and a composite track that comprises references to timed media data tracks. Each tile track comprises one spatial subsample (e.g. several NAL units) of several timed samples. An extended extractor type is defined for referencing timed media data tracks from a composite track. Timed media data tracks are labeled as not displayable and convey and describe timed media data for tiles. Such a set of timed media data tracks and a composite track allows the selecting, composing, and efficient streaming of spatial video tiles. Each track can be transmitted from a server device to a client device as a set of media segment files. An initialization segment file can be used to transmit metadata required to decode media segment files.
Figure 1, comprising Figure 1a and Figure 1b, illustrates an example of coding tiles in an encoded video bit-stream.
For the sake of illustration, it is considered in the following description that each video frame (timed sample) is composed of independently decodable tiles corresponding to spatial sub-parts (spatial subsamples) of the video frame. The video is preferably scalable and organized in different levels of scalability. As illustrated in Figure 1a, a video frame 100 may comprise an HD base layer (102) and a 4K2K enhancement layer (104). Still for the sake of illustration, enhancement layer 104 can be divided into four regular tiles denoted a, b, c, and d. It is to be noted that tiles of different shapes may be handled. Likewise, base layer 102 can be divided into several tiles. In such a case, several composite tracks can be used, for example one for the base layer and one for the enhancement layers or for each of the enhancement layers.
It is also to be noted that the invention is not limited to scalable video format. It can apply to all video formats allowing tiles to be decoded independently.
Accordingly, any video compression algorithms like MPEG4, AVC, HEVC, SVC, or future SHVC can be used in conjunction with an embodiment of the invention.
Figure 1b represents a typical encoded video bit-stream in decoding order. As illustrated, the encoded video bit-stream comprises here three video frames (110, 112, and 114) encoded in a temporal order. Each video frame comprises all the network abstraction layer (NAL) units of the base layer (BL) followed by the NAL units of the enhancement layer. For example, the NAL units (1BL, 116) of the base layer (102-1) of the first video frame (110) are followed by the NAL units (1common, 1a, 1b, 1c, 1d, 118) of the enhancement layer (104-1) of the first video frame.
The part of the video bit-stream corresponding to an enhancement layer with spatial tiles is composed of NAL units of each tile. Optionally, it may also contain NAL units that are common to all tiles and that are required to decode any of the tiles.
The NAL units that are common to all tiles of a given frame can be located anywhere in the corresponding part of the video bit-stream (i.e. before, between, or after the NAL units of the tiles of the video frame).
As illustrated, the part of the video bit-stream corresponding to the enhancement layer of the first video frame (110), comprising spatial tiles a, b, c, and d, is composed of NAL units for each tile (1a, 1b, 1c, and 1d) and of NAL units (1common) that are common to all tiles a, b, c, and d.
Figure 2 illustrates a temporal pipe of tiles selected by a user to be displayed. More precisely, Figure 2 represents a first video frame n and a second video frame n+m (where n and m are integer values), each of the first and second video frames comprising twelve tiles numbered 1 to 12. Amongst these twelve tiles, only the third and seventh ones are to be displayed (as denoted with bold lines). Video frames n and n+m belong to a series of consecutive frames corresponding to a given temporal period. Therefore, the third and seventh tiles of each frame from frame n to frame n+m are displayed consecutively.
As illustrated in Figure 1, the data of a video bit-stream are organized as temporal samples that correspond to full frames. Accordingly, it is required to access several small byte ranges for each frame when particular spatial areas of these frames are to be accessed during a given period of time as described above by reference to Figure 2. This is inefficient in HTTP streaming in terms of the number of generated requests and in terms of data overhead.
Therefore, to provide a more efficient access in compressed videos for ROI streaming, the timed media data bit-stream is to be reorganized so that data of a particular tile are organized as a contiguous byte range (forming a pipe) for a given time period (i.e. a set of consecutive frames).
Accordingly, when only a spatial sub-part of video frames is to be displayed, only pipes of tiles corresponding to the selected spatial area must be downloaded (for instance tiles 3 and 7 in Figure 2) using one HTTP request per pipe and per time period.
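The "one HTTP request per pipe and per time period" scheme can be sketched by enumerating the segment URLs a client would request for a selected tile set. The URL naming pattern below is a hypothetical example for illustration only; the actual segment names come from the media presentation's manifest, not from the invention itself.

```python
def tile_segment_urls(base_url, selected_tiles, period_index):
    """Return the media segment URLs to request for one temporal period.

    One request is issued per selected tile pipe, plus one for the base
    layer (scalable case) and one for the composite track, which are
    always needed to rebuild a decodable bit-stream.
    selected_tiles: set of tile numbers (e.g. {3, 7} as in Figure 2).
    """
    urls = [
        f"{base_url}/base_seg{period_index}.mp4",       # base layer track
        f"{base_url}/composite_seg{period_index}.mp4",  # composite track
    ]
    # one contiguous-byte-range pipe per selected tile
    urls += [f"{base_url}/tile{t}_seg{period_index}.mp4"
             for t in sorted(selected_tiles)]
    return urls
```

For tiles 3 and 7 of Figure 2, a single period thus costs four requests instead of many small byte-range requests per frame.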
Figure 3 illustrates an example of block diagrams of an initialization segment file using media data tracks according to a particular embodiment.
The initialization segment file illustrated in Figure 3 and the media segment files illustrated in Figure 4 illustrate an embodiment of the invention to reorganize and encapsulate a video bit-stream in ISO Base Media File Format while several tracks are streamed independently. For the sake of illustration, the content of Figures 3 and 4 is based on the video bit-stream illustrated in Figure 1.
To reorganize and encapsulate a video bit-stream in ISO BMFF while several tracks are streamed independently, a new type of track called tile track is defined. A tile track is a track that is, by definition, a timed sequence of related samples where a sample represents all the data associated with a single timestamp. Contrary to a known video media track where a sample is typically an individual video frame, a sample of a tile track defines a spatially-defined sub-part of a full video frame.
Accordingly, tile tracks only contain NAL units related to a given tile. In this way, it is possible to create pipes of tiles with contiguous byte ranges by storing each track in independent segment files.
However, since a tile track does not contain all the required NAL units and information to be decoded and played, an additional track called "composite track" is used. According to a particular embodiment, a composite track is a media track representing a complete tiled frame (i.e. the composition of all tiles). A composite track is the representation of an encoded timed media data bit-stream in its typical organization as described by reference to Figure 1 (frame by frame in decoding order).
The composite track uses extractor objects to refer to NAL units in their respective tile tracks. In addition it may contain NAL units that are common to all tiles if such NAL units exist.
According to a particular embodiment, an initialization segment file is used to transmit all the metadata that are necessary to define timed media data bit-streams encapsulated in other media segment files. As illustrated in Figure 3, the initialization segment file 300 contains a file type box "ftyp" 302 and a movie box "moov" 304. File type box 302 preferably identifies which ISO BMFF specifications the segment files comply with and indicates a version number of that specification. Movie box "moov" 304 provides all the metadata describing the presentation stored in media segment files and in particular all tracks available in the presentation.
Movie box "moov" 304 contains a definition for each of the tracks ("track" boxes 306-1 to 306-6) corresponding to the scalable video bit-stream provided as an example in Figure 1.
Track box 306-1 represents the base layer (track_ID=1), the four track boxes 306-2 to 306-5 (track boxes 306-3 and 306-4 not being shown) represent the four tiles a, b, c, and d of the enhancement layer (track_ID=2 to 5), and track box 306-6 represents a composite track describing the enhancement layer (track_ID=6).
Each track box contains at least a track header box "tkhd", generically referenced 308, and a track media box "mdia" generically referenced 310. If a track depends on data from other tracks, there is also a track reference box "tref". As illustrated, the composite track having the identifier track_ID=6 comprises track reference box "tref" 312 indicating that the track depends on data from the tracks having the identifiers track_ID=1 to 5.
It is to be noted that other boxes may be mandatory or optional depending on ISO BMFF specifications used to encapsulate the timed media data bit-stream.
However, since embodiments of the invention do not rely on these boxes to be applicable, they are not presented here.
Track header box "tkhd" 308 specifies the characteristics of the track.
Among several items of information, it provides the track's identifier (track_ID), the track's duration, and/or the track's visual presentation size (i.e. width and height of the display area). It also comprises a flag parameter that indicates whether or not the track is playable.
According to an embodiment, the default value of the track header flag for tile tracks is 0 (track_enabled = 0, track_in_movie = 0, track_in_preview = 0), meaning that tile tracks are ignored for local playback and preview by a client device. In another embodiment, a new track header flag can be created to signal that the track is a tile track.
Track media box "mdia" 310 can be seen as a container containing all the objects used to declare parameters of timed media data within a track. It contains at least a media header box "mdhd" generically referenced 314, a handler reference box "hdlr" generically referenced 316, and a media information box "minf" generically referenced 318.
Handler reference box "hdlr" 316 declares the process by which the timed media data of the track is to be presented and thus, the nature of the timed media data in the track. For example, a video track would be handled by a video handler (noted with the handler type attribute equal to 'vide'). A video sample can be described by the use of an object of the VisualSampleEntry() type. According to a particular embodiment, a new handler type, called a tile handler (noted with the handler type attribute equal to 'tile'), is defined to indicate that the track contains spatial subsample information. Depending on the coding format, if an object of the VisualSampleEntry() type cannot describe a sample in a tile track, it is possible to define a specific object of the TileSampleEntry() type to describe the sample.
Media information box "minf" 318 contains all the objects that define characteristic information of the timed media data in the track. For instance, for the base layer and for the enhancement layer defined in the composite track, the "minf" box may contain a standard video media header box "vmhd" (320).
Regarding tile tracks, a specific box referred to as Tile Media Header Box ("tmhd", 322), corresponding to the new tile handler, is used to define general presentation information, independent of the coding, for tiles. In particular, it may contain the geometry information of the spatial area covered by the tile relative to the video resolution represented by the composite video track, which can be defined as follows:

    aligned(8) class TileMediaHeaderBox extends FullBox('tmhd', version = 0, 0) {
        unsigned int(16) horizontal_offset;
        unsigned int(16) vertical_offset;
    }

As described previously, the composite track comprises specific track reference box "tref" 312 that provides typed reference to another track in the presentation. According to a particular embodiment, such typed references may comprise a "tile" reference (324) that can be used to establish a link from the composite track to the tile track that it refers to and a "scal" reference (326) that can be used to establish a link from the track containing this reference to the timed media data track that it depends on (e.g. the base layer track (track_ID=1)).
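The TileMediaHeaderBox layout defined above can be sketched as a short serialization routine. This is an illustrative sketch only: it assumes the standard ISO BMFF full-box layout (32-bit size, 4-byte type, 8-bit version, 24-bit flags) around the two 16-bit offsets, with all fields big-endian.

```python
import struct

def make_tmhd(horizontal_offset, vertical_offset):
    """Serialize a TileMediaHeaderBox ('tmhd') as defined above.

    Layout: 32-bit box size, 4-byte type, then the FullBox version (8 bits)
    and flags (24 bits), then the two 16-bit tile offsets; big-endian
    throughout, per ISO BMFF conventions.
    """
    payload = struct.pack(">BBBBHH",
                          0, 0, 0, 0,                 # version = 0, flags = 0
                          horizontal_offset, vertical_offset)
    size = 8 + len(payload)                           # 8-byte box header
    return struct.pack(">I", size) + b"tmhd" + payload
```

A tile covering the top-right quarter of a 1280-wide composite video would, for instance, carry horizontal_offset = 640 and vertical_offset = 0.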
Figure 4 illustrates an example of block diagrams of media segment files comprising tile tracks and one composite track, according to a particular embodiment, corresponding to tracks declared in an initialization segment file such as the one shown in Figure 3. As described previously, the initialization segment file illustrated in Figure 3 and the media segment files illustrated in Figure 4 illustrate an embodiment of the invention to reorganize and encapsulate a video bit-stream in ISO Base Media File Format while several tracks are streamed independently.
As illustrated in Figure 4, each of media segment files 400-1 to 400-6 (media segment files 400-3 to 400-5 being not shown) comprises, as specified in the DASH standard, a segment type box "styp" generically referenced 402, at least one movie fragment box "moot" generically referenced 404, and at least one media data box "mdat" generically referenced 406. A media segment file is associated with an H TTP-U RL.
It is to be noted that other boxes may be mandatory or optional depending on ISO BMFF specifications used to encapsulate the timed media data bit-stream.
However, since embodiments of the invention do not rely on these boxes to be applicable, they are not presented here.
The format of segment type box "styp" 402 is similar to that of file type box "ftyp" 302 in Figure 3; however, its reference indicates that the file is a media segment file.
Movie fragment box 404 provides the information that is generally stored within the movie box "moov". Its header ("mfhd") contains a sequence number (labeled seq_num in Figure 4) that increases for each movie fragment. Such a sequence number allows a client device to concatenate received segment files in increasing order and to verify integrity of the sequence (if necessary). The movie fragment box 404 contains a track fragment box "traf" (generically referenced 408) for each track having data in an associated media data box ("mdat", 406). Track fragment box 408 comprises a track fragment header box "tfhd", generically referenced 410, that is used to store an identifier (track_ID) of the track's bit-stream present in the corresponding media data box ("mdat", 406).
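The seq_num-based concatenation and integrity check described above can be sketched as follows; the fragment representation is a simplified placeholder, not a real "moof" parser.

```python
def concatenate_fragments(fragments):
    """Concatenate received movie fragments in seq_num order.

    fragments: list of (seq_num, payload_bytes) pairs, in any reception
    order. Raises ValueError if a sequence number is missing, i.e. the
    integrity check on consecutive seq_num values fails.
    """
    ordered = sorted(fragments)                       # sort by seq_num
    nums = [n for n, _ in ordered]
    if nums != list(range(nums[0], nums[0] + len(nums))):
        raise ValueError("missing movie fragment in sequence")
    return b"".join(payload for _, payload in ordered)
```

Fragments received out of order (e.g. 2, 1, 3) are thus reassembled into the original presentation order before decoding.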
A media data box generally contains timed media data. In standard video tracks, it contains video frames. In tile tracks, media data box 406 contains spatially-defined sub-parts of complete video frames. For the sake of illustration, the media data box associated with track identifier track_ID=2 contains all NAL units corresponding to the tile of the enhancement layer.
In a composite track (track_ID=6 in Figure 4), media data box 406 contains extractors (labeled E in Figure 4) for each tile and for each dependent layer and contains NAL units common to all tiles (if any).
As illustrated in Figure 4, the media data box 406 of media segment file 400-6 associated with a composite track comprises, in particular:
- a first extractor 412-1 that provides a link to base layer data (NAL units 1BL) encoded within a base layer track stored in the media data box 406 of media segment file 400-1 associated with a base layer track;
- NAL units 412-2 that are common to several tiles;
- a second extractor 412-3 that provides a link to enhancement layer data (NAL units 1a) of a first tile encoded within media data box 406 of media segment file 400-2 associated with a first tile track of the enhancement layer;
- a third extractor 412-4 that provides a link to enhancement layer data (NAL units 1b) of a second tile encoded within media data box 406 of media segment file 400-3 (not shown) associated with a second tile track of the enhancement layer;
- a fourth extractor 412-5 that provides a link to enhancement layer data (NAL units 1c) of a third tile encoded within media data box 406 of media segment file 400-4 (not shown) associated with a third tile track of the enhancement layer; and
- a fifth extractor 412-6 that provides a link to enhancement layer data (NAL units 1d) of a fourth tile encoded within media data box 406 of media segment file 400-5 (not shown) associated with a fourth tile track of the enhancement layer.
NAL units that may be obtained thanks to extractor 412-1 allow the decoding of a base layer of a frame whose enhancement layer may be fully decoded using NAL units 412-2 and NAL units that may be obtained thanks to extractors 412-3 to 412-6. As can be observed from Figure 4, if only a spatial part of a frame is to be decoded, it is not necessary to download all media segment files 400-2 to 400-5 (i.e. bit-streams corresponding to tile tracks).
According to a particular embodiment, an extractor is a file format internal structure having the following syntax:

    class aligned(8) Extractor () {
        NALUnitHeader();
        unsigned int(8) track_ref_index;
        signed int(8) sample_offset;
        unsigned int((lengthSizeMinusOne+1)*8) data_offset;
        unsigned int((lengthSizeMinusOne+1)*8) data_length;
    }
where NALUnitHeader() represents the first four bytes of a NAL unit compliant with the coding format used to encode the video bit-stream. These four bytes identify the NAL unit as an extractor (e.g., in SVC the attribute nal_unit_type is set to the extractor NAL unit type (type 31)).
The value track_ref_index specifies the index to be used, in the track reference box "tref" of type 'scal' or 'tile' of a composite track, for finding the track from which data are to be extracted. The value sample_offset gives the relative index of the sample in the linked track that is to be used as the source of information. The values data_offset and data_length are the offset of the first byte within the reference sample to copy and the number of bytes to copy, respectively.
For the sake of illustration and with reference to Figure 3, if the value of track_ref_index of a given extractor is equal to 2, this means that the extractor references the track identified by the second entry in the tref box (i.e. the track having identifier track_ID=2, that is the tile track for tile a, the first index representing the reference track (e.g. base layer)).
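The extractor syntax above can be sketched as a small parsing routine. This is an illustrative sketch only: it assumes lengthSizeMinusOne = 3 (i.e. 32-bit data_offset and data_length fields) and treats the 4-byte NALUnitHeader() as opaque.

```python
import struct

def parse_extractor(data, length_size_minus_one=3):
    """Parse the extractor fields following its 4-byte NALUnitHeader().

    Returns (track_ref_index, sample_offset, data_offset, data_length).
    Assumes lengthSizeMinusOne = 3, so offset/length are 32-bit fields.
    """
    field_bytes = length_size_minus_one + 1
    track_ref_index = data[4]                           # unsigned int(8)
    sample_offset = struct.unpack(">b", data[5:6])[0]   # signed int(8)
    pos = 6
    data_offset = int.from_bytes(data[pos:pos + field_bytes], "big")
    pos += field_bytes
    data_length = int.from_bytes(data[pos:pos + field_bytes], "big")
    return track_ref_index, sample_offset, data_offset, data_length
```

For example, track_ref_index = 2 with sample_offset = -1 would reference the previous sample of the tile a track of Figure 3.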
Figure 5 illustrates an example of concatenating downloaded media data segments to build a valid decodable timed media data bit-stream representing a spatial part of consecutive video frames for a given temporal period (corresponding here to two consecutive frames). The same figure could be repeated for other temporal periods.
As described with reference to Figures 3 and 4, a timed tiled media data bit-stream is preferably transmitted as a set of data comprising one initialization segment file and a plurality of media segment files, the latter comprising several tile tracks and one composite track.
The initialization segment file comprises a movie box ("moov") that provides general information on each track, in particular the type of track (e.g. media track (audio or video) or tile track), a coding format, a frame resolution and the dependence among tracks (given in a track reference box "tref"). These data are used to process downloaded media segment files. Referring to the example described with reference to Figures 1, 3, and 4, the content of the movie box of the initialization segment file can comprise, in particular, the following:
MOOV
- track 1: base layer
- track 2: tile a
- track 3: tile b
- track 4: tile c
- track 5: tile d
- track 6: enhancement layer
  o tref (scal): track_ID = 1
  o tref (tile): track_ID = 2, track_ID = 3, track_ID = 4, track_ID = 5

Figure 5 roughly illustrates the file format obtained by concatenating media segments when only required media segment files (corresponding here to the base layer and tiles a and c in the enhancement layer) are downloaded from a server. It is to be noted that not only does such a mechanism allow downloading of only the required media segment files but it also prevents downloading of duplicate data.
As illustrated, composite track 500 allows the building of a valid decodable timed media data bit-stream 502 by referencing data from base layer track 504 (in case of scalability) and from non-playable tile tracks (506 and 508) and by appropriately handling extractors referencing missing data (as described by reference to Figure 7).
The obtained file format is compliant with the scalable file format definition. For example, a client device can decide to play only the base layer by selecting the base layer track or the enhancement layer with the chosen tiles a and c by selecting the composite track. The client device can also change the tiles to be displayed by downloading different "tile tracks" (i.e. media segment files) in a later temporal period while it continues to play the composite track.
Figure 6, comprising Figure 6a and Figure 6b, is a flow chart illustrating steps for transmitting timed media data between a server and a client device according to a particular embodiment. Steps shown in Figure 6a are implemented within the server to prepare a media presentation by creating segment files adapted to ROI streaming from a tiled timed media data bit-stream while steps shown in Figure 6b are implemented within the client device.
In a first step (step 600), the server identifies all NAL units that are associated with the tiles and, for each tile, creates a tile track containing sub-samples composed of all NAL units corresponding to the given tile. For example, the server may rely on sub-picture level SEI messages to identify the association of NAL units with different regions and on sequence-level SEI messages for identifying the position and size of each ROI as has been proposed in HEVC standardization (proposal JCTVC-K0128). Accordingly, the server can create pipes of tiles for given periods of time.
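Step 600 can be sketched as a simple grouping operation. In the illustrative Python below, the (tile identifier, NAL unit) pairing stands in for the SEI-based association described above; all names are hypothetical:

```python
# Hypothetical sketch of step 600: build one tile track per tile by
# collecting, frame by frame, the NAL units associated with that tile.
# The (tile_id, nalu) pairing stands in for the SEI-based association.

def build_tile_tracks(frames):
    """frames: one list of (tile_id, nalu) pairs per timed sample."""
    tracks = {}
    for frame_index, frame in enumerate(frames):
        for tile_id, nalu in frame:
            # each entry becomes a sub-sample of the tile track
            tracks.setdefault(tile_id, []).append((frame_index, nalu))
    return tracks

frames = [[("a", b"nalu-0a"), ("b", b"nalu-0b")],
          [("a", b"nalu-1a"), ("b", b"nalu-1b")]]
tracks = build_tile_tracks(frames)
print(sorted(tracks))    # ['a', 'b']
print(len(tracks["a"]))  # 2 sub-samples, one per timed sample
```

Each resulting list corresponds to one "pipe" of tile data over a period of time: the sub-samples of a single tile track, one per timed sample.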
In a following step (step 602), the server creates a composite track containing extractors linked to lower level scalability tracks (if any), NAL units common to all tiles, and extractors linked to each tile track. Extractors and common NAL units are advantageously ordered such that replacing the extractors by the data they are referencing results in a valid timed media data bit-stream composed of complete samples in decoding order (as described with reference to Figure 1).
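The ordering constraint of step 602 can be illustrated as follows. In this Python sketch the Extractor fields are the parameters later listed with reference to Figure 7, but the builder itself and the zero-length convention ("copy the whole referenced sample") are assumptions for illustration:

```python
# Illustrative composite-track sample for step 602: NAL units common to
# all tiles come first, followed by one extractor per tile track, so
# that resolving the extractors in place yields a complete sample in
# decoding order. The builder and the 0-length convention are assumptions.

from dataclasses import dataclass

@dataclass
class Extractor:
    track_ref_index: int  # index into the composite track's 'tref' box
    sample_offset: int    # 0 = the time-aligned sample of that track
    data_offset: int      # byte offset inside the referenced sample
    data_length: int      # bytes to copy (0 here: whole sample)

def build_composite_sample(common_nalus, tile_ref_indexes):
    sample = list(common_nalus)
    sample += [Extractor(i, 0, 0, 0) for i in tile_ref_indexes]
    return sample

sample = build_composite_sample([b"\x40\x01"], [1, 2, 3, 4])
print(len(sample))  # 5: one common NALU followed by four extractors
```

Because the common data (e.g. parameter sets) precede the per-tile extractors, substituting each extractor with the data it references produces a valid sample without any reordering.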
Next, in step 604, the server generates and stores an initialization segment file and media segment files, each containing a temporal period, according to the ISO BMFF representation, as described with reference to Figures 3 and 4. All the timed media data tracks (e.g. video tracks), composite tracks, and tile tracks are stored in separate media segment files.
The server then serves, on request, the initialization and media segment files to a client device (step 606). The server may be a conventional HTTP server that responds to HTTP requests.
In the context of HTTP streaming and in a preferred embodiment, it is assumed that the client device has access to a manifest file describing the media presentation available from the server. This manifest file provides sufficient information (media properties and a list of segments) for the client device to stream the media presentation by first requesting the initialization segments and then media segment files from the server.
Upon selection of an ROI at the client device end during the streaming of a tiled video (typically on a display, with selecting means such as a pointing device), the tiles corresponding to the selected ROI are determined (step 608).
Next, for each temporal period, in the case of scalable media data, the client device sends a request to the server to download the segment files corresponding to dependent layers (step 610). According to a particular embodiment, the layers that are depended upon are downloaded before the layers that depend on them. For example, base layer segment files are downloaded before enhancement layer segment files.
In a following step, the client device sends a request to the server to download the media segment files corresponding to selected tiles (step 612) and sends a request to the server to download the media segment files corresponding to the composite track (step 614).
Next, the downloaded segment files are concatenated by the client device to build a valid (decodable) timed media data bit-stream conforming to the ISO BMFF standard (step 616), corresponding to the selected ROI.
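Steps 608 to 616 can be sketched as a download-and-concatenate loop. The URL scheme and the fetch callable below are purely illustrative assumptions; the point is the request order: dependent layers first (step 610), then the selected tile tracks (step 612), then the composite track (step 614):

```python
# Illustrative client loop for steps 608-616. The URL scheme and the
# fetch callable are assumptions; only the request order matters:
# dependent layers (610), then selected tiles (612), then the
# composite track (614), concatenated in that order (616).

def roi_segment_urls(period, roi_tiles):
    urls = ["base_%d.mp4" % period]                               # step 610
    urls += ["tile_%s_%d.mp4" % (t, period) for t in roi_tiles]   # step 612
    urls.append("composite_%d.mp4" % period)                      # step 614
    return urls

def download_and_concatenate(period, roi_tiles, fetch):
    # step 616: concatenation follows the request order above
    return b"".join(fetch(u) for u in roi_segment_urls(period, roi_tiles))

print(roi_segment_urls(1, ["a", "c"]))
# ['base_1.mp4', 'tile_a_1.mp4', 'tile_c_1.mp4', 'composite_1.mp4']
```

Concatenating the segments in this order reproduces the layout of Figure 5, where the composite track can reference data from every segment received before it.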
Step 616 is described in detail with reference to Figure 7.
Figure 7 is a flow chart illustrating the generation of a valid timed media data bit-stream from concatenated media segments received by a client device, that is to say the generation of a decodable bit-stream from selected tiles and one composite track when the composite track is played by the client device.
In a first step (step 700), the client device receives the media segment files that have been previously requested (e.g. steps 610, 612, and 614 in Figure 6) and a test is performed to determine whether or not at least one media segment file has been received (step 702). If no media segment file has been received, the process ends.
If at least one media segment file has been received and if the received media segment files do not contain a composite track (i.e. they contain dependent tracks, e.g. lower level layer or tile tracks), they are buffered for later usage.
On the contrary, if at least one media segment has been received and if the received media segment file contains a composite track, the bit-stream (typically NAL units) corresponding to the composite track is parsed to extract the first item of data (or the next item of data if at least one item of data of the received media segment has been processed, typically a NAL unit) from the media data box "mdat" of the received media segment (step 704). If there is no further item of data to parse in the media data box, the process returns to step 700 to receive new media segment files (step 706).
Next, a test is performed to determine whether or not the extracted item of data (e.g. extracted NAL unit) corresponds to an extractor (step 708). If the extracted item of data does not correspond to an extractor, it is returned as is to be further decoded by a video decoder (step 710). On the contrary, if the extracted item of data is an extractor, it must be replaced by the item of data it is referencing. To that end, the values of the extractor's parameters are obtained from its structure (step 712). As described above, an extractor comprises all the parameter values required to extract data from another track (e.g., track_ref_index, sample_offset, data_offset, and data_length).
Once the identifier of the referenced track has been identified, a test is performed (step 714) to determine whether or not the referenced track is available in the set of media segment files buffered during step 700. It is to be recalled that some tile tracks are missing since the client device downloads only the media segment files corresponding to the selected Region-of-Interest.
If the referenced track is available in the set of media segment files buffered during step 700, the extractor is replaced by the data it is referencing (step 716) and the bit-stream is sent to a video decoder to be decoded (step 710).
If the referenced track is not available in the set of media segment files buffered during step 700, specific steps have to be performed since the absence of data referenced in an extractor leads to a fatal error according to the ISO BMFF standard. A test is performed to determine whether or not the referenced track is a tile track (the referenced track can correspond to a dependent scalability layer) and whether or not the extractor is of the tile type (step 718).
If the referenced track is not a tile track or if the extractor is not of the tile type, a standard fatal error is detected. On the contrary, if the referenced track is a tile track and if the extractor is of the tile type, the extractor is removed (step 722) or the extractor is replaced by padding from an alternative "padding track" or "padding box" containing "skipped" data for the missing tiles (step 724), depending on the coding format used to encode the timed media data bit-stream (step 720). Here, "skipped" data represent pixel data missing in a current image that are replaced by other pixel data obtained from a previously decoded image either belonging to the same scalable layer or belonging to another scalable layer. Skipped data are generally represented by at least one flag. For example, when considering the HEVC video compression format, the padding data can be one or more NALUs that exclusively contain coding units encoded with a skip flag set to 1.
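The resolution logic of steps 708 to 724 can be summarized in a short sketch. The dictionary form of an extractor and the padding value below are assumptions for illustration; the branches mirror the tests described above:

```python
# Illustrative resolution of one item of data (steps 708-724). The
# dictionary form of an extractor and the padding value are assumptions;
# the branches mirror the tests described in the text.

class FatalError(Exception):
    """Missing data referenced by a non-tile extractor (standard error)."""

def resolve(item, buffered_tracks, use_padding, padding=b"SKIP"):
    if not (isinstance(item, dict) and item.get("kind") == "extractor"):
        return item                                # step 710: plain NAL unit
    track = buffered_tracks.get(item["track_id"])  # steps 712-714
    if track is not None:
        return track[item["sample"]]               # step 716: referenced data
    if item.get("ref_type") != "tile":             # step 718
        raise FatalError("referenced non-tile track is missing")
    # missing tile: drop the extractor (722) or substitute padding (724)
    return padding if use_padding else None

ext = {"kind": "extractor", "track_id": 2, "sample": 0, "ref_type": "tile"}
print(resolve(ext, {2: [b"NALU"]}, use_padding=False))  # b'NALU'
print(resolve(ext, {}, use_padding=True))               # b'SKIP'
```

Whether the missing tile is dropped or padded (the `use_padding` flag here) corresponds to the choice made at step 720 according to the coding format.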
Next, the bit-stream is transmitted to a video decoder to be decoded (step 710) and displayed and the process loops to step 704 to handle a following item of data.
Figure 8 represents a block diagram of a server or a client device 800 in which steps of one or more embodiments may be implemented.
Preferably, the device 800 comprises a communication bus 802, a central processing unit (CPU) 804 capable of executing instructions from program ROM 806 on powering up of the device, and instructions relating to a software application from main memory 808 after the powering up. The main memory 808 is for example of Random Access Memory (RAM) type which functions as a working area of CPU 804 via the communication bus 802, and the memory capacity thereof can be expanded by an optional RAM connected to an expansion port (not illustrated). Instructions relating to the software application may be loaded to the main memory 808 from a hard disk (HD) 810 or the program ROM 806, for example. Such a software application, when executed by the CPU 804, causes the steps described with reference to Figure 6a to be performed in the server and the steps described with reference to Figures 6b and 7 to be performed in the client device.
Reference numeral 812 is a network interface that allows the connection of the device 800 to the communication network 814. The software application, when executed by the CPU 804, is adapted to react to requests received through the network interface and to provide data streams and requests via the network to other devices.
Reference numeral 816 represents user interfaces to display information to, and/or receive inputs from, a user.
It should be pointed out here that, as a variant, the device 800 for managing the reception or sending of multimedia bit-streams can consist of one or more dedicated integrated circuits (ASIC) that are capable of implementing the method as described with reference to Figures 6a, 6b and 7. These integrated circuits are for example, and non-restrictively, integrated into an apparatus for generating or displaying video sequences and/or for listening to audio sequences.
Naturally, in order to satisfy local and specific requirements, a person skilled in the art may apply to the solution described above many modifications and alterations all of which, however, are included within the scope of protection of the invention as defined by the following claims.

Claims (55)

  1. 1. A method of encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the method comprising: selecting at least one spatial subsample from amongst the plurality of spatial subsamples of one of the timed samples; for each selected spatial subsample, creating one tile track comprising the selected spatial subsample and one corresponding spatial subsample of each of the other timed samples; creating at least one composite track comprising at least one extractor identifying at least one of the created tile tracks; and independently encapsulating each created track in at least one media segment file.
  2. 2. The method according to claim 1 wherein the at least one extractor further identifies at least one spatial subsample of the at least one identified tile track.
  3. 3. The method according to claim 1 further comprising a step of creating an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track.
  4. 4. The method of claim 3 wherein the initialization segment file comprises references to the created tile tracks.
  5. 5. The method of claim 3 wherein the initialization segment file further comprises a list of referenced tile tracks associated with the at least one created composite track.
  6. 6. The method of claim 3 wherein the initialization segment file further comprises a tile handler associated with each of the created tile tracks indicating that the track contains spatial subsample information.
  7. 7. The method of claim 6 wherein the initialization segment file further comprises general presentation information associated with each of the created tile tracks.
  8. 8. The method according to claim 1 wherein at least two tile tracks are created in the step of creating one tile track for each selected spatial subsample, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples comprised within the same timed sample.
  9. 9. The method according to claim 1 wherein the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, and wherein tile tracks created in the step of creating one tile track for each selected spatial subsample are enhancement tile tracks comprising enhancement spatial subsamples, the method further comprising a step of creating at least one reference track comprising reference spatial subsamples.
  10. 10. The method according to claim 9 wherein the reference layer comprises a plurality of reference spatial subsamples and wherein a plurality of reference tracks are created, each reference track of the plurality of reference tracks forming a reference tile track.
  11. 11. The method of claim 1 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing metadata including definitions of the corresponding track.
  12. 12. The method of claim 11 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing spatial subsamples.
  13. 13. The method of claim 11 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing extractors.
  14. 14. The method of claim 1 wherein the server is compatible with the HyperText Transfer Protocol (HTTP).
  15. 15. The method of claim 14 wherein the media segment files resulting from encapsulating each of the created tracks are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Standard Organization.
  16. 16. A method of providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the at least one composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, the method comprising: selecting an item of information representative of a spatial subsample; requesting the at least one tile track comprising the selected spatial subsample, the selected spatial subsample corresponding to the selected item of information; receiving the plurality of media segment files; and replacing extractors of the at least one composite track by corresponding identified spatial subsamples to generate the timed media data bit-stream.
  17. 17. The method of claim 16 that further comprises: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, removing the corresponding extractor to generate the timed media data bit-stream.
  18. 18. The method of claim 17 further comprising a step of verifying that the track corresponding to the obtained track reference is a track of tile type.
  19. 19. The method of claim 16 that further comprises: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, replacing the corresponding extractor by padding to generate the timed media data bit-stream.
  20. 20. The method of claim 19 further comprising a step of verifying that the track corresponding to the obtained track reference is a track of tile type.
  21. 21. The method according to claim 16 further comprising a step of receiving an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track.
  22. 22. The method according to claim 16 wherein at least two media segment files representative of at least two tile tracks are received, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples belonging to the same set of spatial subsamples.
  23. 23. The method according to claim 16 wherein the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, and wherein at least two media segment files representative of at least two tracks are received, one of the at least two received tracks being a reference track comprising reference spatial subsamples and the other of the at least two received tracks being an enhancement tile track comprising enhancement spatial subsamples.
  24. 24. The method according to claim 23 wherein the reference layer comprises a plurality of reference spatial subsamples and wherein a plurality of reference tracks, each forming a reference tile track, are received.
  25. 25. The method of claim 16 wherein the client device is compatible with the HyperText Transfer Protocol (HTTP).
  26. 26. The method of claim 25 wherein the received media segment files are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Standard Organization.
  27. 27. A computer program product for a programmable apparatus, the computer program product comprising instructions for carrying out each step of the method according to any one of claims 1 to 26 when the program is loaded and executed by a programmable apparatus.
  28. 28. A computer-readable storage medium storing instructions of a computer program for implementing the method according to any one of claims 1 to 26.
  29. 29. A device for encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the device comprising at least one microprocessor configured for carrying out the steps of: selecting at least one spatial subsample from amongst the plurality of spatial subsamples of one of the timed samples; for each selected spatial subsample, creating one tile track comprising the selected spatial subsample and one corresponding spatial subsample of each of the other timed samples; creating at least one composite track comprising at least one extractor identifying at least one of the created tile tracks; and independently encapsulating each created track in at least one media segment file.
  30. 30. The device according to claim 29 wherein the microprocessor is further configured for carrying out the step of creating an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track, the initialization segment file comprising references to the created tile tracks.
  31. 31. The device of claim 30 wherein the initialization segment file further comprises a list of referenced tile tracks associated with the at least one created composite track.
  32. 32. The device of claim 30 wherein the initialization segment file further comprises a tile handler associated with each of the created tile tracks indicating that the track contains spatial subsample information.
  33. 33. The device of claim 32 wherein the initialization segment file further comprises general presentation information associated with each of the created tile tracks.
  34. 34. The device according to claim 29 wherein the microprocessor is further configured so that at least two tile tracks are created in the step of creating one tile track for each selected spatial subsample, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples comprised within the same timed sample.
  35. 35. The device according to claim 29 wherein the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, the microprocessor being further configured so that tile tracks created in the step of creating one tile track for each selected spatial subsample are enhancement tile tracks comprising enhancement spatial subsamples, the microprocessor being further configured for carrying out the step of creating at least one reference track comprising reference spatial subsamples.
  36. 36. The device according to claim 35 wherein the reference layer comprises a plurality of reference spatial subsamples and wherein the microprocessor is further configured so that a plurality of reference tracks are created, each reference track of the plurality of reference tracks forming a reference tile track.
  37. 37. The device of claim 29 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing metadata including definitions of the corresponding track.
  38. 38. The device of claim 37 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing spatial subsamples.
  39. 39. The device of claim 37 wherein media segment files resulting from encapsulating each of the created tracks comprise a data structure for storing extractors.
  40. 40. The device of claim 29 wherein the server is compatible with the HyperText Transfer Protocol (HTTP) and wherein the media segment files resulting from encapsulating each of the created tracks are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Standard Organization.
  41. 41. A device for providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the at least one composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, the device comprising at least one microprocessor configured for carrying out the steps of: selecting an item of information representative of a spatial subsample; requesting the at least one tile track comprising the selected spatial subsample, the selected spatial subsample corresponding to the selected item of information; receiving the plurality of media segment files; and replacing extractors of the at least one composite track by corresponding identified spatial subsamples to generate the timed media data bit-stream.
  42. 42. The device of claim 41 wherein the microprocessor is further configured for carrying out the steps of: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, removing the corresponding extractor to generate the timed media data bit-stream.
  43. 43. The device of claim 42 wherein the microprocessor is further configured for carrying out the step of verifying that the track corresponding to the obtained track reference is a track of tile type.
  44. 44. The device of claim 41 wherein the microprocessor is further configured for carrying out the steps of: obtaining a track reference from an extractor of the at least one composite track; checking whether or not the track corresponding to the obtained track reference has been received; and if the track corresponding to the obtained track reference has not been received, replacing the corresponding extractor by padding to generate the timed media data bit-stream.
  45. 45. The device of claim 44 wherein the microprocessor is further configured for carrying out the step of verifying that the track corresponding to the obtained track reference is a track of tile type.
  46. 46. The device according to claim 41 wherein the microprocessor is further configured for carrying out the step of receiving an initialization segment file comprising parameters allowing formation of a spatial portion of the tiled timed media data from the created tile tracks and the created at least one composite track.
  47. 47. The device according to claim 41 wherein the microprocessor is further configured so that at least two media segment files representative of at least two tile tracks are received, the at least one composite track comprising at least one item of data that is to be used to decode any of two spatial subsamples belonging to the same set of spatial subsamples.
  48. 48. The device according to claim 41 wherein the tiled timed media data are scalable tiled timed media data according to which each of the timed samples comprises a reference layer and at least one enhancement layer, the reference layer comprising at least one reference spatial subsample and the at least one enhancement layer comprising a plurality of enhancement spatial subsamples, and wherein the microprocessor is further configured so that at least two media segment files representative of at least two tracks are received, one of the at least two received tracks being a reference track comprising reference spatial subsamples and the other of the at least two received tracks being an enhancement tile track comprising enhancement spatial subsamples.
  49. 49. The device of claim 41 wherein the client device is compatible with the HyperText Transfer Protocol (HTTP) and wherein the received media segment files are compatible with the Base Media File Format and the Dynamic Adaptive Streaming over HTTP format as defined by the International Standard Organization.
  50. 50. A video encoder comprising the device according to any one of claims 29 to 40.
  51. 51. A video decoder comprising the device according to any one of claims 41 to 49.
  52. 52. A method of encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, substantially as hereinbefore described with reference to, and as shown in Figure 6a.
  53. 53. A method of providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the at least one composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, substantially as hereinbefore described with reference to, and as shown in Figures 6b and 7.
  54. 54. A device for encapsulating tiled timed media data in a server, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, substantially as hereinbefore described with reference to, and as shown in Figure 8.
  55. 55. A device for providing a timed media data bit-stream from tiled timed media data encapsulated in a plurality of media segment files, in a client device, the tiled timed media data comprising timed samples, each timed sample comprising a plurality of spatial subsamples, the plurality of media segment files comprising at least one composite track and at least one tile track, each of the at least one tile track comprising one spatial subsample selected from amongst the plurality of spatial subsamples of one of the timed samples and comprising one corresponding spatial subsample of each of the other timed samples, the at least one composite track comprising at least one extractor identifying at least one tile track and identifying at least one spatial subsample of the identified tile track, substantially as hereinbefore described with reference to, and as shown in Figure 8.
GB1300953.5A 2013-01-18 2013-01-18 Method, device and computer program for efficient encapsulation of timed tiled media data Active GB2509956B (en)

Priority Applications (19)

Application Number Priority Date Filing Date Title
GB1300953.5A GB2509956B (en) 2013-01-18 2013-01-18 Method, device and computer program for efficient encapsulation of timed tiled media data
KR1020177016056A KR101861979B1 (en) 2013-01-18 2014-01-17 Generation method and device of media file comprising media data region and metadata region
RU2017112040A RU2654051C1 (en) 2013-01-18 2014-01-17 Method, device and software for segmented synchronized multimedia data encapsulating
CN201811096244.9A CN109587573B (en) 2013-01-18 2014-01-17 Generation apparatus and method, display apparatus and method, and storage medium
BR112015017192-3A BR112015017192B1 (en) 2013-01-18 2014-01-17 METHOD AND DEVICE FOR ENCAPSULATING PARTITIONED PROGRAMMED MEDIA DATA, STORAGE MEDIUM AND RECORDING MEDIA
CN201811096243.4A CN109618235B (en) 2013-01-18 2014-01-17 Generation apparatus and method, processing apparatus and method, and storage medium
KR1020157019368A KR101748779B1 (en) 2013-01-18 2014-01-17 Method, device, and computer program for encapsulating partitioned timed media data
RU2015134542A RU2616185C2 (en) 2013-01-18 2014-01-17 Method, device and software for segmented synchronized multimedia data encapsulating
KR1020187013778A KR101965374B1 (en) 2013-01-18 2014-01-17 Method and device for reproducing video data
CN201480016610.9A CN105052167B (en) 2013-01-18 2014-01-17 Method, apparatus and computer readable storage medium for encapsulating subregion timed media data
PCT/EP2014/050952 WO2014111547A1 (en) 2013-01-18 2014-01-17 Method, device, and computer program for encapsulating partitioned timed media data
US14/761,269 US9591383B2 (en) 2013-01-18 2014-01-17 Method, device, and computer program for encapsulating partitioned timed media data
JP2015553102A JP6968516B2 (en) 2013-01-18 2014-01-17 Generator, generator, provider, provider, and program
EP14700754.6A EP2946566B1 (en) 2013-01-18 2014-01-17 Method, device, and computer program for encapsulating partitioned timed media data
US15/415,738 US10405063B2 (en) 2013-01-18 2017-01-25 Method, device, and computer program for encapsulating partitioned timed media data
RU2018112727A RU2689140C1 (en) 2013-01-18 2018-04-10 Method, device and computer program for encapsulating segmented synchronized multimedia data
US16/510,689 US11178470B2 (en) 2013-01-18 2019-07-12 Method, device, and computer program for encapsulating partitioned timed media data
JP2019182457A JP2020022182A (en) 2013-01-18 2019-10-02 Media file generation method and media file generation device
US17/501,859 US20220038793A1 (en) 2013-01-18 2021-10-14 Method, device, and computer program for encapsulating partitioned timed media data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1300953.5A GB2509956B (en) 2013-01-18 2013-01-18 Method, device and computer program for efficient encapsulation of timed tiled media data

Publications (3)

Publication Number Publication Date
GB201300953D0 GB201300953D0 (en) 2013-03-06
GB2509956A true GB2509956A (en) 2014-07-23
GB2509956B GB2509956B (en) 2016-05-04

Family

ID=47843576

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1300953.5A Active GB2509956B (en) 2013-01-18 2013-01-18 Method, device and computer program for efficient encapsulation of timed tiled media data

Country Status (1)

Country Link
GB (1) GB2509956B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2567485B (en) * 2017-10-13 2020-07-29 Canon Kk Method and device for exchanging data between a web application and an associated web engine

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004004363A1 (en) * 2002-06-28 2004-01-08 Sharp Kabushiki Kaisha Image encoding device, image transmission device, and image pickup device
WO2007015817A2 (en) * 2005-08-01 2007-02-08 Covi Technologies, Inc. Systems and methods for providing high-resolution regions-of-interest
WO2011125051A1 (en) * 2010-04-09 2011-10-13 Canon Kabushiki Kaisha Method for accessing a spatio-temporal part of a compressed video sequence
US20110305278A1 (en) * 2010-04-28 2011-12-15 Canon Kabushiki Kaisha Method of accessing a spatio-temporal part of a video sequence of images

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2719190B8 (en) * 2011-06-08 2017-11-22 Koninklijke KPN N.V. Spatially-segmented content delivery

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9591383B2 (en) 2013-01-18 2017-03-07 Canon Kabushiki Kaisha Method, device, and computer program for encapsulating partitioned timed media data
GB2551296A (en) * 2013-04-09 2017-12-13 Canon Kk Method, device, and computer program for encapsulating partitioned timed media data
GB2512880B (en) * 2013-04-09 2017-12-20 Canon Kk Method, device, and computer program for encapsulating partitioned timed media data
GB2551296B (en) * 2013-04-09 2018-02-28 Canon Kk Method, device, and computer program for encapsulating partitioned timed media data
US20160150241A1 (en) * 2013-07-22 2016-05-26 Sony Corporation Information processing apparatus and method
US10419801B2 (en) * 2013-07-22 2019-09-17 Sony Corporation Information processing apparatus and method
GB2602644A (en) * 2021-01-06 2022-07-13 Canon Kk Method, device, and computer program for encapsulating timed media content data in a single track of encapsulated media content data

Also Published As

Publication number Publication date
GB201300953D0 (en) 2013-03-06
GB2509956B (en) 2016-05-04

Similar Documents

Publication Publication Date Title
US11178470B2 (en) Method, device, and computer program for encapsulating partitioned timed media data
US10645428B2 (en) Method, device, and computer program for encapsulating partitioned timed media data using a generic signaling for coding dependencies
US10547914B2 (en) Method, device, and computer program for encapsulating partitioned timed media data using sub-track feature
JP6743059B2 (en) Method, device, and computer program capable of dynamically setting a motion origin descriptor for obtaining media data and metadata from an encapsulated bitstream
GB2551296A (en) Method, device, and computer program for encapsulating partitioned timed media data
GB2546027A (en) Method, device, and computer program for encapsulating partitioned timed media data
GB2509956A (en) Encapsulating tiled timed media data, applicable to scalable coding