GB2602642A - Method and apparatus for encapsulating uncompressed video data into a file - Google Patents


Info

Publication number
GB2602642A
GB2602642A
Authority
GB
United Kingdom
Prior art keywords
component
description information
sample
value
video sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2100151.6A
Other versions
GB202100151D0
Inventor
Ouedraogo Naël
Denoual Franck
Maze Frédéric
Ruellan Hervé
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB2100151.6A (GB2602642A)
Publication of GB202100151D0
Priority to GBGB2109380.2A (GB202109380D0)
Priority to GBGB2113868.0A (GB202113868D0)
Priority to GB2310642.0A (GB2617048A)
Priority to GB2115960.3A (GB2602714B)
Priority to PCT/EP2022/050035 (WO2022148729A1)
Priority to US18/260,523 (US20240064349A1)
Publication of GB2602642A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8458 Structuring of content, e.g. decomposing content into time segments involving uncompressed content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/85406 Content authoring involving a specific file format, e.g. MP4 format

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Processing Or Creating Images (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Encapsulation of an uncompressed video sequence (‘uncv’), comprising uncompressed samples (510), in a media file (505): generating ‘generic description information’ (500), describing the video data and indicating that the video sequence is uncompressed; generating ‘sample description information’ (501), indicating a number of components used for encoding pixel information for at least one sample, and comprising, for at least one component, a component representation information indicating a pixel representation of the component; and embedding the ‘generic description information’ (500), the ‘sample description information’ (501), and the uncompressed video sequence (‘uncv’) in the media file (505). Decapsulation of a media file (505) comprising an uncompressed video sequence (‘uncv’), comprising uncompressed samples (510): obtaining ‘generic description information’ (500), describing the video data and indicating that the video sequence is uncompressed; obtaining ‘sample description information’ (501), indicating a number of components used for encoding pixel information for at least one sample, and comprising, for at least one component, a component representation information indicating a pixel representation of the component; and reading from the media file (505) the uncompressed video sequence (‘uncv’) based on the ‘generic description information’ (500) and the ‘sample description information’ (501).

Description

METHOD AND APPARATUS FOR ENCAPSULATING UNCOMPRESSED VIDEO DATA INTO A FILE
FIELD OF THE INVENTION
The present disclosure concerns a method and a device for encapsulating uncompressed video data into a file. More particularly, it concerns the definition of descriptive metadata describing the uncompressed video data.
BACKGROUND OF INVENTION
Uncompressed video sequences can be organized in a huge variety of formats. A video sequence is typically a sequence of images, also called frames or samples. Each frame is an image typically defined as an array of elementary points called pixels. A frame is composed of one or several components. Each component gives some information on the values of the pixels of the frame. Each component is composed of a set of component values. A component may have a number of component values corresponding to the number of pixels in the frame. In some cases, a component may have a number of component values different from the number of pixels in the frame. For example, a YUV frame may have a first component, corresponding to Y, having one luminance value for each pixel of the frame. The YUV frame may have two further components, corresponding to U and V, having respectively one chrominance U value and one chrominance V value for each block of four pixels in the frame. In this example, the component Y has a number of component values, corresponding to the number of pixels, four times greater than the number of component values of the components U and V. Some components may provide auxiliary data not directly related to the pixel values of the frame. For example, some components may provide alpha values relative to a transparency associated with the pixels. Other components may provide depth information relative to the distance of the object from the camera for each pixel position. These auxiliary components may have a number of component values corresponding to the number of pixels in the frame, but this may not be the case.
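To illustrate this ratio, a minimal sketch in Python (a hypothetical 640x480 frame with 2x2 chroma subsampling; the variable names are illustrative, not part of any described format):

    width, height = 640, 480
    y_values = width * height                # one luminance value per pixel: 307200
    u_values = (width // 2) * (height // 2)  # one U value per 2x2 pixel block: 76800
    v_values = u_values                      # likewise for V: 76800
    assert y_values == 4 * u_values          # Y carries four times as many values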
The component values may be encoded according to various binary formats. While the color components typically have component values encoded as unsigned integers, the auxiliary components may have component values encoded in any type of binary encoding like, for example, signed integers, floating point values, complex values, and so on. Even considering color components encoded as unsigned integers, the number of bits used for encoding each value may differ, corresponding to different bitdepths for the different components.
The different formats of uncompressed video sequences may also differ in the way they organize the different component values in sequence for a given frame. A first example consists in interleaving the different component values corresponding to a same pixel. For example, if an RGB frame is composed of three color components, R, G, and B, the component values may be organized per pixel as (R1, G1, B1), (R2, G2, B2), ..., (Rn, Gn, Bn). This type of organization is called "packed". The same RGB frame may be organized by component. In this example, all the R values are followed by all the G values, which are followed by all the B values, as R1, R2, ..., Rn, G1, G2, ..., Gn, B1, B2, ..., Bn. This type of organization is called "planar". Hybrid organizations may be contemplated where some components use a packed organization, while others use a planar organization for a same frame. These two types of organization are called the pixel representation of the component.
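The two organizations can be made concrete with a short Python sketch (a hypothetical three-pixel, 8-bit RGB frame; names are illustrative):

    pixels = [(10, 20, 30), (11, 21, 31), (12, 22, 32)]  # (R, G, B) per pixel

    # Packed: values of a same pixel are interleaved: R1 G1 B1 R2 G2 B2 ...
    packed = bytes(value for pixel in pixels for value in pixel)

    # Planar: all R values, then all G values, then all B values
    planar = bytes([p[0] for p in pixels]
                   + [p[1] for p in pixels]
                   + [p[2] for p in pixels])

    assert packed == bytes([10, 20, 30, 11, 21, 31, 12, 22, 32])
    assert planar == bytes([10, 11, 12, 20, 21, 22, 30, 31, 32])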
Known file formats used to encapsulate uncompressed video sequences are based on a priori knowledge of a set of uncompressed video sequence formats. Each format defines the number and types of components, whether they use a packed or planar organization, the binary encoding of the component values and their number, and so on. The uncompressed video sequence format is identified with an identifier typically introduced in a header of the file. In order to parse the file, a parser must know the uncompressed video sequence format corresponding to this identifier and parse the file accordingly.
Besides the obvious reading, parsing and rendering of the video sequence, a parser may be expected to be capable of basic manipulations on an uncompressed video sequence file. These basic manipulations comprise temporal sub-bitstream extraction, namely the extraction of a sub-sequence of frames between two temporal values. They also comprise the extraction of some components of the uncompressed video sequence, or the extraction of a spatial region of interest in the uncompressed video sequence. Advantageously, the encapsulation of an uncompressed video sequence should ease these manipulations by a parser.
SUMMARY OF THE INVENTION
The present invention has been devised to address one or more of the foregoing concerns.
According to a first aspect of the invention there is provided a method of encapsulating an uncompressed video sequence in a media file, the uncompressed video sequence comprising uncompressed samples, wherein the method comprises: generating generic description information describing the video data and indicating that the video sequence is uncompressed; generating sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and embedding the generic description information, the sample description information, and the uncompressed video sequence in the media file.
In an embodiment, the sample description information further comprises a number of component values for at least one component.
In an embodiment, the sample description information further comprises a component description information for describing at least a component, the component description information comprising at least a number of component values and a component value length for the component.
In an embodiment, the component representation information and the component description information in the sample description information are provided for each component.
In an embodiment, the sample description information comprises, for each component, an indication of whether the component is the first component of a packing set, a packing set being a set of consecutive components with a pixel representation corresponding to a packed pixel representation.
In an embodiment, the component description information further comprises: the component representation information as an indication of whether the component has a packed pixel representation; and, if the component has a packed pixel representation, one or more offset values between two consecutive component values.
In an embodiment, the component description information further comprises a vertical and a horizontal sampling rate indicating the sampling rate of component values relative to the width and height of the sample for the component.
In an embodiment, the sample description box further comprises: a prefix description information comprising an indication of whether the sample data comprises a prefix and, if so, the size of the prefix; and a suffix description information comprising an indication of whether the sample data comprises a suffix and, if so, the size of the suffix.
In an embodiment, the prefix description information and the suffix description information further comprise an indication of whether the prefix data, respectively the suffix data, should be skipped by a parser extracting the component.
In an embodiment, the component description information further comprises: a prefix description information comprising an indication of whether the component data comprises a prefix and, if so, the size of the prefix; and a suffix description information comprising an indication of whether the component data comprises a suffix and, if so, the size of the suffix.
In an embodiment, the prefix description information and the suffix description information further comprise an indication of whether the prefix data, respectively the suffix data, should be skipped by a parser extracting the component.
In an embodiment, the generic description information further comprises for each component an indication whether the component is a coloured component.
In an embodiment, the generic description information further comprises for each component an indication whether the component is an essential component.
In an embodiment, the media file is compliant with the ISOBMFF standard, and wherein the generic description information is stored as a SampleEntry box or as a SampleGroupDescriptionEntry box.
According to another aspect of the invention there is provided a method of reading a media file comprising an uncompressed video sequence, the uncompressed video sequence comprising uncompressed samples, wherein the method comprises: obtaining from the media file a generic description information describing the video data and indicating that the video sequence is uncompressed; obtaining from the media file a sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and reading from the media file the uncompressed video sequence based on the generic description information and the sample description information.
According to another aspect of the invention there is provided a computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to the invention, when loaded into and executed by the programmable apparatus.
According to another aspect of the invention there is provided a computer-readable storage medium storing instructions of a computer program for implementing a method according to the invention.
According to another aspect of the invention there is provided a computer program which upon execution causes the method of the invention to be performed.
According to another aspect of the invention there is provided a device for encapsulating an uncompressed video sequence in a media file, the uncompressed video sequence comprising uncompressed samples, wherein the device comprises a processor configured for: generating generic description information describing the video data and indicating that the video sequence is uncompressed; generating sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and embedding the generic description information, the sample description information, and the uncompressed video sequence in the media file.
According to another aspect of the invention there is provided a device for reading a media file comprising an uncompressed video sequence, the uncompressed video sequence comprising uncompressed samples, wherein the device comprises a processor configured for: obtaining from the media file a generic description information describing the video data and indicating that the video sequence is uncompressed; obtaining from the media file a sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and reading from the media file the uncompressed video sequence based on the generic description information and the sample description information.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible, non-transitory carrier medium may comprise a storage medium such as a floppy disk, a CD-ROM, a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which: Figures 1a, 1b, and 1c illustrate several examples of bitstreams representing an uncompressed, also called raw, video sequence; Figure 2 illustrates an example of a system for streaming media data from a server to a client; Figure 3 illustrates an example of an encapsulation process according to embodiments of the invention; Figure 4 illustrates an example of the main steps of a parsing process according to an embodiment of the invention; Figure 5 illustrates an example of media file 505 according to an embodiment of the invention; Figure 6 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention.
DETAILED DESCRIPTION OF THE INVENTION
Figures 1a, 1b, and 1c illustrate several examples of bitstreams representing an uncompressed, also called raw, video sequence. An uncompressed video sequence is composed of one or more samples, each representing a particular instant of time. Each of these samples contains one or more components. Typically, each component represents colour information of the pixel data. It can be, for example, grayscale information for monochrome samples, or RGB or YCbCr or YUV components for coloured images; it can also contain other information not necessarily representing pixel information. For instance, one sample may contain alpha (transparency) and/or depth information in separate components of the sample. More generally, the information provided in a component is a set of values that pertains to a particular timing in the uncompressed video information. Another example is timed metadata information describing the pixel samples.
The video sequence of Figure 1a is composed of a series of samples 100, 104 and 108, each comprising a number of components. The sample 100 contains a series of components 101, 102 and 103; the sample 104 contains the components 105, 106 and 107; and finally the sample 108 contains the components 109, 110 and 111. Different bitstream representations of the values contained in each component of each sample are employed in the state of the art. For example, the YUV I420 format codes first all the component values of the Y channel, followed by the values of the Cb component and then by the values of the Cr component. This kind of representation is often referred to as a planar representation: the values for a component are represented as a plane, that is, a set of consecutive bits in memory. For example, Figure 1b illustrates a bitstream 118 with a planar representation of the components of the video sequence of Figure 1a. The first byte range 112 of the bitstream 118 contains the values of the first component 101 of the first sample of the video sequence. The next range of bytes 113 contains the values of the second component 102 of the first frame, followed by the byte range 114, which completes the sample with the values of the component 103. The same bitstream pattern is used for the following sample, and the bitstream ends with the byte ranges 115, 116 and 117, wherein each byte range contains respectively the values of the components 109, 110 and 111 of the last sample of the video sequence.

Another representation of the values of the components in the bitstream is the packed representation. Instead of storing the components inside separate ranges of the bitstream, the values of the components are interleaved in memory. Typically, the bitstream contains first the first value of the first component, followed by the first value of the second component, and so on for all the components and for all the values of the components. As a result, the values corresponding to a pixel (i.e. the components representing one spatial sample) are represented inside a contiguous range in the bitstream. The number of values in each component may differ (typically, the sampling rate of chroma components may be lower than for the luma component). In such a case, the number of component values between two consecutive values of the same component is not necessarily constant. Figure 1c is an example of a bitstream 133 using a packed representation for the components of the video sequence of Figure 1a. This bitstream 133 contains series of bytes 119 to 132, which successively contain a value of each component of the samples of the video sequence. For instance, for the first sample of the video sequence, the series of bytes 119 contains the first value of the component 101, and is followed by the byte series 120 that contains the first value of the component 102; finally, the byte series 121 contains the first value of the component 103. This pattern is repeatedly applied to each value of each component of the video sequence.
Thus, the byte series 123, 124 and 125 contain respectively the second value of the components 101, 102 and 103. The byte series 126 to 132 contain the equivalent information for the last sample of the video sequence.
In both the planar and packed representations, a component value may be stored on a non-integer number of bytes. For example a component value may be stored using 12 bits.
A sample is defined in this document as corresponding to all the data associated with a single time instant. Usually, two samples within a track do not share the same decoding time, and two samples do not share the same composition time. In non-hint tracks, a sample is, for example, an individual frame of video, a series of video frames in decoding order, or a compressed section of audio in decoding order; in hint tracks, a sample defines the formation of one or more streaming packets.
For convenience in the writing of this document, the words image, picture or frame refer to a sample of the video sequence even if the sample contains non-pixel information. Similarly, a pixel is considered as the set of values of all the components corresponding to a specific spatial location in one of the pictures of the video sequence. It comprises component values corresponding to colour information and component values corresponding to metadata (e.g. alpha channel, depth map, etc.).
When encapsulating one or more video sequences in a media file, typically complying with the ISOBMFF standard, the video sequences are typically stored as tracks. This type of media file is composed of hierarchical embedded data structures, also called boxes. Video sequence media data are typically stored in a media data structure like the 'mdat' boxes 118 and 133 in Figures 1b and 1c. The media file typically further comprises descriptive metadata describing the different tracks.
Sample description is a descriptive metadata structure that defines and describes the format of some number of samples in a track.
A sample entry type is a four-character code that is either a format value of a SampleEntry metadata structure directly contained in a SampleDescriptionBox metadata structure, or a data_format value of an OriginalFormatBox.
Figure 2 illustrates an example of a system for streaming media data from a server to a client. As illustrated, a server 200 comprises an encapsulation module 205 connected, via a network interface (not represented), to a communication network 210, to which is also connected, via a network interface (not represented), a de-encapsulation module 215 of a client 220.
The server 200 processes data, e.g. video and/or audio data, for streaming or for storage. To that end, the server 200 obtains or receives data comprising, for example, the recording of a scene by one or more cameras, referred to as a source video. The source video is received by the server as an original sequence of pictures 225. The server encodes the sequence of pictures into media data (i.e. bit-stream) using a media encoder (e.g. video encoder), not represented, and encapsulates the media data in one or more media files or media segments 230 using the encapsulation module 205. The encapsulation module 205 comprises at least one of a writer or a packager to encapsulate the media data. The media encoder may be implemented within encapsulation module 205 to encode received data or may be separated from encapsulation module 205.
The client 220 is used for processing data received from the communication network 210, for example for processing the media file 230. After the received data have been de-encapsulated in the de-encapsulation module 215, also known as a parser, the de-encapsulated data or parsed data, corresponding to a media data bit-stream, are decoded, forming, for example, audio and/or video data that may be stored, displayed or output. The media decoder may be implemented within de-encapsulation module 215 or it may be separate from de-encapsulation module 215. The media decoder may be configured to decode one or more video bit-streams in parallel.
It is noted that media file 230 may be communicated to the de-encapsulation module 215 in different ways. In particular, the encapsulation module 205 may generate the media file 230 with a media description (e.g. DASH MPD) and communicates (or streams) it directly to the de-encapsulation module 215 upon receiving a request from the client 220. The media file 230 may also be downloaded by and stored on the client 220.
For the sake of illustration, the media file 230 may encapsulate media data (e.g. encoded audio or video) into boxes according to the ISO Base Media File Format (ISOBMFF, ISO/IEC 14496-12 and ISO/IEC 14496-15 standards). In such a case, the media file 230 may correspond to one or more media files, indicated by a FileTypeBox ('ftyp'), or one or more segment files, indicated by a SegmentTypeBox ('styp'). According to ISOBMFF, the media file 230 may include two kinds of boxes: a "media data box", identified as 'mdat' or 'imda', containing the media data, and "metadata boxes" (e.g. 'moov' or 'moof') containing metadata defining the location and timing of the media data. In a particular embodiment, the sequence of pictures 225 is encoded according to an uncompressed video format such as the YUV I420 format. The encoding then consists in representing the sensor information in one or more sets of component values with a predefined order. In addition, the encoder may encode an audio source in the media bitstream.
Figure 3 illustrates an example of an encapsulation process according to embodiments of the invention. In a first step 300, the media file writer receives the video bitstream to process, for instance an uncompressed video sequence as represented with reference to Figure 1. The encapsulation process consists in applying a processing loop for each sample of the input video sequence. In particular, it starts by determining the uncompressed video characteristics, or attributes, in a step 301. The characteristics of the uncompressed bitstream comprise information specifying the content of the video sequence, such as the number of samples provided in the bitstream and the uncompressed format of the input video sequence. This information is provided within the bitstream as a predetermined byte sequence at the beginning of the input bitstream, or can be provided as a configuration file of the file writer. In particular, the characteristics of the uncompressed input video comprise the number of components present within each picture and, for each component, the number of coded values. In addition, the coding format of each component is provided, typically planar or packed, as well as the coding pattern of component values used for each sample. Finally, the number of pictures or samples provided in the input bitstream can be provided, or deduced from the length in bits or bytes of the input video sequence.
From the characteristics information of the input bitstream, the file writer deduces the byte or bit addresses of each sample in the bitstream in a step 302. This information is advantageously stored in memory for later use. Similarly, in a step 303, the file writer determines the start address of each component relative to the start address of the sample and stores it in memory. Last, the determination step 304 associates spatial information with component values. Typically, for colour components, the pixel coordinates of each component value are determined. Again, this information is stored in memory.
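A minimal Python sketch of steps 302 and 303 (assuming a contiguous bitstream of fixed-size samples with planar components; all names are hypothetical):

    def sample_offsets(num_samples, sample_size):
        # Step 302: byte address of each sample in the input bitstream
        return [j * sample_size for j in range(num_samples)]

    def component_offsets(component_sizes):
        # Step 303: start address of each component relative to its sample start
        offsets, cursor = [], 0
        for size in component_sizes:
            offsets.append(cursor)
            cursor += size
        return offsets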
Then, the writer generates, in a step 305, information on the structure of the uncompressed video stream, which is signalled in different boxes of the ISOBMFF output file. This information is a generic representation of the main structures of an uncompressed or raw video sequence: typically, the samples, the components and the values of the components. Several alternative embodiments to signal this information at different locations of the bitstream are proposed. Some of these embodiments are described with reference to Figure 5.
Finally, the process encapsulates the bitstream of the uncompressed video sequence in one or more tracks, and the bytes forming the bitstream are stored in an 'mdat' box. Typically, when more than one track is used, the encapsulation process may, for example, encapsulate image components in one track and the remaining components in a different one. According to another example, each component of the video sequence may be described in its own track.
Figure 4 illustrates an example of the main steps of a parsing process according to an embodiment of the invention. According to this embodiment, the media file contains information specifying a generic representation of an uncompressed video sequence coding structure.
As illustrated, a first step 400 is directed to initializing a media player to start reading a media file encapsulated according to some embodiments of the invention.
Next, the media player determines, in a step 401, if the media file content relates to an uncompressed video format, typically by the presence of one or more boxes comprising information that is part of the generic representation of the uncompressed video sequence as generated at step 305 of the file writer process.
During step 402, the player determines the operating point requested for the playout. Typically, the operating point is information specifying the expected samples to be played. In addition, the operating point may select a subset of the components. In yet another alternative, it may also specify a predetermined spatial area of the media sample, i.e. a zone in a picture for colour components. For non-colour components, the component values may also be associated with a spatial portion of the video.
Based on the information determined in steps 401 and 402, the media player can identify the set of component values that should be extracted from the media file data, typically within the 'mdat' container. In a step 403, the file parser determines, from the descriptive metadata boxes that contain information specifying the generic representation of the uncompressed video sequence coding structure, the location, meaning the byte or bit address inside the 'mdat' container, of the requested component values.
The final stage 404 of the media player process consists in forming the reconstructed bitstream that corresponds to the selected operating point. In particular, the reconstructed bitstream may include a subset of the values of the original uncompressed video sequence.

Figure 5 illustrates an example of a media file 505 according to an embodiment of the invention. In this example, the media file complies with the ISOBMFF standard. The media file 505 contains an uncompressed video sequence in an 'mdat' data structure 510, along with associated descriptive metadata in a 'trak' data structure 530. The 'trak' data structure 530 comprises the description of several components of the uncompressed video sequence.
The sample description information 501 includes a generic description 500 of the samples and of the component data of the uncompressed video sequence described by the 'trak' box 530. The generic description information 500 comprises several attributes according to embodiments of the invention. For example, it may contain information that allows a parser to extract a subpart of the bitstream of the uncompressed video data present in the data part of the file, i.e. in the 'mdat' data structure.
According to a first embodiment, the generic description information specifying the main properties of the uncompressed video data is represented by a dedicated box of the ISOBMFF file. The generic description information describes the video data; it indicates that the video sequence is uncompressed. For example, this generic description information may be provided in a Sample Entry box according to the following syntax:

    class UncompressedVideoSampleEntry extends VisualSampleEntry('uncv') {
        UncompressedVideoConfiguration config;
    }

This new UncompressedVideoSampleEntry extends the VisualSampleEntry with the 'uncv' coding name. The presence of this box in the track description indicates to the parser, at step 401 in Figure 4, that the media data of the track corresponds to one or more uncompressed video samples that are described by the included generic representation information.
In this example, the UncompressedVideoSampleEntry comprises a configuration box named UncompressedVideoConfiguration. This box comprises additional information that specifies the coding structure of the uncompressed video sequence. It indicates a number of components used for encoding pixel information for at least one sample. For example, the UncompressedVideoConfiguration may have the following syntax:

    class UncompressedVideoConfiguration extends Box('uncC') {
        utf8string uncompressed_video_format;
        unsigned int(8) component_count;
        unsigned int(1) component_representation;
    }

With the following semantics:

The uncompressed_video_format field specifies the type of the coding format used for the uncompressed video sequence. The purpose of this field is purely informative and it must not be used by the parser to determine the properties of the uncompressed video data format. In other words, the parser is not supposed to use the content of this field to determine how to parse the media data. Typically, the media file parser forwards this element to the application using the video data, to select the appropriate handler to process the bitstream constructed by the parser. For instance, this element may comprise the 'YUV I420' character string, indicating that the sequence is a YUV video sequence using the I420 chroma format. Another example is 'PUVF v2.0', which indicates a proprietary uncompressed video format in a 2.0 version.
The field component_count indicates the number of components in each sample of the video data.
The field component_representation indicates the pixel representation of the components. In this embodiment it is a flag indicating whether the representation is packed or planar. For example, a value '1' indicates that a planar representation is used, and a value '0' indicates that a packed representation is used.
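A minimal parsing sketch for such a box in Python (it assumes the utf8string is null-terminated and that the 1-bit component_representation flag occupies the most significant bit of the byte following component_count, which is an illustrative assumption rather than something mandated by the syntax above):

    def parse_uncC(payload: bytes):
        end = payload.index(b'\x00')                 # utf8string terminator
        video_format = payload[:end].decode('utf-8')
        component_count = payload[end + 1]           # unsigned int(8)
        component_representation = payload[end + 2] >> 7  # 1: planar, 0: packed
        return video_format, component_count, component_representation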
In this embodiment, it is assumed that all the components have the same number of values and that these values have the same bit length. The number of component values is equal to the number of pixels of the sample, which is signaled in the VisualSampleEntry through the indication of the width and height of the sample.
In a variant of this embodiment, the number of values per component may be indicated in an additional syntax element. This variant allows describing an uncompressed video sequence with a sampling of the components coarser than the pixel granularity.
The component_value_length variable is the length in bits of each component value; it is equal to the size of the component (component_size) divided by the number of component values (component_value_count).
As a result, the byte address of a component relative to the address of the first byte of each sample is determined as follows. The length in bytes of the current sample is determined using the metadata inside the ISOBMFF file. For example, it is determined with the help of a SampleSizeBox present in the descriptive metadata of the track. It is assumed in this embodiment that each component has the same length. The length in bits of each component is thus the size of the sample in bits divided by the number of components:

    component_size = sample_size / component_count

The start address of each component depends on the representation format. If a planar representation is used (the component_representation field is equal to 1), the address of the first value of the i-th component, using a 0-based index, is equal to the bit location of the j-th sample (for example defined by a sample_start_offset[j] variable) plus the product of i and component_size:

    component_start_address[i] = sample_start_offset[j] + i * component_size

The parser may determine the bit (or byte) location of the j-th sample of the bitstream, for example, from the TrackRunBoxes in a fragmented file, or from the information in the SampleSizeBox, SampleToChunkBox and ChunkOffsetBox in a non-fragmented file.

The location of the n-th value of the i-th component in planar mode is expressed relative to the address of the first value, since the values of the component are contiguous in memory. The component_value_address[i][n] variable represents this location, wherein n is in the range from 0 to component_value_count-1 and i is in the range from 0 to component_count-1. It is equal to the address of the first value plus n times the length of the component value:

    component_value_address[i][n] = component_start_address[i] + n * component_value_length

If a packed representation is used (the component_representation field is equal to 0), the n-th value of the i-th component is represented by component_value_address[i][n], with n in the range from 0 to component_value_count-1 and i in the range from 0 to component_count-1. It is equal to the bit location of the j-th sample (sample_start_offset[j]) plus the product of i and the length in bits of the component value, plus n multiplied by the length in bits of the component value multiplied by the number of components in the sample:

    component_value_address[i][n] = sample_start_offset[j] + i * component_value_length + n * component_value_length * component_count

The generic information provided in the UncompressedVideoConfiguration thus allows a parser to determine the location of each value of each component of each sample within the uncompressed video sequence. It can then perform sub-bitstream extraction of values of one or more components within a specific time interval from the media file, without a priori knowledge of the coding format in use for the uncompressed video sequence.
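Putting the first embodiment's formulas together, a minimal Python sketch (all quantities in bits; variable names mirror the formulas above):

    def value_address(sample_start, sample_size, component_count,
                      value_count, planar, i, n):
        # Every component has value_count values of equal bit length.
        component_size = sample_size // component_count
        value_length = component_size // value_count
        if planar:   # component_representation == 1
            return sample_start + i * component_size + n * value_length
        # packed (component_representation == 0): pixel values are interleaved
        return sample_start + i * value_length + n * value_length * component_count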
In one alternative, a SampleGroupDescription box specifies the generic information representing the uncompressed video sequence instead of a SampleEntry box. As a result, instead of specifying constant generic information for all the samples of the media file, the information can be dynamic and can vary from one group of samples to another. This makes it possible to support uncompressed formats with different component configurations for different samples of the video sequence.
In another alternative, the uncompressed_video_format field is specified as an unsigned integer. Each coding format is associated with a predefined value. Possibly, some values are reserved for proprietary formats. Possibly, a specific value may be reserved for signaling the presence of a supplementary field allowing proprietary or new coding formats to be described. Possibly, the uncompressed_video_format field may be a URN or URL. Note that these alternatives may also apply to other embodiments of the invention.
This first embodiment allows a very compact representation of the generic information while assuming some constraints on the format of the uncompressed video data. It is assumed that all the components have the same format, same size and same bit length per component value and the same number of values for each component.
In a second embodiment, the generic information is completed to allow describing a different format for each component. For instance, the I420 format uses a different number of component values for the luminance component Y than for the chrominance components Cr and Cb. In this embodiment, the UncompressedVideoConfiguration box is enriched to define a different generic representation for each component. The syntax may be as follows:

    class UncompressedVideoConfiguration extends Box('uncC') {
        utf8string uncompressed_video_format;
        unsigned int(8) component_count;
        unsigned int(1) component_representation;
        for (i = 1; i <= component_count; i++) {
            UncompressedVideoSampleComponent();
        }
    }

This new version of the box provides a signalling loop that indicates a component description information for at least one component, possibly for all components, which comprises at least a number of component values and a component value length for the component. The UncompressedVideoSampleComponent structure is a new ISOBMFF box that represents this information. The order of the components in this signalling is the same as the component order in the bitstream.
In one example, the UncompressedVideoSampleComponent may have the following syntax:

    class UncompressedVideoSampleComponent() {
        unsigned int(32) component_value_count;
        unsigned int(8) component_value_length;
    }

With the following semantics:

The component_value_count field indicates, for the given component, the number of component values in a sample.
The component_value_length indicates the number of bits of a component value.
These two fields allow describing an uncompressed video format where the different components may have different numbers of component values and different binary encodings of the component values. In a variant, the UncompressedVideoSampleComponent structure uses two fields to specify the component value bit length: the first field is the floor of the length value in bits divided by 8, meaning that this field gives a number of bytes. The second field is optional and is the remainder of the division of the length value in bits by 8. For example, the first field is equal to 4 and the second equal to 0 for a length of 32 bits, while the first field is equal to 4 and the second is equal to 3 for a length of 35 bits. This two-field representation can be more compact for large numbers of bits per component.
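The two-field variant amounts to a Euclidean division of the bit length by 8, as a one-line Python sketch shows:

    bit_length = 35
    length_bytes, length_remainder = divmod(bit_length, 8)  # gives (4, 3)
    assert length_bytes * 8 + length_remainder == bit_length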
Based on these two additional fields, in this second embodiment the parser is in a position to determine the address of each component in a planar representation: the total size in bits of the values of a component is equal to the number of its values, as provided by component_value_count, multiplied by the length in bits of the component values, as provided by component_value_length:
    component_size = component_value_count * component_value_length

This component_size value may be different for each component of the sample: the component_size[i] variable thus represents the value of component_size for the i-th component that is computed from the values of the i-th UncompressedVideoSampleComponent structure within the UncompressedVideoConfiguration box. Similarly, the variables component_value_count[i] and component_value_length[i] correspond to the values of component_value_count and component_value_length of the i-th UncompressedVideoSampleComponent structure within the UncompressedVideoConfiguration box.
This embodiment concerns in particular video sequences for which all the components use a planar representation or all use a packed representation. In other words, there is no component that uses a planar representation while other components use a packed representation. The component location can be determined as follows by the media file parser.
The start address of each component depends on the representation format. If a planar representation is used, the component_size_cumul[i] variable is computed as the sum of the sizes of the components that are represented prior to the i-th component:

    component_size_cumul[0] = 0
    for (i = 1; i < component_count; i++) {
        component_size_cumul[i] = component_size_cumul[i-1] + component_size[i-1]
    }

The first value of the i-th component in planar representation is equal to the bit location of the j-th sample (sample_start_offset[j]) plus the sum of the sizes of the components that are prior to the i-th component (component_size_cumul[i]):

    component_start_address[i] = sample_start_offset[j] + component_size_cumul[i]

The location of the n-th value of the i-th component is the component_value_address[i][n] variable, wherein n is in the range from 0 to component_value_count-1 and i is in the range from 0 to component_count-1. In planar mode, this location is relative to the address of the first value since the values of a component are contiguous in memory. It is equal to the address of the first value plus n times the length of the component value:
    component_value_address[i][n] = component_start_address[i] + n * component_value_length[i]

If a packed representation is used, the component_stride variable is the length in bits taken by the set of packed component values for one pixel. It is equal to the sum of the lengths of component values. The component_value_length_cumul[i] variable is the sum of the lengths of the values of the components that have an index lower than the i-th component:
    component_stride = component_value_length[0]
    component_value_length_cumul[0] = 0
    for (i = 1; i < component_count; i++) {
        component_stride += component_value_length[i]
        component_value_length_cumul[i] = component_value_length_cumul[i-1] + component_value_length[i-1]
    }

Therefore, the address of the n-th value of the i-th component is the component_value_address[i][n] variable, wherein n is in the range from 0 to component_value_count-1 and i is in the range from 0 to component_count-1. It is equal to the bit location of the j-th sample (sample_start_offset[j]) plus the sum of the lengths of the values of the components that have an index lower than the i-th component (component_value_length_cumul[i]), plus n multiplied by the length in bits taken by the set of packed component values for one pixel:
    component_value_address[i][n] = sample_start_offset[j] + component_value_length_cumul[i] + n * component_stride
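A compact Python sketch of this second embodiment's derivation (per-component value counts and bit lengths as inputs; all offsets in bits; names are illustrative):

    def value_address(sample_start, value_counts, value_lengths, planar, i, n):
        if planar:
            # component i starts after all the values of components 0..i-1
            size_cumul = sum(c * l for c, l in
                             zip(value_counts[:i], value_lengths[:i]))
            return sample_start + size_cumul + n * value_lengths[i]
        # packed: one stride per pixel, component i offset within the stride
        stride = sum(value_lengths)
        return sample_start + sum(value_lengths[:i]) + n * stride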
In a third embodiment, the information describing a generic representation is modified to allow specifying different configurations for each component, with possibly different pixel representations. This embodiment addresses uncompressed video sequences where components in planar and packed representations may be mixed without any constraint.

According to a first variant of the third embodiment, a syntax element is provided in the UncompressedVideoConfiguration box indicating the pixel representation of each component. The syntax of the UncompressedVideoConfiguration may be:

    class UncompressedVideoConfiguration extends Box('uncC') {
        utf8string uncompressed_video_format;
        unsigned int(8) component_count;
        for (i = 1; i <= component_count; i++) {
            UncompressedVideoSampleComponent();
            unsigned int(1) component_packing_type;
            bit(7) reserved;
        }
    }

Where the syntax element component_packing_type is a flag indicating whether the pixel representation of the component is planar or packed. For example, the value 0 indicates a planar pixel representation, and the value 1 indicates a packed pixel representation. As a packed pixel representation means that the component is interleaved with at least another consecutive component, when a component is indicated as packed, the previous one or the next one must also be indicated as packed.
Hereinafter, the set of consecutive components indicated as packed is called a packing set.
The UncompressedVideoSampleComponent contains the following syntax elements with the same semantics as defined in previous embodiments.
    class UncompressedVideoSampleComponent() {
        unsigned int(32) component_value_count;
        unsigned int(8) component_value_length;
    }

All the consecutive components with a component_packing_type flag indicating a packed pixel representation are interleaved, in the same order as signaled in the UncompressedVideoConfiguration box. In addition, the same number of values is expected for two consecutive components with a component_packing_type flag indicating a packed pixel representation.
The variable component_start_address[i] has the same meaning as in previous embodiments and is computed for each component of the sample. This variable is further modified below for the components with a packed representation:
    component_start_address[i] = sample_start_offset[j] + component_size_cumul[i]

The location of the n-th value of the i-th component is represented by the component_value_address[i][n] variable, wherein n is in the range from 0 to component_value_count-1 and i is in the range from 0 to component_count-1. If the component_packing_type is equal to 0 (planar mode), the location is expressed relative to the address of the first value, since the values of the component are contiguous in memory. It is equal to the address of the first value plus n times the length of the component value:
    component_value_address[i][n] = component_start_address[i] + n * component_value_length[i]

If the component_packing_type indicates a packed pixel representation for the i-th component, the component_stride variable is the length in bits taken by the set of consecutive packed component values for one pixel. It is equal to the sum of the lengths of component values. The component_value_length_cumul[i] variable is the sum of the lengths of the values of the components that have an index lower than the i-th component and are in the same packing set:
    component_packing_set_idx[0] = 0
    for (i = 1; i < component_count; i++) {
        component_packing_set_idx[i] = component_packing_set_idx[i-1]
        if (component_packing_type[i] == 0) {
            component_packing_set_idx[i] += 1
        }
        if (component_packing_set_idx[i] == component_packing_set_idx[i-1]) {
            component_start_address[i] = component_start_address[i-1]
            component_value_length_cumul[i] = component_value_length_cumul[i-1] + component_value_length[i-1]
            component_stride[component_packing_set_idx[i]] += component_value_length[i]
        } else {
            component_value_length_cumul[i] = 0
        }
    }

Therefore, the following formula computes the address of the n-th value of the i-th component (represented by component_value_address[i][n], with n in the range from 0 to component_value_count-1 and i in the range from 0 to component_count-1):

    component_value_address[i][n] = component_start_address[i] + component_value_length_cumul[i] + n * component_stride[component_packing_set_idx[i]]

In an alternative, the UncompressedVideoConfiguration includes a component_new_packing_set flag that is equal to one when the i-th component starts a new packing set. This allows defining, for example, four components that are packed two by two:
    class UncompressedVideoConfiguration extends Box('uncC') {
        utf8string uncompressed_video_format;
        unsigned int(8) component_count;
        for (i = 1; i <= component_count; i++) {
            UncompressedVideoSampleComponent();
            unsigned int(1) component_packing_type;
            unsigned int(1) component_new_packing_set;
            bit(6) reserved;
        }
    }

Instead of detecting that a new packing set starts when a component has a component_packing_type indicating a planar pixel representation, the detection is based on the value of component_new_packing_set. The algorithm above becomes:

    component_packing_set_idx[0] = 0
    for (i = 1; i < component_count; i++) {
        component_packing_set_idx[i] = component_packing_set_idx[i-1]
        if (component_new_packing_set[i] == 1) {
            component_packing_set_idx[i] += 1
        }
        if (component_packing_set_idx[i] == component_packing_set_idx[i-1]) {
            component_start_address[i] = component_start_address[i-1]
            component_value_length_cumul[i] = component_value_length_cumul[i-1] + component_value_length[i-1]
            component_stride[component_packing_set_idx[i]] += component_value_length[i]
        } else {
            component_value_length_cumul[i] = 0
        }
    }

In another alternative, the component_packing_type and component_new_packing_set fields are signalled in the UncompressedVideoSampleComponent structures instead of in the UncompressedVideoConfiguration.
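A Python transcription of the packing-set derivation (using component_new_packing_set; the flag of the first component is not consulted, since component_packing_set_idx[0] is 0 by definition):

    def packing_set_indexes(new_packing_set):
        idx = [0]
        for i in range(1, len(new_packing_set)):
            idx.append(idx[-1] + (1 if new_packing_set[i] == 1 else 0))
        return idx

    # Four components packed two by two: components 0-1 form set 0, 2-3 form set 1
    assert packing_set_indexes([1, 0, 1, 0]) == [0, 0, 1, 1]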
In another alternative, a code point information provided in the UncompressedVideoConfiguration allows inferring the component_packing_type and component_new_packing_set values. This code point information of component representations is an unsigned integer for which each value represents a predetermined component configuration.
For instance, a code point equal to 0 indicates that all components have a component_packing_type inferred to be planar. A code point equal to 1 indicates that all components have a component_packing_type inferred to be packed. A code point equal to 3 indicates that the first two components have a component_packing_type inferred to be packed and that the other components have a component_packing_type inferred to be planar. A code point equal to 4 indicates that the first three components have a component_packing_type inferred to be planar and that the other components have a component_packing_type inferred to be packed.
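As an illustration, such a mapping may be prototyped as follows; this is a minimal Python sketch covering only the example code points listed above (the meaning of other values, such as 2, is not defined here):

# Minimal Python sketch (illustrative) mapping the example code points above
# to inferred per-component packing types (0 = planar, 1 = packed).
def infer_packing_types(code_point: int, component_count: int) -> list:
    if code_point == 0:                      # all components planar
        return [0] * component_count
    if code_point == 1:                      # all components packed
        return [1] * component_count
    if code_point == 3:                      # first two packed, others planar
        return [1, 1] + [0] * (component_count - 2)
    if code_point == 4:                      # first three planar, others packed
        return [0, 0, 0] + [1] * (component_count - 3)
    raise ValueError("unhandled code point: %d" % code_point)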
In a second variant of the third embodiment, it is assumed that the byte or bit offset between any two values of a given component is constant. This constant corresponds to the component_value_length for a component in planar pixel representation, meaning that no further information is needed. For a component in packed pixel representation, the constant is signalled. The main difference with the previous embodiment is that the syntax element indicating the pixel representation is no longer present at the level of the UncompressedVideoConfiguration, which may use the following syntax:

class UncompressedVideoConfiguration extends Box('uncC') {
    utf8string uncompressed_video_format;
    unsigned int(8) component_count;
    for (i = 1; i <= component_count; i++) {
        UncompressedVideoSampleComponent()
    }
}

The UncompressedVideoSampleComponent includes additional information. The number and the length of the values of the component are still present. Additional syntax elements provide information on how to parse the component values in the media data box. This representation makes it possible to describe a hybrid pixel representation, i.e. when planar and packed representations are both used.
class UncompressedVideoSampleComponent() {
    unsigned int(32) component_value_count;
    unsigned int(8) component_value_length;
    unsigned int(32) component_first_value_address;
    unsigned int(1) component_packed_flag;
    bit(7) reserved;
    if (component_packed_flag == 1) {
        unsigned int(16) component_next_value_offset;
    }
}

For each component, the UncompressedVideoSampleComponent box has the following semantics:

* component_first_value_address is the address of the first value relative to the sample address. That means that a value of 0 corresponds to the first bit of the sample in the media data box.
* component_packed_flag equal to 0 indicates that all the values of the component are contiguous in the media data box. When equal to 1, it indicates that the values are packed with values of one or more other components.
* component_next_value_offset is an offset in bits (in one alternative, in bytes) from the address of the n-th value of the component to obtain the address of the n+1-th value. Typically, this value is a sixteen-bit unsigned integer. We refer to this information as the next value offset. This value indicates an offset between two consecutive component values.
In the following, an array variable with the same name as the items above is defined for which the i-th item corresponds to the item with the same name in the i-th UncompressedVideoSampleComponent structure of the UncompressedVideoConfiguration.
The media file parser may determine the address of the n-th value of the i-th component from the generic information provided in previously mentioned boxes as follows. The algorithm consists in determining the address of the first value of the i-th component. It is equal to the address in bits of the sample plus the address in bits of the first component as provided in UncompressedVideoSampleComponent structure of the i-th component. Then, the position of the next value is equal to the sum of this address in bits plus the value of the component_next_value_offset of the same structure. The same pattern applies for subsequent values: the address of the n-th value is equal to the address of the n-1-th value plus the next value offset.
component_value_address[i][0] = sample_start_offset[j] + component_first_value_address[i]
for (n = 1; n < component_value_count[i]; n++) {
    component_value_address[i][n] = component_value_address[i][n-1] + component_next_value_offset[i]
}

In a third variant of the third embodiment, the constraint of having a constant byte or bit offset between any two values of a given component is no longer assumed. This means that these offset values have to be signalled. In this alternative, the next value offset is an array that specifies the pattern of the next value offsets between two consecutive values of the component in the media data box order.
The UncompressedVideoSampleComponent box includes additional information with the following syntax, for example:

class UncompressedVideoSampleComponent() {
    unsigned int(32) component_value_count;
    unsigned int(8) component_value_length;
    unsigned int(32) component_first_value_address;
    unsigned int(1) component_packed_flag;
    bit(7) reserved;
    if (component_packed_flag == 1) {
        unsigned int(32) component_next_value_offset_count;
        for (k = 1; k <= component_next_value_offset_count; k++) {
            unsigned int(16) component_next_value_offset[k];
        }
    }
}
The media file parser may determine the address of the n-th value of the i-th component from the generic information provided in the previously mentioned boxes as follows. The algorithm consists in determining the address of the first value of the i-th component, which is equal to the address in bits of the sample plus the address in bits of the first value as provided in the UncompressedVideoSampleComponent structure of the i-th component. Then, the position of the next value is equal to this address in bits plus the value of component_next_value_offset[0] of the same structure. The same pattern applies for subsequent values: the address of the n-th value is equal to the address of the n-1-th value plus the k-th next value offset, with k equal to (n-1) modulo the component_next_value_offset_count value. This algorithm is equivalent to the following pseudo code:

k = 0;
component_value_address[i][0] = sample_start_offset[j] + component_first_value_address[i]
for (n = 1; n < component_value_count[i]; n++) {
    component_value_address[i][n] = component_value_address[i][n-1] + component_next_value_offset[i][k++]
    if (k == component_next_value_offset_count) {
        k = 0;
    }
}
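A compact equivalent of this pseudo code, written as a minimal Python sketch (names mirror the fields above; all quantities are in bits):

# Minimal Python sketch (illustrative) of the pattern-based address
# computation: the n-th value address is obtained by accumulating the
# next-value offsets, cycling through the signalled offset pattern.
def value_addresses(sample_start_offset: int, first_value_address: int,
                    next_value_offsets: list, value_count: int) -> list:
    addresses = [sample_start_offset + first_value_address]
    k = 0
    for _ in range(1, value_count):
        addresses.append(addresses[-1] + next_value_offsets[k])
        k = (k + 1) % len(next_value_offsets)   # wrap around the offset pattern
    return addresses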
In all the previous embodiments and their variants, the order of the component values for a given component was assumed to be the raster scan order of the picture. Typically, the raster scan order corresponds to the common scanning order of the picture, starting at the top left of the picture and then going through each row of the picture from left to right, starting from the topmost row. However, some sensors may provide data for which the starting point is the bottom right edge of the picture and rows are processed from right to left, starting from the bottommost row. In an alternative of the two previous variants, the representation order of the component values may be different from this raster scan order.
To describe these kinds of processing orders, the next offset values may be signed integers. As a result, the offset can be negative, which makes it possible to scan the component from the media data box in a backward scan order inside the bitstream. The component_first_value_address then indicates the first value of the component in the raster scan order. For example, let us consider an uncompressed video containing a single component whose values are provided starting from the bottom right edge of the picture. As a result, the first value in the media data box is the last value (i.e. the bottom right pixel) in processing order, and the last value in the media data box is the first value (i.e. the top left pixel) in processing order. In such a case, the component_first_value_address points to the last value in the mdat and the component_next_value_offset field indicates a negative offset value equal to minus the length of the value in bits.
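A minimal sketch of this backward scan, assuming a single 8-bit component of width x height values stored back to front (variable names are illustrative, not part of the specification):

# Minimal Python sketch (illustrative): addresses, in bits, of a single
# 8-bit component whose values are stored in the mdat from the bottom-right
# pixel to the top-left pixel. component_first_value_address points to the
# last value in the mdat; the signalled next-value offset is -8 bits.
def backward_scan_addresses(sample_start_offset: int, width: int, height: int) -> list:
    value_length = 8                                   # bits per value
    first_value_address = (width * height - 1) * value_length
    offset = -value_length                             # negative next-value offset
    address = sample_start_offset + first_value_address
    addresses = []
    for _ in range(width * height):                    # raster scan order
        addresses.append(address)
        address += offset
    return addresses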
In an alternative, the values of the components are not described in the raster scan order of the picture but rather in different zones. The scan of the component values follows the signalling order of the zones and, within each zone, it is in the raster scan order of the zone. The number of zones used for one component is typically signalled in the generic information, i.e., typically in the UncompressedVideoSampleComponent structure. For each zone, the coordinates of the top left and bottom right corners of the zone are provided. The media file parser may thus determine to which zone the n-th value of a component belongs and therefore may determine the coordinates of the value relative to this zone.
In an alternative, the rows or groups of rows of the sample may be interleaved, as in the sketch after this paragraph. For example, the bitstream first contains the values corresponding to the even rows, then the values corresponding to the odd rows. The signalling describes how the rows are split into different groups. For example, it may be signalled that a first group contains all the even rows and a second group contains all the odd rows. The signalling may also describe whether a group of rows may be shared by several samples or not. In a first example, the data corresponding to a first sample may be split into a first group of even rows and a first group of odd rows, and the data corresponding to a second sample may be split into a second group of even rows and a second group of odd rows. In another example, the data corresponding to a first sample may be split into a first group of even rows and a first group of odd rows, and the data corresponding to a second sample may be split into a second group of even rows and the first group of odd rows.
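For illustration, de-interleaving such a layout may look as follows; this is a minimal Python sketch assuming a hypothetical layout in which all even rows are stored first, followed by all odd rows:

# Minimal Python sketch (illustrative): rebuild raster order for a sample
# whose rows are stored as one group of even rows followed by one group of
# odd rows. 'rows' is the list of rows in media data box order.
def deinterleave_rows(rows: list) -> list:
    n = len(rows)
    even_count = (n + 1) // 2            # rows 0, 2, 4, ... are stored first
    even_rows = rows[:even_count]
    odd_rows = rows[even_count:]
    raster = []
    for r in range(n):
        raster.append(even_rows[r // 2] if r % 2 == 0 else odd_rows[r // 2])
    return raster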
In a fourth embodiment, the generic description information that describes the uncompressed video sequence further defines information that facilitates the extraction of a subset of component values within an area of the picture.
In order to extract a set of values corresponding to a given area, the file parser needs information that makes it possible to determine the coordinates of each value within the picture.
For an uncompressed video sequence that uses as many component values as pixels, the association is quite straightforward: the first value of each component is inferred to be at the coordinates (0, 0), and the N following values correspond to the values on the first row, with N equal to the picture width.
However, if the sampling rate of a component is different from the number of pixels, it is not possible to infer the location of the component value from the picture width and height only. For such kinds of uncompressed video sequences, the generic information further includes information making it possible to determine the spatial position of each component value.
For example, the UncompressedVideoSampleComponent structure defines the vertical_sampling_rate and horizontal_sampling_rate syntax elements. The values of these items indicate the sampling rate of each component relative to the width and height of the picture. Typically, a horizontal sampling rate of 1 indicates that, within a row of the picture, a component value is provided for each pixel, while a horizontal sampling rate equal to 2 indicates that one component value is available every two pixels within a row of the picture. The same principle applies for the vertical sampling rate on the columns of the picture.
class UncompressedVideoSampleComponent() {
    unsigned int(8) vertical_sampling_rate;
    unsigned int(8) horizontal_sampling_rate;
}

The media file parser determines the x-axis and y-axis coordinates in the picture of the n-th value of the i-th component as follows:

component_value_x[i][n] = (n % ceil(sample_width / horizontal_sampling_rate[i])) * horizontal_sampling_rate[i]
component_value_y[i][n] = floor(n / ceil(sample_width / horizontal_sampling_rate[i])) * vertical_sampling_rate[i]

The horizontal_sampling_rate[i] and vertical_sampling_rate[i] correspond to the values of horizontal_sampling_rate and vertical_sampling_rate of the i-th UncompressedVideoSampleComponent structure within the UncompressedVideoConfiguration box.
The value of component_value_x[i][n] ranges from 0 to sample_width-1 and the value of component_value_y[i][n] ranges from 0 to sample_height-1.
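These formulas translate directly into code; the following is a minimal Python sketch (names mirror the syntax elements above):

import math

# Minimal Python sketch (illustrative): picture coordinates of the n-th
# value of a component, derived from its horizontal and vertical sampling
# rates as in the formulas above.
def component_value_position(n: int, sample_width: int, h_rate: int, v_rate: int):
    values_per_row = math.ceil(sample_width / h_rate)
    x = (n % values_per_row) * h_rate
    y = (n // values_per_row) * v_rate
    return x, y

# Example: a chroma component sampled every 2 pixels both horizontally and
# vertically in a 1920-wide picture: component_value_position(n, 1920, 2, 2)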
The additional information provided in the generic description information makes it possible to perform sub-bitstream extraction of a spatial region within the picture.
In a variant of this embodiment, the spatial sampling position of each component value within a sample/picture is indicated using a bit matrix. This bit matrix describes these spatial sampling positions as a pattern. This makes it possible to specify the spatial position of component values with a semi-regular sampling rate or with a sampling rate that varies for at least two rows of pixels of the sample/picture. For example, it may be used to specify the spatial position of component values for an image generated by a sensor using a Bayer filter.
For example, the UncompressedVideoSampleComponent structure may contain the following additional fields:

class UncompressedVideoSampleComponent() {
    unsigned int(8) matrix_columns;
    unsigned int(8) matrix_rows;
    for (int k = 0; k < matrix_columns * matrix_rows; k++) {
        unsigned int(1) matrix_present;
    }
}

The matrix_columns field indicates the number of columns in the bit matrix. For example, in the case of a typical Bayer filter, the number of columns is 2.
The matrix_rows field indicates the number of rows in the bit matrix. For example, in the case of a typical Bayer filter, the number of rows is 2.
Then for each element of the bit matrix, the matrix_present flag indicates if the component defines a value for the corresponding spatial location. The pattern described by the bit matrix is repeated horizontally and vertically to allow determining whether the component data defines a value at any given location.
For example, for the YUV I420 format, the bit matrix may have 2 rows and 4 columns. For the Y component, all the matrix_present flags are set to 1 to indicate that this component defines a value for all the pixels at the corresponding position in the picture. For the U and V components, only the first and third elements of the first row have the matrix_present flag set to 1.
In another example, the size of the bit matrix corresponds to the size of the sample/picture. Thus, it may cover the whole image area to represent an irregular sampling pattern.
The media file parser may determine the x-axis (represented by the component_value_x[i][n] variable) and y-axis (represented by the component_value_y[i][n] variable) coordinates in the picture of the n-th value of the i-th component using the following algorithm, where matrix[i][c, r] is the value of the matrix_present field at column c and row r in the bit matrix of the i-th component.
matrix_column = 0;
matrix_row = 0;
x = 0;
y = 0;
n = 0;
while (y < sample_height) {
    if (matrix[i][matrix_column, matrix_row] == 1) {
        component_value_x[i][n] = x;
        component_value_y[i][n] = y;
        n++;
    }
    matrix_column++;
    if (matrix_column == matrix_columns[i]) {
        matrix_column = 0;
    }
    x++;
    if (x == sample_width) {
        x = 0;
        matrix_column = 0;
        y++;
        matrix_row++;
        if (matrix_row == matrix_rows[i]) {
            matrix_row = 0;
        }
    }
}

In a variant, a single matrix is signalled and applies to all the components of the sample. This combined matrix may be stored in the UncompressedVideoConfiguration box.
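Since the pattern described by the bit matrix repeats horizontally and vertically, the lookup algorithm above reduces to a modulo lookup; the following minimal Python sketch gives an equivalent, more compact formulation (the bit matrix is represented as a list of rows):

# Minimal Python sketch (illustrative): presence of a component value at
# pixel (x, y) is a modulo lookup into the repeating bit matrix.
def component_positions(bit_matrix: list, sample_width: int, sample_height: int) -> list:
    rows = len(bit_matrix)
    cols = len(bit_matrix[0])
    positions = []
    for y in range(sample_height):           # raster scan order
        for x in range(sample_width):
            if bit_matrix[y % rows][x % cols] == 1:
                positions.append((x, y))
    return positions

# Example: U component of a YUV I420 picture (a value every 2 pixels in even rows):
# component_positions([[1, 0], [0, 0]], 8, 4)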
In a variant, a single matrix is signalled and each value of the matrix is a field of bits, with one bit associated with each component of the sample. This combined matrix may be stored in the UncompressedVideoConfiguration box.
In another variant of the fourth embodiment, the vertical_sampling_rate and/or the horizontal_sampling_rate may be arrays of integers, which allows representing a non-regular sampling rate. For example, the horizontal_sampling_rate may be the array [2, 3], signalling that the sampling pattern is the following: the component data defines a value for the first and third pixels of a pixel row and does not define a value for the second, fourth and fifth pixels of the same row. Then, the same pattern applies for the subsequent pixels of the row: the component data defines a value for the sixth and eighth pixels of the row but not for the seventh, ninth and tenth pixels of the row.
Alternatively, the vertical_sampling_rate and/or the horizontal_sampling_rate may be floating-point values or fixed-point decimal values to represent non-integer sampling rates. In a variant, an offset is associated with the vertical_sampling_rate and/or the horizontal_sampling_rate, allowing the sampling for the component to start at a different position than the first row or first column of the picture. For example, a horizontal_sampling_rate value of 2 with an associated offset of 1 signals that the horizontal sampling starts on the second column and continues every other column, i.e., the fourth, the sixth...
Alternatively, a list of horizontal_sampling_rate values and possibly of horizontal offsets is signalled for a component, allowing different horizontal sampling patterns for different rows. These horizontal_sampling_rate values and horizontal offsets are used in turn and repeated until the last row. For example, if the vertical_sampling_rate is 2 and the horizontal_sampling_rate values are 2 and 3, the component data defines a value for the first and third pixels in the first row and so on for the following pixels in the row. In the second row, the component data defines no values for any pixels. In the third row, values are defined for the first, fourth, seventh pixels and so on. In the fourth row, no values are defined for any pixels. Then, for the fifth row, the first horizontal_sampling_rate value is used again.
Alternatively, the vertical_sampling_rate and/or the horizontal_sampling_rate may be a list of values allowing for a variable sampling rate between different frames.
Possibly, several bit matrices may be signalled to allow a variable sampling pattern between different frames.
The features of all these alternatives may also be mixed.
In all the previous embodiments, it was assumed that the media data box comprises only the raw video data. It may happen that uncompressed video sequences comprise additional information describing the content of the sample or of a component. This additional information is typically metadata, e.g., a description of the sample content or proprietary information describing the model of the sensor, and may be located at different locations in the bitstream. Typically, this additional information may be present at the beginning of a sample or a component. It can also be located after the data corresponding to the sample or to the component.
In a fifth embodiment, the generic description information includes information indicating the presence and the size of a prefix and/or a suffix at the level of the sample and/or at the level of a component.
For the information at sample level, the UncompressedVideoConfiguration box may signal the characteristics of the prefix and/or suffix, for example with the following syntax:

class UncompressedVideoConfiguration extends Box('uncC') {
    unsigned int(1) sample_prefix;
    if (sample_prefix == 1) {
        unsigned int(1) sample_skip_prefix;
        bit(7) reserved;
        unsigned int(32) sample_prefix_size;
    } else {
        bit(7) reserved;
    }
    unsigned int(1) sample_suffix;
    if (sample_suffix == 1) {
        unsigned int(1) sample_skip_suffix;
        bit(7) reserved;
        unsigned int(32) sample_suffix_size;
    } else {
        bit(7) reserved;
    }
    for (i = 1; i <= component_count; i++) {
        UncompressedVideoSampleComponent()
    }
}
The UncompressedVideoSampleComponent box may define the prefix and suffix at the component level with the following syntax:

class UncompressedVideoSampleComponent() {
    unsigned int(1) component_prefix;
    if (component_prefix == 1) {
        unsigned int(1) component_skip_prefix;
        bit(7) reserved;
        unsigned int(32) component_prefix_size;
    } else {
        bit(7) reserved;
    }
    unsigned int(1) component_suffix;
    if (component_suffix == 1) {
        unsigned int(1) component_skip_suffix;
        bit(7) reserved;
        unsigned int(32) component_suffix_size;
    } else {
        bit(7) reserved;
    }
}

The syntax of the prefix and suffix at the sample or component level uses the same elements:

* A flag indicates if the suffix (or prefix) is present at the concerned level:
  o component_suffix (or component_prefix) for the components
  o sample_suffix (or sample_prefix) for the sample
* When the flag is equal to 1, the size in bits (or bytes) of the suffix (or prefix) is provided as an unsigned integer:
  o component_suffix_size (or component_prefix_size) for the components
  o sample_suffix_size (or sample_prefix_size) for the sample

When the flag is equal to 0, there is no suffix (or prefix) and the size of the suffix (or prefix) is inferred to be equal to 0.

In an alternative, represented in the above example, a flag indicates whether the parser should skip the data of the prefix (or suffix) when extracting the component:

* component_skip_suffix (or component_skip_prefix) for the components
* sample_skip_suffix (or sample_skip_prefix) for the sample

A media file parser has to offset the start address of the sample or component when a prefix is present in the sample. For instance, for the first embodiment with prefix and suffix information at the sample level, the following algorithm may be used when the component representation is planar. The size of the component determined from the sample size is therefore decreased by the size of the prefix and suffix.
component_size = (sample_size * 8 - sample_prefix_size - sample_suffix_size) / component_count

When components also include suffixes and/or prefixes, their sizes must be subtracted from the sample_size value to compute the component sizes.
The start address of each component depends on the representation format. If a planar representation is used, the first value of the i-th component is located at the bit location of the j-th sample (sample_start_offset[j]) plus the product of i and the component_size. The start address of the i-th component takes into account the size of the sample prefix and may be computed according to:

component_start_address[i] = sample_start_offset[j] + sample_prefix_size + i * component_size

When components also include suffixes and prefixes, the suffix and prefix sizes of the components preceding the current component, plus the prefix size of the current component, must be added to the component start address.
The same principle also applies for packed representations. For some of the other embodiments, since the address of the first value of the component and the offset are expressed relative to the start address of the sample, the media file parser has no need to offset the address to take into account the presence of prefix and/or suffix information.
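For the planar case of the first embodiment, the computation above may be prototyped as follows; this is a minimal Python sketch (illustrative names; all sizes in bits except the sample size, which is in bytes as in the formula above):

# Minimal Python sketch (illustrative): start address, in bits, of the i-th
# planar component of a sample when sample-level and component-level
# prefixes/suffixes are present.
def component_start_address(sample_start_offset: int, sample_size_bytes: int,
                            sample_prefix_size: int, sample_suffix_size: int,
                            component_prefix_sizes: list,
                            component_suffix_sizes: list, i: int) -> int:
    count = len(component_prefix_sizes)
    payload = (sample_size_bytes * 8 - sample_prefix_size - sample_suffix_size
               - sum(component_prefix_sizes) - sum(component_suffix_sizes))
    component_size = payload // count                # raw value data per component
    address = sample_start_offset + sample_prefix_size
    for c in range(i):                               # skip the preceding components
        address += (component_prefix_sizes[c] + component_size
                    + component_suffix_sizes[c])
    return address + component_prefix_sizes[i]       # skip this component's prefix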
In a variant of all the previous embodiments, a flag is present in the generic description information indicating whether the component is a coloured component or not.
This makes it possible for the file parser to determine and to extract only the components that encode image data. Therefore, the parser is able to ignore components that define other types of data, such as transparency information.
In another variant of all previous embodiments, a flag is present in the generic description information indicating whether the component is essential (meaning that the parser must systematically extract the component by default) or whether it is supplemental or optional (meaning that the parser may ignore the component unless explicitly requested). For instance, in an RGB sequence with a transparency component, the R, G and B components may be set as essential while the transparency component is supplemental.
In all the variants, the size of the different fields may be different. They may also depend on one or more flags to allow variable field sizes.
In all the variants, some fields present in the UncompressedVideoSampleComponent box may be moved into the UncompressedVideoConfiguration box, enforcing the same value for all the components. Possibly, a default value may be provided in the UncompressedVideoConfiguration box with an optional field to override it in the UncompressedVideoSampleComponent box.
The described embodiments may be combined if necessary according to the kind of constraints assumed in the format of the uncompressed video sequence.

Figure 6 is a schematic block diagram of a computing device 600 for implementation of one or more embodiments of the invention. The computing device 600 may be a device such as a micro-computer, a workstation or a light portable device. The computing device 600 comprises a communication bus connected to:
- a central processing unit 601, such as a microprocessor, denoted CPU;
- a random access memory 602, denoted RAM, for storing the executable code of the method of embodiments of the invention as well as the registers adapted to record variables and parameters necessary for implementing the method according to embodiments of the invention; the memory capacity thereof can be expanded by an optional RAM connected to an expansion port, for example;
- a read only memory 603, denoted ROM, for storing computer programs for implementing embodiments of the invention;
- a network interface 604, typically connected to a communication network over which digital data to be processed are transmitted or received. The network interface 604 can be a single network interface, or composed of a set of different network interfaces (for instance wired and wireless interfaces, or different kinds of wired or wireless interfaces). Data packets are written to the network interface for transmission or are read from the network interface for reception under the control of the software application running in the CPU 601;
- a graphical user interface 605 that may be used for receiving inputs from a user or to display information to a user;
- a hard disk 606, denoted HD, that may be provided as a mass storage device;
- an I/O module 607 that may be used for receiving/sending data from/to external devices such as a video source or display.
The executable code may be stored either in read only memory 603, on the hard disk 606 or on a removable digital medium such as for example a disk. According to a variant, the executable code of the programs can be received by means of a communication network, via the network interface 604, in order to be stored in one of the storage means of the communication device 600, such as the hard disk 606, before being executed.
The central processing unit 601 is adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to embodiments of the invention, which instructions are stored in one of the aforementioned storage means. After powering on, the CPU 601 is capable of executing instructions from main RAM memory 602 relating to a software application after those instructions have been loaded from the program ROM 603 or the hard-disc (HD) 606 for example. Such a software application, when executed by the CPU 601, causes the steps of the flowcharts of the invention to be performed.
Any step of the algorithms of the invention may be implemented in software by execution of a set of instructions or program by a programmable computing machine, such as a PC ("Personal Computer"), a DSP ("Digital Signal Processor") or a microcontroller; or else implemented in hardware by a machine or a dedicated component, such as an FPGA ("Field-Programmable Gate Array") or an ASIC ("Application-Specific Integrated Circuit").
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications which lie within the scope of the present invention will be apparent to a person skilled in the art.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
Each of the embodiments of the invention described above can be implemented solely or as a combination of a plurality of the embodiments. Also, features from different embodiments can be combined where necessary or where the combination of elements or features from individual embodiments in a single embodiment is beneficial.
In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (19)

  1. A method of encapsulating an uncompressed video sequence in a media file, the uncompressed video sequence comprising uncompressed samples, wherein the method comprises: generating generic description information describing the video data and indicating that the video sequence is uncompressed; generating sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and embedding the generic description information, the generic sample description information, and the uncompressed video sequence in the media file.
  2. The method of claim 1, wherein the sample description information further comprises a number of component values for at least one component.
  3. The method of claim 1, wherein the sample description information further comprises a component description information for describing at least a component, the sample description information comprising at least a number of component values and a component value length for the component.
  4. The method of claim 3, wherein the component representation information and the component description information in the sample description information are provided for each component.
  5. The method of claim 4, wherein the sample description information comprises for each component an indication indicating whether the component is the first component of a packing set, a packing set being a set of consecutive components with a pixel representation corresponding to a packed pixel representation.
  6. The method of claim 3, wherein the component description further comprises: the component representation information as an indication whether the component has a packed pixel representation; and if the component has a packed pixel representation, one or more offset values between two consecutive component values.
  7. The method of claim 3, wherein the component description information further comprises a vertical and horizontal sampling rate indicating the sampling rate of component values relatively to width and height of the sample for the component.
  8. The method of claim 1, wherein the sample description box further comprises: a prefix description information comprising an indication whether the sample data comprises a prefix, and if the sample data comprises a prefix the size of the prefix; and a suffix description information comprising an indication whether the sample data comprises a suffix, and if the sample data comprises a suffix the size of the suffix.
  9. The method of claim 8, wherein the prefix description information and the suffix description information further comprises an indication whether the prefix, respectively suffix, data should be skipped by a parser extracting the component.
  10. The method of claim 3, wherein the component description information further comprises: a prefix description information comprising an indication whether the component data comprises a prefix, and if the component data comprises a prefix the size of the prefix; and a suffix description information comprising an indication whether the component data comprises a suffix, and if the component data comprises a suffix the size of the suffix.
  11. The method of claim 10, wherein the prefix description information and the suffix description information further comprises an indication whether the prefix, respectively suffix, data should be skipped by a parser extracting the component.
  12. The method of claim 1, wherein the generic description information further comprises for each component an indication whether the component is a coloured component.
  13. The method of claim 1, wherein the generic description information further comprises for each component an indication whether the component is an essential component.
  14. The method of claim 1, wherein the media file is compliant with the ISOBMFF standard, and wherein the generic description information is stored as a SampleEntry box or as a SampleGroupDescriptionEntry box.
  15. A method of reading a media file comprising an uncompressed video sequence, the uncompressed video sequence comprising uncompressed samples, wherein the method comprises: obtaining from the media file a generic description information describing the video data and indicating that the video sequence is uncompressed; obtaining from the media file a sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and reading from the media file the uncompressed video sequence based on the generic description information and the sample description information.
  16. A computer program product for a programmable apparatus, the computer program product comprising a sequence of instructions for implementing a method according to any one of claims 1 to 15, when loaded into and executed by the programmable apparatus.
  17. A computer-readable storage medium storing instructions of a computer program for implementing a method according to any one of claims 1 to 15.
  18. A computer program which upon execution causes the method of any one of claims 1 to 15 to be performed.
  19. A device for encapsulating an uncompressed video sequence in a media file, the uncompressed video sequence comprising uncompressed samples, wherein the device comprises a processor configured for: generating generic description information describing the video data and indicating that the video sequence is uncompressed; generating sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and embedding the generic description information, the generic sample description information, and the uncompressed video sequence in the media file.
  20. A device for reading a media file comprising an uncompressed video sequence, the uncompressed video sequence comprising uncompressed samples, wherein the device comprises a processor configured for: obtaining from the media file a generic description information describing the video data and indicating that the video sequence is uncompressed; obtaining from the media file a sample description information indicating a number of components used for encoding pixel information for at least one sample; wherein the sample description information comprises for at least one component a component representation information indicating a pixel representation of the component; and reading from the media file the uncompressed video sequence based on the generic description information and the sample description information.
GB2100151.6A 2021-01-06 2021-01-06 Method and apparatus for encapsulating uncompressed video data into a file Pending GB2602642A (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
GB2100151.6A GB2602642A (en) 2021-01-06 2021-01-06 Method and apparatus for encapsulating uncompressed video data into a file
GBGB2109380.2A GB202109380D0 (en) 2021-01-06 2021-06-29 Method and apparatus for encapsulating uncompressed video data into a file
GBGB2113868.0A GB202113868D0 (en) 2021-01-06 2021-09-28 Method and apparatus for encapsulating uncompressed video data into a file
GB2310642.0A GB2617048A (en) 2021-01-06 2021-11-05 Method and apparatus for encapsulating uncompressed images and uncompressed video data into a file
GB2115960.3A GB2602714B (en) 2021-01-06 2021-11-05 Method and apparatus for encapsulating uncompressed images into a file
PCT/EP2022/050035 WO2022148729A1 (en) 2021-01-06 2022-01-03 Method and apparatus for encapsulating uncompressed images and uncompressed video data into a file
US18/260,523 US20240064349A1 (en) 2021-01-06 2022-01-03 Method and apparatus for encapsulating uncompressed images and uncompressed video data into a file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2100151.6A GB2602642A (en) 2021-01-06 2021-01-06 Method and apparatus for encapsulating uncompressed video data into a file

Publications (2)

Publication Number Publication Date
GB202100151D0 GB202100151D0 (en) 2021-02-17
GB2602642A true GB2602642A (en) 2022-07-13

Family

ID=74566443

Family Applications (4)

Application Number Title Priority Date Filing Date
GB2100151.6A Pending GB2602642A (en) 2021-01-06 2021-01-06 Method and apparatus for encapsulating uncompressed video data into a file
GBGB2109380.2A Ceased GB202109380D0 (en) 2021-01-06 2021-06-29 Method and apparatus for encapsulating uncompressed video data into a file
GBGB2113868.0A Ceased GB202113868D0 (en) 2021-01-06 2021-09-28 Method and apparatus for encapsulating uncompressed video data into a file
GB2115960.3A Active GB2602714B (en) 2021-01-06 2021-11-05 Method and apparatus for encapsulating uncompressed images into a file

Family Applications After (3)

Application Number Title Priority Date Filing Date
GBGB2109380.2A Ceased GB202109380D0 (en) 2021-01-06 2021-06-29 Method and apparatus for encapsulating uncompressed video data into a file
GBGB2113868.0A Ceased GB202113868D0 (en) 2021-01-06 2021-09-28 Method and apparatus for encapsulating uncompressed video data into a file
GB2115960.3A Active GB2602714B (en) 2021-01-06 2021-11-05 Method and apparatus for encapsulating uncompressed images into a file

Country Status (1)

Country Link
GB (4) GB2602642A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861301A (en) * 2023-09-04 2023-10-10 山东爱福地生物股份有限公司 Management method and system for biomass fuel data produced by straw

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546027A (en) * 2013-04-09 2017-07-05 Canon Kk Method, device, and computer program for encapsulating partioned timed media data
EP3257260A1 (en) * 2015-02-10 2017-12-20 C/o Canon Kabushiki Kaisha Image data encapsulation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108965929B (en) * 2017-05-23 2021-10-15 华为技术有限公司 Video information presentation method, video information presentation client and video information presentation device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546027A (en) * 2013-04-09 2017-07-05 Canon Kk Method, device, and computer program for encapsulating partioned timed media data
EP3257260A1 (en) * 2015-02-10 2017-12-20 C/o Canon Kabushiki Kaisha Image data encapsulation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861301A (en) * 2023-09-04 2023-10-10 山东爱福地生物股份有限公司 Management method and system for biomass fuel data produced by straw
CN116861301B (en) * 2023-09-04 2023-11-24 山东爱福地生物股份有限公司 Management method and system for biomass fuel data produced by straw

Also Published As

Publication number Publication date
GB2602714B (en) 2023-08-30
GB2602714A (en) 2022-07-13
GB202100151D0 (en) 2021-02-17
GB202109380D0 (en) 2021-08-11
GB202113868D0 (en) 2021-11-10

Similar Documents

Publication Publication Date Title
US11805304B2 (en) Method, device, and computer program for generating timed media data
US10567784B2 (en) Description of image composition with HEVC still image file format
US11638066B2 (en) Method, device and computer program for encapsulating media data into a media file
US20230025332A1 (en) Method, device, and computer program for improving encapsulation of media content
US11985339B2 (en) Method and apparatus for encapsulating images or sequences of images with proprietary information in a file
US11886487B2 (en) Method, device, and computer program for encapsulating media data into a media file
GB2602642A (en) Method and apparatus for encapsulating uncompressed video data into a file
US11575951B2 (en) Method, device, and computer program for signaling available portions of encapsulated media content
WO2022148650A1 (en) Method, device, and computer program for encapsulating timed media content data in a single track of encapsulated media content data
GB2575288A (en) Method and apparatus for encapsulating images or sequences of images with proprietary information in a file
US20240064349A1 (en) Method and apparatus for encapsulating uncompressed images and uncompressed video data into a file
GB2599171A (en) Method and apparatus for encapsulating video data into a file
WO2022148729A1 (en) Method and apparatus for encapsulating uncompressed images and uncompressed video data into a file
US20240107129A1 (en) Method and apparatus for encapsulating image data in a file for progressive rendering
GB2602101A (en) Method and apparatus for encapsulating image data in a file for progressive rendering
WO2022129235A1 (en) Method and apparatus for encapsulating image data in a file for progressive rendering
WO2022148651A1 (en) Method, device, and computer program for optimizing encapsulation of images
CN116648917A (en) Method and apparatus for encapsulating image data in a file for progressive rendering