WO2014033504A1 - Method and system for segmenting and separately package audio, video, subtitle and metadata - Google Patents

Method and system for segmenting and separately package audio, video, subtitle and metadata Download PDF

Info

Publication number
WO2014033504A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
audio
video
metadata
data packet
Prior art date
Application number
PCT/IB2012/054532
Other languages
French (fr)
Inventor
Sagara WIJETUNGA
Original Assignee
Wijetunga Sagara
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wijetunga Sagara filed Critical Wijetunga Sagara
Priority to PCT/IB2012/054532 priority Critical patent/WO2014033504A1/en
Publication of WO2014033504A1 publication Critical patent/WO2014033504A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments


Abstract

A content-aware method of segmenting and packaging audio, video, subtitle and metadata is disclosed here. The method takes input from one or more input sources where the audio, video, subtitle and metadata are encapsulated into one or more container formats, or reads raw audio data from an audio capturing device/system such as a microphone and/or raw video data from a video capturing device/system such as an image sensor or a video camera system. The method extracts the audio, video and subtitle data together with their timing information from the container or, when reading raw audio and/or video data from a capturing device/system, optionally encodes the audio and/or video data, and packages them separately: audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery. Chunks output by the method may be kept as separate individual files or be packed into a single large file.

Description

METHOD AND SYSTEM FOR SEGMENTING AND SEPARATELY
PACKAGE AUDIO, VIDEO, SUBTITLE AND METADATA
DESCRIPTION
The invention is described here, by way of example only, with reference to the accompanying drawings. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention.
[Dl] FIELD AND BACKGROUND OF THE INVENTION
This invention is in the fields of content segmentation and packaging of audio visual data. Its method of segmentation is content-aware segmentation. It packages content in terms of chunks, in final form, that is, ready to be consumed or played by a compatible audio visual player.
The present method of packaging audio, video and subtitle data is to either (1) encapsulate the audio, video and subtitle data together with timing or synchronisation data into a container format such as a VOB (Video Object), AVI (Audio Video Interleaved, the standard Microsoft Windows container), MPEG program stream, MP4 (standard audio and video container for the MPEG-4 multimedia portfolio), MKV (Matroska), WebM, etc., or (2) encapsulate the audio, video and subtitle data together with timing or synchronisation data into a transport stream such as an MPEG transport stream (MPEG-TS, MTS or TS), MPEG-2 Transport Stream (M2TS), etc. Note, not all container formats support subtitles. Refer FIG. 01.
The present method of packaging audio, video and subtitle data normally results in a single file and, depending on the resolution, duration, etc., generates anything from a small file (eg. 5MB, 10MB, etc.) to a very large file (eg. 5GB, 10GB, etc.) in the case of high definition video content. The file can be even larger for 3D high definition video content with multiple left and right video streams and multiple high quality audio channels.
The invention disclosed here can generate thousands of chunks for one hour of high definition video.
[D2] SUMMARY OF THE INVENTION
The claimed invention (i.e. Claim 1) is a content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata as audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery.
The content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata has three (3) broad functions: (1) Input reading, (2)
Segmentation Process and (3) Packaging audio visual data into chunks.
The above (1) Input reading: Input source is read in terms of data packets as per "[D4.3] Input source reading rules". A data packet contains: (a) one or more audio frames, or (b) one video frame, or (c) other data in between valid frames depending on the codecs used. The system does not decode or uncompress the audio visual data. As for audio frames and video frames, the codecs being used for audio and video define what a frame or an access unit is.
The above (2) Segmentation Process: The segmentation is content-aware. The data packets read are separated according to their streams. Data packets are queued to First-In-First-Out (FIFO) queues (or to any other structure where they can be read later in FIFO order). When a FIFO queue/structure exceeds its relevant chunk size, the system triggers the packaging. Refer FIG. 02.
The above (3) Packaging audio visual data into chunks: The packaging is done as per the Claim 4 (A method of packaging data with single metadata file), Refer FIG. 03 or as per the Claim 5 (A method of packaging data with multiple metadata files), Refer FIG. 04. One of these packaging methods could be used to package data.
As per the Claim 4 (A method of packaging data with single metadata file), the system generates only one metadata file, known as the master metadata file. Refer FIG. 03.
As per the Claim 5 (A method of packaging data with multiple metadata files), the system generates multiple metadata files. The master metadata file will always get generated. Refer FIG. 04.
A suitable player which is aware of the data packaging protocol, that is, whether the audio visual data are packaged as per the Claim 4 (A method of packaging data with single metadata file) or as per the Claim 5 (A method of packaging data with multiple metadata files), can reconstruct audio visual data with reference to the master metadata file and play in synchronisation.
Because the innovation disclosed segments audio visual content into smaller chunks using a content-aware segmentation method, it is useful in content sharing over peer-to-peer (p2p) networks.
[D3] BRIEF DESCRIPTION OF THE DRAWINGS
With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention, and are presented in the context of the most useful and readily understood description of the principles and conceptual aspects of the invention.
FIG. 1 is a diagrammatic representation of the present audio visual data packaging methods. It is provided to visually show the difference between the invention disclosed here and existing methods.
FIG. 2 is a diagrammatic representation of the Method of segmenting audio visual data.
FIG. 3 is a diagrammatic representation of packaging audio visual data with single metadata file.
FIG. 4 is a diagrammatic representation of packaging audio visual data with multiple metadata files.
[D4] DETAILS OF CARRYING OUT THE INNOVATION
[D4.1] Input to the system
The input to the system can be from one or more input sources. Input sources can be (1) files, (2) transport streams, (3) raw audio data read from an audio capturing device/system and/or raw video data read from a video capturing device/system, or (4) any combination of (1), (2) and (3).
Example: (a) It can be a single file where audio, video, subtitle and metadata are encapsulated into a container format such as AVI (Audio Video Interleaved, the standard Microsoft Windows container), MKV (Matroska), MP4 (standard audio and video container for the MPEG-4 multimedia portfolio), etc., (b) Transport streams, such as MPEG-2 transport stream (a.k.a. M2TS), etc., where audio, video, subtitle and metadata are encapsulated into a particular container format, (c) Direct reading from a microphone and/or image sensor or video camera, etc.
[D4.2] Input reading
Input source is read in terms of data packets as per "[D4.3] Input source reading rules".
The audio, video and subtitle data are either compressed or raw data. They are (1) extracted, when reading from a container, in terms of data packets suitable for the purpose, together with timing and other information related to the said data packet, or (2) when reading raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, prepared as data packets suitable for the purpose with timing and other information related to the said data packet.
The system does not decode the data before packaging. The system may optionally encode or compress audio visual data, when it reads raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, before packaging.
Data packets can be presented to the content segmentation function of the system in encoded order (i.e. bitstream order) or in display order. The display order is preferred. The system may be configured to adjust/calculate the presentation time stamps and presentation durations where necessary.
[D4.3] Input source reading rules
Input source reading is done by a function as per the following rules:
1. Reads the next data packet of a stream. Streams are audio, video and subtitle streams.
2. The data packet has a defined data structure. PTS (Presentation Time Stamp), DTS (Decompression Time Stamp), Data/Payload (a pointer to the data/payload), Data Length, stream_index (ie. which stream, eg. audio, video or subtitle), Flags, and Presentation Duration (the duration of this packet) are essential fields. These fields are defined under [D4.12] Definitions; an illustrative sketch of this structure follows this list.
3. The returned data packet may not always contain a valid frame or frames. It should also return data in between valid frames. The objective is to provide more data for the decoder.
4. For video, the data packet contains exactly one video frame.
5. For audio, the data packet contains multiple audio frames if audio frames are fixed size, or one audio frame per data packet if audio frames are variable size.
6. Returns data packets in display order.
7. Video PTS, Video DTS and Presentation Duration are in video timebase units. Audio PTS and Audio DTS are in audio timebase units. 0 if unknown. The video timebase unit and audio timebase unit are defined under [D4.12] Definitions.
8. PTS, DTS and Presentation Duration values will be calculated or adjusted if necessary, and guessed if the container cannot provide them.
9. PTS can be NOPTS_VALUE (ie. an invalid value for PTS) if the container has B-frames; in such a case the DTS will be set to the correct Presentation Time Stamp value, otherwise PTS and DTS will be set to equal values.
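As an illustration only, a minimal sketch of such a data packet structure is given below, in Python. The field names mirror the essential fields of rule 2 (PTS, DTS, Data/Payload, Data Length, stream_index, Flags, Presentation Duration); representing the NOPTS_VALUE sentinel of rule 9 by None is an assumption made here, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

NOPTS_VALUE = None  # assumed sentinel for "no valid PTS" (see rule 9)


@dataclass
class DataPacket:
    """One data packet as returned by the input reading function (rules 1-9)."""
    stream_index: int      # which stream: audio, video or subtitle
    pts: Optional[int]     # Presentation Time Stamp, in the stream's timebase units
    dts: Optional[int]     # Decompression Time Stamp, in the stream's timebase units
    flags: int             # key frame / i-frame / corrupted packet indicators, etc.
    duration: int          # Presentation Duration of this packet
    payload: bytes         # Data/Payload; its length is the Data Length

    @property
    def data_length(self) -> int:
        return len(self.payload)
```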
[D4.4] Segmentation Process
The segmentation process comprises the following steps:
(a) The system creates First-In-First-Out (FIFO) queues, or any other structures that can be read later in FIFO order, to the number of audio visual streams available in the input, unless the user requests a selected number of streams.
(b) The system reads the next data packet from the input and places it to the relevant FIFO queue/structure.
(c) The system repeats the data packet reading until a FIFO queue/structure exceeds the applicable chunk size.
(d) Once a FIFO queue/structure exceeds its applicable chunk size, that FIFO queue/structure is then written to a new chunk according to one of the packaging methods.
(e) Once the data packet reading from the input is completed, all non-empty FIFO queues/structures are packaged into new chunks according to the packaging method selected.
FIG. 02 depicts this process diagrammatically.
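As an illustration only, steps (a) to (e) above could be sketched roughly as the following loop, in Python, reusing the DataPacket structure sketched earlier. The names read_next_packet, chunk_size_for and package are hypothetical placeholders standing in for the input reading function of [D4.3], the chunk-size configuration of [D4.5] and one of the packaging methods of [D4.6]/[D4.7].

```python
from collections import deque


def segment(read_next_packet, stream_indexes, chunk_size_for, package):
    """Content-aware segmentation: queue packets per stream, package a queue
    whenever it exceeds its applicable chunk size (steps (a)-(e) of [D4.4])."""
    # (a) one FIFO queue per selected audio/video/subtitle stream
    queues = {i: deque() for i in stream_indexes}
    sizes = {i: 0 for i in stream_indexes}        # bytes currently queued per stream

    while True:
        packet = read_next_packet()               # (b) read the next data packet
        if packet is None:                        # end of input
            break
        if packet.stream_index not in queues:     # stream not selected by the user
            continue
        queues[packet.stream_index].append(packet)
        sizes[packet.stream_index] += packet.data_length

        # (c)/(d) package the queue that exceeded its applicable chunk size
        if sizes[packet.stream_index] > chunk_size_for(packet.stream_index):
            package(packet.stream_index, list(queues[packet.stream_index]))
            queues[packet.stream_index].clear()
            sizes[packet.stream_index] = 0

    # (e) package all non-empty queues once reading from the input is completed
    for i, q in queues.items():
        if q:
            package(i, list(q))
```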
[D4.5] Chunk Size
The system can be configured to have different chunk sizes per data packet queue.
The system can be configured to differentiate between audio and video, eg. audio chunk maximum size 1MB, video chunk maximum size 4MB. The system can be configured to create different chunk sizes based on the size of the input source, eg. if the input source is less than 4GB, video chunk maximum size 1MB; if the input source is greater than 4GB, video chunk maximum size 4MB. The system can be configured to create different chunk sizes based on complex criteria, such as the first N chunks at a smaller size, eg. 1MB, the next M chunks at 4MB each, and the balance at 16MB each, etc. The system can be configured to create different chunk sizes based on whether the chunk distribution is real-time.
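As an illustration only, the example chunk-size rules above could be expressed as a small policy function; the 1MB/4MB values and the 4GB input-size cut-off follow the examples in the text, while the function itself and its parameters are assumptions made for this sketch.

```python
MB = 1024 * 1024
GB = 1024 * MB


def chunk_size_for_stream(is_video: bool, input_source_size: int) -> int:
    """Maximum chunk size for a stream, following the example configuration in [D4.5]."""
    if not is_video:
        return 1 * MB                 # audio chunk maximum size
    # video chunk maximum size depends on the size of the input source
    return 1 * MB if input_source_size < 4 * GB else 4 * MB
```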
[D4.6] Method of packaging data with single metadata file
The method comprises the following steps:
(a) From the given data packet collection, identify the following: PTSSize, flagsSize, durationSize, and dataLenSize. Where PTSSize, flagsSize, durationSize, and dataLenSize are the minimum sizes (eg. 1, 2, 4, or 8 bytes) required to store the maximum PTS, Flags, Presentation Duration, and Data/Payload Length values respectively.
(b) Read the data packet collection in FIFO order and the Presentation Time Stamp, Flags, Presentation Duration, Data/Payload Length, and Data/Payload of a data packet are written into a chunk as one unit, or as a record, in binary form. Repeat this step for all data packets in the given data packet collection. Said binary records are written to the chunk one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
The chunkLength is the length of the chunk in bytes. The shaHash is the digest of a cryptographic hash function with the generated chunk as the input message. The SHA256 cryptographic hash function is preferred.
FIG. 03 depicts this process diagrammatically.
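As an illustration only, steps (a) to (c) above might be sketched as follows in Python, reusing the DataPacket structure sketched under [D4.3]. The big-endian fixed-width integer fields and the file handling are assumptions made for this sketch, since the text does not specify a byte order.

```python
import hashlib


def _min_size(max_value: int) -> int:
    """Smallest of 1, 2, 4 or 8 bytes able to store max_value (PTSSize, flagsSize, ...)."""
    for size in (1, 2, 4, 8):
        if max_value < 1 << (8 * size):
            return size
    raise ValueError("value too large for an 8-byte field")


def package_single_metadata(packets, chunk_id, chunk_path, master_metadata):
    # (a) minimum field sizes over the given data packet collection
    pts_size = _min_size(max(p.pts or 0 for p in packets))
    flags_size = _min_size(max(p.flags for p in packets))
    duration_size = _min_size(max(p.duration for p in packets))
    datalen_size = _min_size(max(p.data_length for p in packets))

    # (b) one binary record per packet, written to the chunk one after the other
    records = bytearray()
    for p in packets:
        records += (p.pts or 0).to_bytes(pts_size, "big")
        records += p.flags.to_bytes(flags_size, "big")
        records += p.duration.to_bytes(duration_size, "big")
        records += p.data_length.to_bytes(datalen_size, "big")
        records += p.payload
    with open(chunk_path, "wb") as chunk:
        chunk.write(records)

    # (c) one entry per chunk in the master metadata file
    sha_hash = hashlib.sha256(bytes(records)).hexdigest()
    master_metadata.write(
        f"{chunk_id}:{pts_size}:{flags_size}:{duration_size}:"
        f"{datalen_size}:{len(records)}:{sha_hash}\n")
```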
[D4.7] Method of packaging data with multiple metadata files
The method comprises the following steps:
(a) Read the given data packet collection in FIFO order and the Data/Payload of a data packet is written, in binary form, into a chunk. Said binary Data/Payload of a data packet is written to the chunk one after the other for each data packet.
(b) The Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of the data packet are written in text form into a metadata file as one unit, or as a record; values are converted into their textual representation, separated by a separator (eg. a colon), and records are separated by an end of record marker (eg. Carriage Return, Newline). Said Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of a data packet are written to the metadata file one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
metadataFileName:metadataFileLength:shaHashMetaFile:chunkID:chunkLength:shaHashChunk
The chunkLength is the length of the chunk in bytes. The shaHashMetaFile and shaHashChunk are the digest of a cryptographic hash function with the generated metadata file and chunk respectively, as the input message. The SHA256 cryptographic hash function is preferred.
FIG. 04 depicts this process diagrammatically.
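A corresponding illustrative sketch of steps (a) to (c) of this method, in Python and with the same DataPacket structure; the colon separator and newline record marker follow the examples given in step (b), while the file handling is an assumption.

```python
import hashlib


def package_multiple_metadata(packets, chunk_id, chunk_path, meta_path, master_metadata):
    # (a) only the Data/Payload of each packet goes into the chunk, in binary form
    chunk_bytes = b"".join(p.payload for p in packets)
    with open(chunk_path, "wb") as chunk:
        chunk.write(chunk_bytes)

    # (b) per-packet metadata as text records: colon-separated values, one record per line
    meta_text = "".join(
        f"{p.pts or 0}:{p.flags}:{p.duration}:{p.data_length}\n" for p in packets)
    with open(meta_path, "w") as meta:
        meta.write(meta_text)

    # (c) one entry per chunk in the master metadata file
    meta_bytes = meta_text.encode()
    master_metadata.write(
        f"{meta_path}:{len(meta_bytes)}:{hashlib.sha256(meta_bytes).hexdigest()}:"
        f"{chunk_id}:{len(chunk_bytes)}:{hashlib.sha256(chunk_bytes).hexdigest()}\n")
```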
[D4.8] Example of carrying out the innovation
A) Extract metadata from the container and record it to the master metadata file. The following information could be recorded: audio codec, audio bit rate, audio sample rate, number of audio channels, audio timebase, video width, video height, video aspect ratio, video codec, video timebase, video pixel format.
B) Read the input as per "[D4.2] Input reading"; segment the audio visual content as per "[D4.4] Segmentation Process"; repeat this step until an audio visual FIFO queue exceeds its chunk size. Refer "[D4.5] Chunk Size".
C) Package the FIFO queue that exceeded its chunk size as per either (1) [D4.6] Method of packaging data with single metadata file or (2) [D4.7] Method of packaging data with multiple metadata files.
D) Loop to step B) until end of input.
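As an illustration of step A) only, the sketch below writes container-level metadata into the master metadata file using the <AUDIOINFO> and <VIDEOINFO> sections of the sample format in [D4.11]; the info dictionaries and their keys are assumptions made for the example.

```python
def write_global_metadata(master, audio_info: dict, video_info: dict):
    """Record container-level metadata as the <AUDIOINFO> and <VIDEOINFO> sections."""
    master.write("<AUDIOINFO>\n")
    for key, value in audio_info.items():      # e.g. AudioCodecId, AudioBitRate, ...
        master.write(f"{key}={value}\n")
    master.write("</AUDIOINFO>\n<VIDEOINFO>\n")
    for key, value in video_info.items():      # e.g. VideoWidth, VideoHeight, ...
        master.write(f"{key}={value}\n")
    master.write("</VIDEOINFO>\n")
```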
[D4.9] Playback according to [D4.6] Method of packaging data with single metadata file
A player first reads the master metadata file and understands the metadata about the video file from <AUDIOINFO> and <VIDEOINFO>. It reads video chunks in the order given in <VIDEO_STREAM> and recreates video packets, reads audio chunks in the order given in <AUDIO_STREAM> and recreates audio packets, decodes audio and video data if necessary, and plays back according to the timing or synchronisation data as per packet PTS and Presentation Duration.
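As an illustration only, the sketch below recreates the per-packet records from one chunk described by a master metadata entry of the form chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash; it assumes the big-endian fixed-width layout used in the packaging sketch under [D4.6].

```python
def read_chunk_records(chunk_path, pts_size, flags_size, duration_size, datalen_size):
    """Yield (pts, flags, duration, payload) records from a chunk packaged as per [D4.6]."""
    with open(chunk_path, "rb") as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        pts = int.from_bytes(data[pos:pos + pts_size], "big"); pos += pts_size
        flags = int.from_bytes(data[pos:pos + flags_size], "big"); pos += flags_size
        duration = int.from_bytes(data[pos:pos + duration_size], "big"); pos += duration_size
        length = int.from_bytes(data[pos:pos + datalen_size], "big"); pos += datalen_size
        payload = data[pos:pos + length]; pos += length
        yield pts, flags, duration, payload
```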
[D4.10] Playback according to [D4.7] Method of packaging data with multiple metadata files
A player first reads the master metadata file and understands the metadata about the video file from <AUDIOINFO> and <VIDEOINFO>.
It reads entries in the order given in <VIDEO_STREAM>; next it reads the metadata file identified by metadataFileName; from each metadata file entry, it reads Data/Payload Length bytes from the chunk identified by the chunkID; it recreates the video data packet using the information from the metadata file entry and the data read from the chunk; it processes all entries of the metadata file sequentially, reading data from the chunk starting at the next byte after where the last read stopped, and recreates video data packets.
Similarly to video, audio data packets are recreated in the order given in <AUDIO_STREAM>.
Decode audio and video data if necessary and play back according to the timing or synchronisation data as per packet PTS and Presentation Duration.
For <AUDIOINFO>, <VIDEOINFO>, <AUDIO_STREAM>, and <VIDEO_STREAM>, refer to [D4.11].
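As an illustration only, the sketch below recreates packets from one chunk and its metadata file as described above; it assumes the colon-separated text records of the packaging sketch under [D4.7] and reads each payload sequentially from the chunk, continuing from where the previous read stopped.

```python
def read_packets_from_metadata(meta_path, chunk_path):
    """Recreate (pts, flags, duration, payload) packets from a chunk and its metadata file."""
    with open(chunk_path, "rb") as chunk, open(meta_path) as meta:
        for line in meta:
            pts, flags, duration, length = (int(v) for v in line.strip().split(":"))
            payload = chunk.read(length)   # continue from where the last read stopped
            yield pts, flags, duration, payload
```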
[D4.11] Sample format of the master metadata file according to [D4.6] Method of packaging data with single metadata file
<AUDIOINFO>
AudioCodecId=value
AudioBitRate=value
AudioSampleRate=value
AudioChannels=value
AudioTimeBase=value
</AUDIOINFO>
<VIDEOINFO>
VideoWidth=value
VideoHeight=value
AspectRatio=value
PixFmt=value
VideoTimeBase=value
VideoCodecId=value
</VIDEOINFO>
<VIDEO_STREAM>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM>
<AUDIO_STREAM>
audio_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
audio_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</AUDIO_STREAM>
Note, some codecs may require additional data at container-level and/or at packet-level.
Further, 3D videos require two video stream sections in the master metadata file, for left and right views, as follows:
<VIDEO_STREAM_LEFT>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM_LEFT>
<VIDEO_STREAM_RIGHT>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM_RIGHT>
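As an illustration only, a master metadata file in this sample format could be read with a small helper such as the following; it assumes the section tags appear on their own lines, as shown above, and works equally for <AUDIO_STREAM>, <VIDEO_STREAM> and the left/right 3D sections.

```python
def read_stream_section(lines, section):
    """Return the chunk entry lines between <SECTION> and </SECTION>, in order."""
    entries, inside = [], False
    for line in lines:
        line = line.strip()
        if line == f"<{section}>":
            inside = True
        elif line == f"</{section}>":
            inside = False
        elif inside and line:
            entries.append(line)   # e.g. video_chunkID:PTSSize:...:shaHash
    return entries


# Usage sketch (file name assumed):
# with open("master.meta") as f:
#     lines = f.readlines()
# video_entries = read_stream_section(lines, "VIDEO_STREAM")
# audio_entries = read_stream_section(lines, "AUDIO_STREAM")
```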
The description, taken with the drawings, makes apparent to those skilled in the art how the several forms of the invention may be carried out in practice.
[D4.12] Definitions
Presentation Time Stamp: The Presentation Time Stamp is the time at which the decompressed data packet will be presented to the user. The Presentation Time Stamp is an integer; the size of a Presentation Time Stamp value in a generated chunk is the same throughout that chunk, and PTSSize is the minimum size required to record the maximum Presentation Time Stamp value in that chunk.
Flags: The Flags field indicates whether the data packet read represents a key frame or i-frame, whether it is a corrupted packet, etc. The Flags field is an integer; the size of a Flags value in a generated chunk is the same throughout that chunk, and flagsSize is the minimum size required to record the maximum Flags value in that chunk.
Presentation Duration: The Presentation Duration is an integer; the size of a Presentation Duration value in a generated chunk is the same throughout that chunk, and durationSize is the minimum size required to record the maximum Presentation Duration value in that chunk.
Data or Payload Length is an integer; the size of a Data Length value in a generated chunk is the same throughout that chunk, and dataLenSize is the minimum size required to record the maximum Data Length value in that file.
Data or Payload is a stream of bytes. Its length is given in Data Length.
chunkID is the identification given to a generated chunk.
chunkLength is the length of the generated chunk in bytes.
shaHash and shaHashChunk are the digests of a cryptographic hash function with the generated chunk as the input message. The SHA256 cryptographic hash function is preferred.
shaHashMetaFile is the digest of a cryptographic hash function with the generated metadata file as the input message.
Audio timebase is 1/audio frame rate. Audio frame rate is the number of audio frames per second.
Video timebase is 1/video frame rate. Video frame rate is the number of video frames per second.
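For example (illustrative figures only, not from the disclosure): with a video frame rate of 25 frames per second, the video timebase is 1/25 second, so a Presentation Duration of 1 video timebase unit corresponds to 40 ms and a PTS of 250 units corresponds to 10 seconds into the presentation.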
Video pixel format represents the colour and intensity information of the video images of the input source. Examples: planar YUV 4:2:0 12bpp, packed YUV 4:2:2 16bpp, packed RGB 8:8:8 24bpp, etc.

Claims

1. A content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata as audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery.
2. The method according to Claim 1, further comprising a method of segmenting audio visual data and two methods of packaging audio visual data: (1) A method of packaging data with single metadata file, (2) A method of packaging data with multiple metadata files.
3. A method of segmenting audio visual data according to Claim 2:
The system creates First-In-First-Out (FIFO) queues, or any other structures that can be read later in FIFO order, to the number of audio visual streams available in the input, unless the user requests a selected number of streams; the system reads the next data packet from the input and places it in the relevant FIFO queue/structure; the system repeats the data packet reading until a FIFO queue/structure exceeds its applicable chunk size; once a FIFO queue/structure exceeds its applicable chunk size, that FIFO queue/structure is then written to a new chunk according to one of the packaging methods as per Claim 2. Once the data packet reading from the input is completed, all non-empty FIFO queues/structures are packaged into new chunks according to the packaging method selected.
4. A method of packaging data with single metadata file according to Claim 2:
The method comprises the following steps:
(a) From the given data packet collection, identify the following: PTSSize, flagsSize, durationSize, and dataLenSize. Where PTSSize, flagsSize, durationSize, and dataLenSize are the minimum sizes (eg. 1, 2, 4, or 8 bytes) required to store the maximum PTS, Flags, Presentation Duration, and Data/Payload Length values respectively.
(b) Read the data packet collection in FIFO order and the Presentation Time Stamp, Flags, Presentation Duration, Data/Payload Length, and Data/Payload of a data packet are written into a chunk as one unit, or as a record, in binary form. Repeat this step for all data packets in the selected data packet collection. Said binary records are written to the chunk one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
5. A method of packaging data with multiple metadata files according to Claim 2:
The method comprises the following steps:
(a) Read the given data packet collection in FIFO order and the Data/Payload of a data packet is written, in binary form, into a chunk. Said binary Data/Payload of a data packet is written to the chunk one after the other for each data packet.
(b) The Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of the data packet are written in text form into a metadata file as one unit, or as a record; values are converted into their textual representation, separated by a separator (eg. a colon), and records are separated by an end of record marker (eg. Carriage Return, Newline). Said Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of a data packet are written to the metadata file one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
metadataFileName:metadataFileLength:shaHashMetaFile:chunkID:chunkLength:shaHashChunk
6. The input sources to the system according to Claim 1, wherein:
The input to the system can be from one or more input sources. Input sources can be (1) files, (2) transport streams, (3) raw audio data read from an audio capturing device/system and/or raw video data read from a video capturing device/system, or (4) any combination of (1), (2) and (3).
7. The audio, video and subtitle data according to Claim 1, wherein:
The audio, video and subtitle data according to Claim 1 are either compressed or raw data. They are (1) extracted, when reading from a container, in terms of data packets suitable for the purpose together with timing and other information related to the said data packet, or (2) when reading raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, prepared as data packets suitable for the purpose with timing and other information related to the said data packet.
8. The data packets according to Claim 7, wherein:
A data packet includes one video frame per data packet. A data packet includes multiple audio frames per data packet if audio frames are fixed size, or one audio frame per data packet if audio frames are variable size. A data packet may not always contain a valid frame or frames. It should also contain data in between valid frames. The objective is to provide more data for the decoder.
9. The video frames and audio frames according to Claim 8, wherein:
The codecs being used for audio and video define what a frame or an access unit is.
10. The metadata according to Claim 1, wherein:
The metadata are audio codec, audio bit rate, audio sample rate, number of audio channels, audio timebase, video width, video height, video aspect ratio, video codec, video timebase, video pixel format, presentation time stamps, presentation durations, information about how to read chunks created, etc.
11. The chunks according to Claim 1, wherein:
A chunk is a part or a piece of an input file or an input source, produced by the system. Some chunks may contain more than one part or piece of an input file or an input source. Chunks output by the system may be kept as separate individual files or be packed into a single large file.
Further, a chunk itself cannot be read and interpreted without referring to a metadata file produced by the system, whereas metadata files stand on their own and can be read and interpreted without the aid of another file.
Chunks and their order are identified with the help of the master metadata file.
12. The chunk size according to Claim 3, wherein:
The system can be configured to have different chunk sizes per data packet queue; the system can be configured to create different chunk sizes for audio and video; the system can be configured to create different chunk sizes based on the size of the input source; the system can be configured to create different chunk sizes based on complex criteria; the system can be configured to create different chunk sizes based on whether the chunk distribution is real-time.
13. The metadata files according to Claim 1, wherein:
Global metadata are recorded into a metadata file called the master metadata file. Global metadata do not include per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc.
As per the Claim 4 (A method of packaging data with single metadata file), per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc. are recorded alongside with the data itself in chunks.
As per the Claim 5 (A method of packaging data with multiple metadata files), per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc. are recorded separately into a metadata file specific to the relevant audio, video or subtitle chunk. Therefore, as per the Claim 5 (A method of packaging data with multiple metadata files), one or more metadata files get generated in addition to the master metadata file.
14. The system according to Claim 1, wherein:
The system is a computer software programme which implements the method claimed in Claim 1, that is, content-aware segmenting of the audio visual input source and separately packaging audio, video, subtitle and metadata.
15. The final delivery according to Claim 1, wherein:
Ready to be consumed or played by a compatible audio visual player.
PCT/IB2012/054532 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata WO2014033504A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Publications (1)

Publication Number Publication Date
WO2014033504A1 true WO2014033504A1 (en) 2014-03-06

Family

ID=46963987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Country Status (1)

Country Link
WO (1) WO2014033504A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110208829A1 (en) * 2010-02-23 2011-08-25 Samsung Electronics Co., Ltd. Method and apparatus for transmitting and receiving data
EP2393084A1 (en) * 2010-06-02 2011-12-07 Funai Electric Co., Ltd. Apparatus for playing AVI (Audio Visual Interleaving) files
US20120013746A1 (en) * 2010-07-15 2012-01-19 Qualcomm Incorporated Signaling data for multiplexing video components
WO2012114107A2 (en) * 2011-02-25 2012-08-30 British Sky Broadcasting Limited Media system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
THOMAS SCHIERL ET AL: "Transport and Storage Systems for 3-D Video Using MPEG-2 Systems, RTP, and ISO File Format", PROCEEDINGS OF THE IEEE, IEEE. NEW YORK, US, vol. 99, no. 4, 1 April 2011 (2011-04-01), pages 671 - 683, XP011363622, ISSN: 0018-9219, DOI: 10.1109/JPROC.2010.2091370 *
THOMAS STOCKHAMMER: "Dynamic Adaptive Streaming over HTTP Design Principles and Standards", 22 January 2011 (2011-01-22), pages 1 - 3, XP007916969, Retrieved from the Internet <URL:http://www.w3.org/2010/11/web-and-tv/papers.html> [retrieved on 20110202] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108055574A (en) * 2017-11-29 2018-05-18 上海网达软件股份有限公司 Media file transcoding generates the method and system of multitone rail multi-subtitle on-demand content
CN112261377A (en) * 2020-10-23 2021-01-22 青岛以萨数据技术有限公司 Web version monitoring video playing method, electronic equipment and storage medium
CN112261377B (en) * 2020-10-23 2023-07-04 青岛以萨数据技术有限公司 Web edition monitoring video playing method, electronic equipment and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12766716

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12766716

Country of ref document: EP

Kind code of ref document: A1