WO2014033504A1 - Method and system for segmenting and separately package audio, video, subtitle and metadata - Google Patents

Method and system for segmenting and separately package audio, video, subtitle and metadata Download PDF

Info

Publication number
WO2014033504A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
audio
video
metadata
data packet
Prior art date
Application number
PCT/IB2012/054532
Other languages
French (fr)
Inventor
Sagara WIJETUNGA
Original Assignee
Wijetunga Sagara
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wijetunga Sagara filed Critical Wijetunga Sagara
Priority to PCT/IB2012/054532 priority Critical patent/WO2014033504A1/en
Publication of WO2014033504A1 publication Critical patent/WO2014033504A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments


Abstract

A content-aware method of segmenting and packaging audio, video, subtitle and metadata is disclosed here. The method takes input from one or more input sources where the audio, video, subtitle and metadata are encapsulated into one or more container formats, or reads raw audio data from an audio capturing device/system such as a microphone and/or raw video data from a video capturing device/system such as an image sensor or a video camera system. The method extracts the audio, video and subtitle data together with their timing information from the container or, when reading raw audio and/or video data from a capturing device/system, optionally encodes the audio and/or video data, and packages them separately: audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery. Chunks output by the method may be kept as separate individual files or be packed into a single large file.

Description

METHOD AND SYSTEM FOR SEGMENTING AND SEPARATELY
PACKAGE AUDIO, VIDEO, SUBTITLE AND METADATA
DESCRIPTION
The invention is described here, by way of example only, with reference to the accompanying drawings. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention.
[Dl] FIELD AND BACKGROUND OF THE INVENTION
This invention is in the fields of content segmentation and packaging of audio visual data. Its method of segmentation is content-aware segmentation. It packages content in terms of chunks, in final form, that is, ready to be consumed or played by a compatible audio visual player.
The present method of packaging audio, video and subtitle data is to either (1) encapsulate the audio, video and subtitle data together with timing or synchronisation data into a container format such as a VOB (Video Object), AVI (Audio Video Interleaved, the standard Microsoft Windows container), MPEG program stream, MP4 (standard audio and video container for the MPEG-4 multimedia portfolio), MKV (Matroska), WebM, etc., or (2) encapsulate the audio, video and subtitle data together with timing or synchronisation data into a transport stream such as an MPEG transport stream (MPEG-TS, MTS or TS), MPEG-2 Transport Stream (M2TS), etc. Note, not all container formats support subtitles. Refer FIG. 01.
The present method of packaging audio, video and subtitle data normally results in a single file and, depending on the resolution, duration, etc., generates anything from a small file (eg. 5MB, 10MB, etc.) to a very large file (eg. 5GB, 10GB, etc.) in the case of high definition video content. The file can be even larger for 3D high definition video content with multiple left and right video streams and multiple high quality audio channels.
The invention disclosed here can generate thousands of chunks for one hour of high definition video.
[D2] SUMMARY OF THE INVENTION
The claimed invention (i.e. Claim 1) is a content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata as audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery.
The content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata has three (3) broad functions: (1) Input reading, (2)
Segmentation Process and (3) Packaging audio visual data into chunks.
The above (1) Input reading: Input source is read in terms of data packets as per "[D4.3] Input source reading rules". A data packet contains: (a) one or more audio frames, or (b) one video frame, or (c) other data in between valid frames depending on the codecs used. The system does not decode or uncompress the audio visual data. As for audio frames and video frames, the codecs being used for audio and video define what a frame or an access unit is.
The above (2) Segmentation Process: The segmentation is content-aware. The data packets read are separated according to their streams. Data packets are queued to First-In-First-Out (FIFO) queues (or to any other structure where they can be read later in FIFO order). When a FIFO queue/structure exceeds its relevant chunk size, the system triggers the packaging. Refer FIG. 02.
The above (3) Packaging audio visual data into chunks: The packaging is done as per the Claim 4 (A method of packaging data with single metadata file), Refer FIG. 03 or as per the Claim 5 (A method of packaging data with multiple metadata files), Refer FIG. 04. One of these packaging methods could be used to package data.
As per the Claim 4 (A method of packaging data with single metadata file), the system generates only one metadata file, known as the master metadata file. Refer FIG. 03.
As per the Claim 5 (A method of packaging data with multiple metadata files), the system generates multiple metadata files. The master metadata file will always get generated. Refer FIG. 04.
A suitable player which is aware of the data packaging protocol, that is, whether the audio visual data are packaged as per the Claim 4 (A method of packaging data with single metadata file) or as per the Claim 5 (A method of packaging data with multiple metadata files), can reconstruct audio visual data with reference to the master metadata file and play in synchronisation.
Because the innovation disclosed segments audio visual content into smaller chunks using a content-aware segmentation method, it is useful in content sharing over peer-to-peer (p2p) networks.
[D3] BRIEF DESCRIPTION OF THE DRAWINGS
With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention, and are presented in the context of the most useful and readily understood description of the principles and conceptual aspects of the invention.
FIG. 1 is a diagrammatic representation of the present audio visual data packaging methods. It is provided to visually show the difference between the invention disclosed here and existing methods.
FIG. 2 is a diagrammatic representation of the Method of segmenting audio visual data.
FIG. 3 is a diagrammatic representation of packaging audio visual data with single metadata file.
FIG. 4 is a diagrammatic representation of packaging audio visual data with multiple metadata files.
[D4] DETAILS OF CARRYING OUT THE INNOVATION
[D4.1] Input to the system
The input to the system can be from one or more input sources. Input sources can be (1) files, (2) transport streams, (3) raw audio data read from an audio capturing device/system and/or raw video data read from a video capturing device/system, or (4) any combination of (1), (2) and (3).
Example: (a) It can be a single file where audio, video, subtitle and metadata are encapsulated into a container format such as AVI (Audio Video Interleaved, the standard Microsoft Windows container), MKV (Matroska), MP4 (standard audio and video container for the MPEG-4 multimedia portfolio), etc., (b) Transport streams, such as MPEG-2 transport stream (a.k.a. M2TS), etc., where audio, video, subtitle and metadata are encapsulated into a particular container format, (c) Direct reading from a microphone and/or image sensor or video camera, etc.
[D4.2] Input reading
Input source is read in terms of data packets as per "[D4.3] Input source reading rules".
The audio, video and subtitle data are either compressed or raw data. They are (1) extracted, when reading from a container, in terms of data packets suitable for the purpose, together with timing and other information related to the said data packet, or (2) when reading raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, prepared as data packets suitable for the purpose with timing and other information related to the said data packet.
The system does not decode the data before packaging. The system may optionally encode or compress audio visual data, when it reads raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, before packaging.
Data packets can be presented to the content segmentation function of the system in encoded order (i.e. bitstream order) or in display order. The display order is preferred. The system may be configured to adjust/calculate the presentation time stamps and presentation durations where necessary.
[D4.3] Input source reading rules
Input source reading is done by a function as per the following rules:
1. Reads the next data packet of a stream. Streams are audio, video and subtitle streams.
2. The data packet has a defined data structure. PTS (Presentation Time Stamp), DTS (Decompression Time Stamp), Data/Payload (a pointer to the data/payload), Data Length, stream_index (ie. which stream, eg. audio, video or subtitle), Flags, and Presentation Duration (the duration of this packet) are essential fields. These fields are defined under [D4.12] Definitions; an illustrative sketch of this structure follows this list.
3. The returned data packet may not always contain a valid frame or frames. It should also return data in between valid frames. The objective is to provide more data for the decoder.
4. For video, the data packet contains exactly one video frame.
5. For audio, the data packet contains multiple audio frames if audio frames are fixed size, or one audio frame per data packet if audio frames are variable size.
6. Returns data packets in display order.
7. Video PTS, Video DTS and Presentation Duration are in video timebase units. Audio PTS and Audio DTS are in audio timebase units. 0 if unknown. The video timebase unit and audio timebase unit are defined under [D4.12] Definitions.
8. PTS, DTS and Presentation Duration values will be calculated or adjusted if necessary, and guessed if the container cannot provide them.
9. PTS can be NOPTS_VALUE (ie. an invalid value for PTS) if the container has B-frames; in such a case the DTS will be set to the correct Presentation Time Stamp value, otherwise PTS and DTS will be set to equal values.
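As an illustration only, a minimal sketch of such a data packet structure is given below, in Python. The field names mirror the essential fields of rule 2 (PTS, DTS, Data/Payload, Data Length, stream_index, Flags, Presentation Duration); representing the NOPTS_VALUE sentinel of rule 9 by None is an assumption made here, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

NOPTS_VALUE = None  # assumed sentinel for "no valid PTS" (see rule 9)


@dataclass
class DataPacket:
    """One data packet as returned by the input reading function (rules 1-9)."""
    stream_index: int      # which stream: audio, video or subtitle
    pts: Optional[int]     # Presentation Time Stamp, in the stream's timebase units
    dts: Optional[int]     # Decompression Time Stamp, in the stream's timebase units
    flags: int             # key frame / i-frame / corrupted packet indicators, etc.
    duration: int          # Presentation Duration of this packet
    payload: bytes         # Data/Payload; its length is the Data Length

    @property
    def data_length(self) -> int:
        return len(self.payload)
```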
[D4.4] Segmentation Process
The segmentation process comprises the following steps:
(a) The system creates First-In-First-Out (FIFO) queues, or any other structures that can be read later in FIFO order, to the number of audio visual streams available in the input, unless the user requests a selected number of streams.
(b) The system reads the next data packet from the input and places it to the relevant FIFO queue/structure.
(c) The system repeats the data packet reading until a FIFO queue/structure exceeds the applicable chunk size.
(d) Once a FIFO queue/structure exceeds its applicable chunk size, that FIFO queue/structure is then written to a new chunk according to one of the packaging methods.
(e) Once the data packet reading from the input is completed, all non-empty FIFO queues/structures are packaged into new chunks according to the packaging method selected.
FIG. 02 depicts this process diagrammatically.
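As an illustration only, steps (a) to (e) above could be sketched roughly as the following loop, in Python, reusing the DataPacket structure sketched earlier. The names read_next_packet, chunk_size_for and package are hypothetical placeholders standing in for the input reading function of [D4.3], the chunk-size configuration of [D4.5] and one of the packaging methods of [D4.6]/[D4.7].

```python
from collections import deque


def segment(read_next_packet, stream_indexes, chunk_size_for, package):
    """Content-aware segmentation: queue packets per stream, package a queue
    whenever it exceeds its applicable chunk size (steps (a)-(e) of [D4.4])."""
    # (a) one FIFO queue per selected audio/video/subtitle stream
    queues = {i: deque() for i in stream_indexes}
    sizes = {i: 0 for i in stream_indexes}        # bytes currently queued per stream

    while True:
        packet = read_next_packet()               # (b) read the next data packet
        if packet is None:                        # end of input
            break
        if packet.stream_index not in queues:     # stream not selected by the user
            continue
        queues[packet.stream_index].append(packet)
        sizes[packet.stream_index] += packet.data_length

        # (c)/(d) package the queue that exceeded its applicable chunk size
        if sizes[packet.stream_index] > chunk_size_for(packet.stream_index):
            package(packet.stream_index, list(queues[packet.stream_index]))
            queues[packet.stream_index].clear()
            sizes[packet.stream_index] = 0

    # (e) package all non-empty queues once reading from the input is completed
    for i, q in queues.items():
        if q:
            package(i, list(q))
```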
[D4.5] Chunk Size
The system can be configured to have different chunk sizes per data packet queue.
The system can be configured to differentiate between audio and video, eg. audio chunk maximum size 1MB, video chunk maximum size 4MB. The system can be configured to create different chunk sizes based on the size of the input source, eg. if the input source is less than 4GB, video chunk maximum size 1MB; if the input source is greater than 4GB, video chunk maximum size 4MB. The system can be configured to create different chunk sizes based on complex criteria, such as the first N chunks at a smaller size, eg. 1MB, the next M chunks at 4MB each, and the balance at 16MB each, etc. The system can be configured to create different chunk sizes based on whether the chunk distribution is real-time.
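As an illustration only, the example chunk-size rules above could be expressed as a small policy function; the 1MB/4MB values and the 4GB input-size cut-off follow the examples in the text, while the function itself and its parameters are assumptions made for this sketch.

```python
MB = 1024 * 1024
GB = 1024 * MB


def chunk_size_for_stream(is_video: bool, input_source_size: int) -> int:
    """Maximum chunk size for a stream, following the example configuration in [D4.5]."""
    if not is_video:
        return 1 * MB                 # audio chunk maximum size
    # video chunk maximum size depends on the size of the input source
    return 1 * MB if input_source_size < 4 * GB else 4 * MB
```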
[D4.6] Method of packaging data with single metadata file
The method comprises the following steps:
(a) From the given data packet collection, identify the following: PTSSize, flagsSize, durationSize, and dataLenSize. Where PTSSize, flagsSize, durationSize, and dataLenSize are the minimum sizes (eg. 1, 2, 4, or 8 bytes) required to store the maximum PTS, Flags, Presentation Duration, and Data/Payload Length values respectively.
(b) Read the data packet collection in FIFO order and the Presentation Time Stamp, Flags, Presentation Duration, Data/Payload Length, and Data/Payload of a data packet are written into a chunk as one unit, or as a record, in binary form. Repeat this step for all data packets in the given data packet collection. Said binary records are written to the chunk one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
The chunkLength is the length of the chunk in bytes. The shaHash is the digest of a cryptographic hash function with the generated chunk as the input message. The SHA256 cryptographic hash function is preferred.
FIG. 03 depicts this process diagrammatically.
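As an illustration only, steps (a) to (c) above might be sketched as follows in Python, reusing the DataPacket structure sketched under [D4.3]. The big-endian fixed-width integer fields and the file handling are assumptions made for this sketch, since the text does not specify a byte order.

```python
import hashlib


def _min_size(max_value: int) -> int:
    """Smallest of 1, 2, 4 or 8 bytes able to store max_value (PTSSize, flagsSize, ...)."""
    for size in (1, 2, 4, 8):
        if max_value < 1 << (8 * size):
            return size
    raise ValueError("value too large for an 8-byte field")


def package_single_metadata(packets, chunk_id, chunk_path, master_metadata):
    # (a) minimum field sizes over the given data packet collection
    pts_size = _min_size(max(p.pts or 0 for p in packets))
    flags_size = _min_size(max(p.flags for p in packets))
    duration_size = _min_size(max(p.duration for p in packets))
    datalen_size = _min_size(max(p.data_length for p in packets))

    # (b) one binary record per packet, written to the chunk one after the other
    records = bytearray()
    for p in packets:
        records += (p.pts or 0).to_bytes(pts_size, "big")
        records += p.flags.to_bytes(flags_size, "big")
        records += p.duration.to_bytes(duration_size, "big")
        records += p.data_length.to_bytes(datalen_size, "big")
        records += p.payload
    with open(chunk_path, "wb") as chunk:
        chunk.write(records)

    # (c) one entry per chunk in the master metadata file
    sha_hash = hashlib.sha256(bytes(records)).hexdigest()
    master_metadata.write(
        f"{chunk_id}:{pts_size}:{flags_size}:{duration_size}:"
        f"{datalen_size}:{len(records)}:{sha_hash}\n")
```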
[D4.7] Method of packaging data with multiple metadata files
The method comprises the following steps:
(a) Read the given data packet collection in FIFO order and the Data/Payload of a data packet is written, in binary form, into a chunk. Said binary Data/Payload of a data packet is written to the chunk one after the other for each data packet.
(b) The Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of the data packet are written in text form into a metadata file as one unit, or as a record; values are converted into their textual representation, separated by a separator (eg. a colon), and records are separated by an end of record marker (eg. Carriage Return, Newline). Said Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of a data packet are written to the metadata file one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
metadataFileName:metadataFileLength:shaHashMetaFile:chunkID:chunkLength:shaHashChunk
The chunkLength is the length of the chunk in bytes. The shaHashMetaFile and shaHashChunk are the digest of a cryptographic hash function with the generated metadata file and chunk respectively, as the input message. The SHA256 cryptographic hash function is preferred.
FIG. 04 depicts this process diagrammatically.
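A corresponding illustrative sketch of steps (a) to (c) of this method, in Python and with the same DataPacket structure; the colon separator and newline record marker follow the examples given in step (b), while the file handling is an assumption.

```python
import hashlib


def package_multiple_metadata(packets, chunk_id, chunk_path, meta_path, master_metadata):
    # (a) only the Data/Payload of each packet goes into the chunk, in binary form
    chunk_bytes = b"".join(p.payload for p in packets)
    with open(chunk_path, "wb") as chunk:
        chunk.write(chunk_bytes)

    # (b) per-packet metadata as text records: colon-separated values, one record per line
    meta_text = "".join(
        f"{p.pts or 0}:{p.flags}:{p.duration}:{p.data_length}\n" for p in packets)
    with open(meta_path, "w") as meta:
        meta.write(meta_text)

    # (c) one entry per chunk in the master metadata file
    meta_bytes = meta_text.encode()
    master_metadata.write(
        f"{meta_path}:{len(meta_bytes)}:{hashlib.sha256(meta_bytes).hexdigest()}:"
        f"{chunk_id}:{len(chunk_bytes)}:{hashlib.sha256(chunk_bytes).hexdigest()}\n")
```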
[D4.8] Example of carrying out the innovation
A) Extract metadata from the container and record it to the master metadata file. The following information could be recorded: audio codec, audio bit rate, audio sample rate, number of audio channels, audio timebase, video width, video height, video aspect ratio, video codec, video timebase, video pixel format.
B) Read the input as per "[D4.2] Input reading"; segment the audio visual content as per "[D4.4] Segmentation Process"; repeat this step until an audio visual FIFO queue exceeds its chunk size. Refer "[D4.5] Chunk Size".
C) Package the FIFO queue that exceeded its chunk size as per either (1) [D4.6] Method of packaging data with single metadata file or (2) [D4.7] Method of packaging data with multiple metadata files.
D) Loop to step B) until end of input.
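As an illustration of step A) only, the sketch below writes container-level metadata into the master metadata file using the <AUDIOINFO> and <VIDEOINFO> sections of the sample format in [D4.11]; the info dictionaries and their keys are assumptions made for the example.

```python
def write_global_metadata(master, audio_info: dict, video_info: dict):
    """Record container-level metadata as the <AUDIOINFO> and <VIDEOINFO> sections."""
    master.write("<AUDIOINFO>\n")
    for key, value in audio_info.items():      # e.g. AudioCodecId, AudioBitRate, ...
        master.write(f"{key}={value}\n")
    master.write("</AUDIOINFO>\n<VIDEOINFO>\n")
    for key, value in video_info.items():      # e.g. VideoWidth, VideoHeight, ...
        master.write(f"{key}={value}\n")
    master.write("</VIDEOINFO>\n")
```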
[D4.9] Playback according to [D4.6] Method of packaging data with single metadata file
A player first reads the master metadata file and understands the metadata about the video file from <AUDIOINFO> and <VIDEOINFO>. It reads video chunks in the order given in <VIDEO_STREAM> and recreates video packets, reads audio chunks in the order given in <AUDIO_STREAM> and recreates audio packets, decodes audio and video data if necessary, and plays back according to the timing or synchronisation data as per packet PTS and Presentation Duration.
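As an illustration only, the sketch below recreates the per-packet records from one chunk described by a master metadata entry of the form chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash; it assumes the big-endian fixed-width layout used in the packaging sketch under [D4.6].

```python
def read_chunk_records(chunk_path, pts_size, flags_size, duration_size, datalen_size):
    """Yield (pts, flags, duration, payload) records from a chunk packaged as per [D4.6]."""
    with open(chunk_path, "rb") as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        pts = int.from_bytes(data[pos:pos + pts_size], "big"); pos += pts_size
        flags = int.from_bytes(data[pos:pos + flags_size], "big"); pos += flags_size
        duration = int.from_bytes(data[pos:pos + duration_size], "big"); pos += duration_size
        length = int.from_bytes(data[pos:pos + datalen_size], "big"); pos += datalen_size
        payload = data[pos:pos + length]; pos += length
        yield pts, flags, duration, payload
```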
[D4.10] Playback according to [D4.7] Method of packaging data with multiple metadata files
A player first reads the master metadata file and understands the metadata about the video file from <AUDIOINFO> and <VIDEOINFO>.
It reads entries in the order given in <VIDEO_STREAM>; next it reads the metadata file identified by metadataFileName; from each metadata file entry, it reads Data/Payload Length bytes from the chunk identified by the chunkID; it recreates the video data packet using the information from the metadata file entry and the data read from the chunk; it processes all entries of the metadata file sequentially, reading data from the chunk starting at the next byte after where the last read stopped, and recreates video data packets.
Similarly to video, audio data packets are recreated in the order given in <AUDIO_STREAM>.
Decode audio and video data if necessary and play back according to the timing or synchronisation data as per packet PTS and Presentation Duration.
For <AUDIOINFO>, <VIDEOINFO>, <AUDIO_STREAM>, and <VIDEO_STREAM>, refer to [D4.11].
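As an illustration only, the sketch below recreates packets from one chunk and its metadata file as described above; it assumes the colon-separated text records of the packaging sketch under [D4.7] and reads each payload sequentially from the chunk, continuing from where the previous read stopped.

```python
def read_packets_from_metadata(meta_path, chunk_path):
    """Recreate (pts, flags, duration, payload) packets from a chunk and its metadata file."""
    with open(chunk_path, "rb") as chunk, open(meta_path) as meta:
        for line in meta:
            pts, flags, duration, length = (int(v) for v in line.strip().split(":"))
            payload = chunk.read(length)   # continue from where the last read stopped
            yield pts, flags, duration, payload
```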
[D4.11] Sample format of the master metadata file according to [D4.6] Method of packaging data with single metadata file
<AUDIOINFO>
AudioCodecId=value
AudioBitRate=value
AudioSampleRate=value
AudioChannels=value
AudioTimeBase=value
</AUDIOINFO>
<VIDEOINFO>
VideoWidth=value
VideoHeight=value
AspectRatio=value
PixFmt=value
VideoTimeBase=value
VideoCodecId=value
</VIDEOINFO>
<VIDEO_STREAM>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM>
<AUDIO_STREAM>
audio_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
audio_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</AUDIO_STREAM>
Note, some codecs may require additional data at container-level and/or at packet-level.
Further, 3D videos require two video stream sections in the master metadata file, for left and right views, as follows:
<VIDEO_STREAM_LEFT>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM_LEFT>
<VIDEO_STREAM_RIGHT>
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
video_chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
</VIDEO_STREAM_RIGHT>
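As an illustration only, a master metadata file in this sample format could be read with a small helper such as the following; it assumes the section tags appear on their own lines, as shown above, and works equally for <AUDIO_STREAM>, <VIDEO_STREAM> and the left/right 3D sections.

```python
def read_stream_section(lines, section):
    """Return the chunk entry lines between <SECTION> and </SECTION>, in order."""
    entries, inside = [], False
    for line in lines:
        line = line.strip()
        if line == f"<{section}>":
            inside = True
        elif line == f"</{section}>":
            inside = False
        elif inside and line:
            entries.append(line)   # e.g. video_chunkID:PTSSize:...:shaHash
    return entries


# Usage sketch (file name assumed):
# with open("master.meta") as f:
#     lines = f.readlines()
# video_entries = read_stream_section(lines, "VIDEO_STREAM")
# audio_entries = read_stream_section(lines, "AUDIO_STREAM")
```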
The description, taken with the drawings, makes apparent to those skilled in the art how the several forms of the invention may be carried out in practice.
[D4.12] Definitions
Presentation Time Stamp: The Presentation Time Stamp is the time at which the decompressed data packet will be presented to the user. The Presentation Time Stamp is an integer; the size of a Presentation Time Stamp value in a generated chunk is the same throughout that chunk, and PTSSize is the minimum size required to record the maximum Presentation Time Stamp value in that chunk.
Flags: The Flags field indicates whether the data packet read represents a key frame or i-frame, whether it is a corrupted packet, etc. The Flags field is an integer; the size of a Flags value in a generated chunk is the same throughout that chunk, and flagsSize is the minimum size required to record the maximum Flags value in that chunk.
Presentation Duration: The Presentation Duration is an integer; the size of a Presentation Duration value in a generated chunk is the same throughout that chunk, and durationSize is the minimum size required to record the maximum Presentation Duration value in that chunk.
Data or Payload Length is an integer; the size of a Data Length value in a generated chunk is the same throughout that chunk, and dataLenSize is the minimum size required to record the maximum Data Length value in that file.
Data or Payload is a stream of bytes. Its length is given in Data Length.
chunkID is the identification given to a generated chunk.
chunkLength is the length of the generated chunk in bytes.
shaHash and shaHashChunk are the digests of a cryptographic hash function with the generated chunk as the input message. The SHA256 cryptographic hash function is preferred.
shaHashMetaFile is the digest of a cryptographic hash function with the generated metadata file as the input message.
Audio timebase is 1/audio frame rate. Audio frame rate is the number of audio frames per second.
Video timebase is 1/video frame rate. Video frame rate is the number of video frames per second.
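For example (illustrative figures only, not from the disclosure): with a video frame rate of 25 frames per second, the video timebase is 1/25 second, so a Presentation Duration of 1 video timebase unit corresponds to 40 ms and a PTS of 250 units corresponds to 10 seconds into the presentation.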
Video pixel format represents the colour and intensity information of the video images of the input source. Examples: planar YUV 4:2:0 12bpp, packed YUV 4:2:2 16bpp, packed RGB 8:8:8 24bpp, etc.

Claims

1. A content-aware method of segmenting the audio visual input source to the system and separately packaging audio, video, subtitle and metadata as audio data into one or more audio chunks, video data into one or more video chunks, subtitle data into one or more subtitle chunks and metadata into one or more metadata files, ready and intended for final delivery.
2. The method according to Claim 1, further comprising a method of segmenting audio visual data and two methods of packaging audio visual data: (1) A method of packaging data with single metadata file, (2) A method of packaging data with multiple metadata files.
3. A method of segmenting audio visual data according to Claim 2:
The system creates First-In-First-Out (FIFO) queues, or any other structures that can be read later in FIFO order, to the number of audio visual streams available in the input, unless the user requests a selected number of streams; the system reads the next data packet from the input and places it in the relevant FIFO queue/structure; the system repeats the data packet reading until a FIFO queue/structure exceeds its applicable chunk size; once a FIFO queue/structure exceeds its applicable chunk size, that FIFO queue/structure is then written to a new chunk according to one of the packaging methods as per Claim 2. Once the data packet reading from the input is completed, all non-empty FIFO queues/structures are packaged into new chunks according to the packaging method selected.
4. A method of packaging data with single metadata file according to Claim 2:
The method comprises the following steps:
(a) From the given data packet collection, identify the following: PTSSize, flagsSize, durationSize, and dataLenSize. Where PTSSize, flagsSize, durationSize, and dataLenSize are the minimum sizes (eg. 1, 2, 4, or 8 bytes) required to store the maximum PTS, Flags, Presentation Duration, and Data/Payload Length values respectively.
(b) Read the data packet collection in FIFO order and the Presentation Time Stamp, Flags, Presentation Duration, Data/Payload Length, and Data/Payload of a data packet are written into a chunk as one unit, or as a record, in binary form. Repeat this step for all data packets in the selected data packet collection. Said binary records are written to the chunk one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
chunkID:PTSSize:flagsSize:durationSize:dataLenSize:chunkLength:shaHash
5. A method of packaging data with multiple metadata files according to Claim 2:
The method comprises the following steps:
(a) Read the given data packet collection in FIFO order and the Data/Payload of a data packet is written, in binary form, into a chunk. Said binary Data/Payload of a data packet is written to the chunk one after the other for each data packet.
(b) The Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of the data packet are written in text form into a metadata file as one unit, or as a record; values are converted into their textual representation, separated by a separator (eg. a colon), and records are separated by an end of record marker (eg. Carriage Return, Newline). Said Presentation Time Stamp, Flags, Presentation Duration, and Data/Payload Length of a data packet are written to the metadata file one after the other for each data packet.
(c) For each generated chunk, the following information is written to the master metadata file under its respective audio, video, or subtitle stream section:
metadataFileName:metadataFileLength:shaHashMetaFile:chunkID:chunkLength:shaHashChunk
6. The input sources to the system according to Claim 1, wherein:
The input to the system can be from one or more input sources. Input sources can be (1) files, (2) transport streams, (3) raw audio data read from an audio capturing device/system and/or raw video data read from a video capturing device/system, or (4) any combination of (1), (2) and (3).
7. The audio, video and subtitle data according to Claim 1, wherein:
The audio, video and subtitle data according to Claim 1 are either compressed or raw data. They are (1) extracted, when reading from a container, in terms of data packets suitable for the purpose together with timing and other information related to the said data packet, or (2) when reading raw audio data from an audio capturing device/system and/or raw video data from a video capturing device/system, prepared as data packets suitable for the purpose with timing and other information related to the said data packet.
8. The data packets according to Claim 7, wherein:
A data packet includes one video frame per data packet. A data packet includes multiple audio frames per data packet if audio frames are fixed size, or one audio frame per data packet if audio frames are variable size. A data packet may not always contain a valid frame or frames. It should also contain data in between valid frames. The objective is to provide more data for the decoder.
9. The video frames and audio frames according to Claim 8, wherein:
The codecs being used for audio and video define what a frame or an access unit is.
10. The metadata according to Claim 1, wherein:
The metadata are audio codec, audio bit rate, audio sample rate, number of audio channels, audio timebase, video width, video height, video aspect ratio, video codec, video timebase, video pixel format, presentation time stamps, presentation durations, information about how to read chunks created, etc.
11. The chunks according to Claim 1, wherein:
A chunk is a part or a piece of an input file or an input source, produced by the system. Some chunks may contain more than one part or piece of an input file or an input source. Chunks output by the system may be kept as separate individual files or be packed into a single large file.
Further, a chunk itself cannot be read and interpreted without referring to a metadata file produced by the system, whereas metadata files stand on their own and can be read and interpreted without the aid of another file.
Chunks and their order are identified with the help of the master metadata file.
12. The chunk size according to Claim 3, wherein:
The system can be configured to have different chunk sizes per data packet queue; the system can be configured to create different chunk sizes for audio and video; the system can be configured to create different chunk sizes based on the size of the input source; the system can be configured to create different chunk sizes based on complex criteria; the system can be configured to create different chunk sizes based on whether the chunk distribution is real-time.
13. The metadata files according to Claim 1, wherein:
Global metadata are recorded into a metadata file called the master metadata file. Global metadata do not include per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc.
As per the Claim 4 (A method of packaging data with single metadata file), per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc. are recorded alongside with the data itself in chunks.
As per the Claim 5 (A method of packaging data with multiple metadata files), per data packet metadata such as Presentation Time Stamps, Presentation Durations, various flags, etc. are recorded separately into a metadata file specific to the relevant audio, video or subtitle chunk. Therefore, as per the Claim 5 (A method of packaging data with multiple metadata files), one or more metadata files get generated in addition to the master metadata file.
14. The system according to Claim 1, wherein:
The system is a computer software programme which implements the method claimed in Claim 1, that is, content-aware segmenting of the audio visual input source and separately packaging audio, video, subtitle and metadata.
15. The final delivery according to Claim 1, wherein:
Ready to be consumed or played by a compatible audio visual player.
PCT/IB2012/054532 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata WO2014033504A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Publications (1)

Publication Number Publication Date
WO2014033504A1 true WO2014033504A1 (en) 2014-03-06

Family

ID=46963987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2012/054532 WO2014033504A1 (en) 2012-09-03 2012-09-03 Method and system for segmenting and separately package audio, video, subtitle and metadata

Country Status (1)

Country Link
WO (1) WO2014033504A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110208829A1 (en) * 2010-02-23 2011-08-25 Samsung Electronics Co., Ltd. Method and apparatus for transmitting and receiving data
EP2393084A1 (en) * 2010-06-02 2011-12-07 Funai Electric Co., Ltd. Apparatus for playing AVI (Audio Visual Interleaving) files
US20120013746A1 (en) * 2010-07-15 2012-01-19 Qualcomm Incorporated Signaling data for multiplexing video components
WO2012114107A2 (en) * 2011-02-25 2012-08-30 British Sky Broadcasting Limited Media system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
THOMAS SCHIERL ET AL: "Transport and Storage Systems for 3-D Video Using MPEG-2 Systems, RTP, and ISO File Format", PROCEEDINGS OF THE IEEE, IEEE. NEW YORK, US, vol. 99, no. 4, 1 April 2011 (2011-04-01), pages 671 - 683, XP011363622, ISSN: 0018-9219, DOI: 10.1109/JPROC.2010.2091370 *
THOMAS STOCKHAMMER: "Dynamic Adaptive Streaming over HTTP Design Principles and Standards", 22 January 2011 (2011-01-22), pages 1 - 3, XP007916969, Retrieved from the Internet <URL:http://www.w3.org/2010/11/web-and-tv/papers.html> [retrieved on 20110202] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108055574A (en) * 2017-11-29 2018-05-18 上海网达软件股份有限公司 Media file transcoding generates the method and system of multitone rail multi-subtitle on-demand content
CN112261377A (en) * 2020-10-23 2021-01-22 青岛以萨数据技术有限公司 Web version monitoring video playing method, electronic equipment and storage medium
CN112261377B (en) * 2020-10-23 2023-07-04 青岛以萨数据技术有限公司 Web edition monitoring video playing method, electronic equipment and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12766716

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12766716

Country of ref document: EP

Kind code of ref document: A1