JP5591932B2

JP5591932B2 - Media extractor track for file format track selection

Info

Publication number: JP5591932B2
Application number: JP2012529954A
Authority: JP
Inventors: Ying Chen; Marta Karczewicz
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 2009-09-22
Filing date: 2010-09-17
Publication date: 2014-09-17
Anticipated expiration: 2030-09-17
Also published as: CN102714715A; TWI458334B; KR101290467B1; CN102714715B; TW201119346A; JP2013505646A; KR20120116903A

Description

This application claims the benefit of U.S. Provisional Application No. 61/243,030, filed September 16, 2009; U.S. Provisional Application No. 61/244,827, filed September 22, 2009; U.S. Provisional Application No. 61/293,961, filed January 11, 2010; and U.S. Provisional Application No. 61/295,261, filed January 15, 2010, the entire contents of each of which are incorporated herein by reference.

This disclosure relates to the transport of encoded video data.

Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, video teleconferencing devices, and the like. Digital video devices implement video compression techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, or ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), and extensions of such standards, to transmit and receive digital video information more efficiently.

Video compression techniques perform spatial prediction and/or temporal prediction to reduce or remove redundancy inherent in video sequences. For block-based video coding, a video frame or slice may be partitioned into macroblocks. Each macroblock can be further partitioned. Macroblocks in an intra-coded (I) frame or slice are encoded using spatial prediction with respect to neighboring macroblocks. Macroblocks in an inter-coded (P or B) frame or slice may use spatial prediction with respect to neighboring macroblocks in the same frame or slice, or temporal prediction with respect to other reference frames.

After video data has been encoded, the video data may be packetized by a multiplexer for transmission or storage. MPEG-2 includes a “Systems” section that defines a transport level for many video encoding standards. MPEG-2 transport level systems may be used by MPEG-2 video encoders, or by other video encoders conforming to different video encoding standards. For example, MPEG-4 prescribes encoding and decoding methodologies that differ from those of MPEG-2, but video encoders implementing the techniques of the MPEG-4 standard may still use the MPEG-2 transport level methodologies. In general, references to “MPEG-2 systems” refer to the transport level of video data prescribed by MPEG-2. In this disclosure, the transport level prescribed by MPEG-2 is also referred to as an “MPEG-2 transport stream” or simply a “transport stream.” Likewise, the transport level of MPEG-2 systems also includes program streams. Transport streams and program streams generally are different formats for delivering similar data: a transport stream comprises one or more “programs,” each including both audio and video data, while a program stream includes a single program including both audio and video data.

Efforts are under way to develop new video coding standards based on H.264/AVC. One such standard is the scalable video coding (SVC) standard, which is the scalable extension to H.264/AVC. Another standard is multiview video coding (MVC), which becomes the multiview extension to H.264/AVC. The MPEG-2 Systems specification describes how compressed multimedia (video and audio) data streams may be multiplexed together with other data to form a single data stream suitable for digital transmission or storage. The latest specification of MPEG-2 Systems is specified in “Information Technology - Generic Coding of Moving Pictures and Associated Audio: Systems, Recommendation H.222.0; International Organisation for Standardisation, ISO/IEC JTC1/SC29/WG11; Coding of Moving Pictures and Associated Audio,” May 2006. MPEG recently designed the transport standard of MVC over MPEG-2 Systems, and the latest version of this specification is “Study of ISO/IEC 13818-1:2007/FPDAM4 Transport of MVC,” MPEG doc. N10572, MPEG of ISO/IEC JTC1/SC29/WG11, Maui, Hawaii, USA, April 2009.

The latest joint draft of MVC is described in JVT-AB204, “Joint Draft 8.0 on Multiview Video Coding,” 28th JVT meeting, Hannover, Germany, July 2008, available at http://wftp3.itu.int/av-arch/jvt-site/2008_07_Hannover/JVT-AB204.zip. A later version, integrated into the AVC standard, is described in JVT-AD007, “Editors' draft revision to ITU-T Rec. H.264 | ISO/IEC 14496-10 Advanced Video Coding - in preparation for ITU-T SG 16 AAP Consent (in integrated form),” 30th JVT meeting, Geneva, Switzerland, February 2009, available at http://wftp3.itu.int/av-arch/jvt-site/2009_01_Geneva/JVT-AD007.zip.

In general, this disclosure describes techniques for using media extractors in multi-track video data formats to form media extractor tracks. This disclosure modifies the International Organization for Standardization (ISO) base media format to make use of extractors that can reference one or more potentially non-contiguous network access layer (NAL) units. Such extractors may be present in any track of an ISO base media format file. This disclosure also describes a modification of the Third Generation Partnership Project (3GPP) file format to include a frame rate value as an attribute of the track selection box. With respect to the multiview video coding (MVC) extension of the ISO base media format, this disclosure further describes the use of extractors to support efficient extraction of MVC operating points.

In one example, a method for encoding video data includes: constructing, by a source video device, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit; constructing, by the source video device, a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous; including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and outputting the video file.

In another example, an apparatus for encoding video data includes: an encoder configured to encode video data; a multiplexer configured to construct, based on the encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit, to construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and to include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and an output interface configured to output the video file.

In another example, an apparatus for encoding video data includes: means for constructing, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit; means for constructing a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous; means for including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and means for outputting the video file.

In another example, a computer-readable storage medium comprises instructions that, when executed, cause a processor of a source device to: construct, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit; construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous; include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and output the video file.

In another example, a method for decoding video data includes: receiving, by a demultiplexer of a destination device, a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-contiguous; selecting the second track to be decoded; and sending the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the destination device.

In another example, an apparatus for decoding video data includes: a video decoder configured to decode video data; and a demultiplexer configured to receive a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-contiguous, to select the second track to be decoded, and to send the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to the video decoder.

In another example, an apparatus for decoding video data includes: means for receiving, by a demultiplexer of a destination device, a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-contiguous; means for selecting the second track to be decoded; and means for sending the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the destination device.

In another example, a computer-readable storage medium is encoded with instructions that, when executed, cause a processor of a destination device to: upon receiving a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, select a second track to be decoded, wherein the video file comprises a first track and the second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-contiguous; and send the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder.
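The encoding and decoding examples above can be sketched end to end in a few lines of Python. This is only an illustrative model, not the file format itself: the names `Sample`, `Extractor`, `VideoFile`, `build_file`, and `demultiplex` are hypothetical, NAL units are stood in for by short byte strings, and the policy of referencing the first and last NAL unit of each sample is an assumption chosen so that the two identified NAL units are non-contiguous.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Sample:
    """Hypothetical video sample: the NAL units of one access unit, in order."""
    nal_units: List[bytes]


@dataclass
class Extractor:
    """Hypothetical extractor: identifies NAL units inside another track's sample."""
    ref_track: int                # track that holds the referenced sample
    sample_index: int             # position of that sample in the referenced track
    nal_indices: Tuple[int, ...]  # identified NAL units; they need not be adjacent


@dataclass
class VideoFile:
    """Hypothetical container: track id -> list of samples or extractors."""
    tracks: Dict[int, list] = field(default_factory=dict)


def build_file(encoded_access_units: List[List[bytes]]) -> VideoFile:
    """Source side: track 1 carries the samples, track 2 carries extractors."""
    video_file = VideoFile()
    video_file.tracks[1] = [Sample(list(au)) for au in encoded_access_units]
    # Assumed policy for illustration: reference the first and last NAL unit.
    video_file.tracks[2] = [
        Extractor(ref_track=1, sample_index=i, nal_indices=(0, len(au) - 1))
        for i, au in enumerate(encoded_access_units)
    ]
    return video_file


def demultiplex(video_file: VideoFile, selected_track: int) -> List[bytes]:
    """Destination side: resolve the selected track into NAL units for a decoder."""
    to_decoder: List[bytes] = []
    for entry in video_file.tracks[selected_track]:
        if isinstance(entry, Extractor):
            sample = video_file.tracks[entry.ref_track][entry.sample_index]
            to_decoder.extend(sample.nal_units[i] for i in entry.nal_indices)
        else:
            to_decoder.extend(entry.nal_units)
    return to_decoder


access_units = [[b"a0", b"a1", b"a2"], [b"b0", b"b1", b"b2"]]
video_file = build_file(access_units)
to_decoder = demultiplex(video_file, selected_track=2)
```

Selecting track 2 forwards only the NAL units named by its extractors, while selecting track 1 forwards every NAL unit; the video data itself is stored once, in track 1.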

The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.

A block diagram illustrating an example system in which an audio/video (A/V) source device transports audio and video data to an A/V destination device.

A block diagram illustrating an example arrangement of components of a multiplexer.

A block diagram illustrating an example file that includes a first track having a set of video samples and a second track having extractors that reference a subset of the video samples of the first track.

A block diagram illustrating another example file that includes two separate extractor tracks.

A block diagram illustrating another example file that includes a subset track and two media extractor tracks.

Block diagrams (three figures) illustrating examples of media data boxes of files that include examples of media extractors for various media extractor tracks.

A conceptual diagram illustrating an example MVC prediction pattern.

Block diagrams (a series of figures) illustrating various examples of data structures for media extractors, and other supporting data structures, that may be used in accordance with the techniques of this disclosure.

A block diagram illustrating an example modified Third Generation Partnership Project (3GPP) track selection box that signals an additional attribute of the track selection box.

A flowchart illustrating an example method for using media extractors in accordance with the techniques of this disclosure.

The techniques of this disclosure are generally directed to enhancing the International Organization for Standardization (ISO) base media file format and extensions of the ISO base media file format. Extensions of the ISO base media file format include, for example, advanced video coding (AVC), scalable video coding (SVC), multiview video coding (MVC), and the Third Generation Partnership Project (3GPP) file format. In general, the techniques of this disclosure may be used to produce media extractor tracks in the ISO base media file format and/or in extensions of the ISO base media file format. As described in greater detail below, such media extractor tracks may, in some examples, be used to support adaptation in Hypertext Transfer Protocol (HTTP) video streaming. In some examples, media extractors form part of the ISO base media file format and/or extensions of the ISO base media file format (e.g., AVC, SVC, MVC, and 3GPP), and are used to extract whole samples of another track to form a new media extractor track.

These techniques may be used by MPEG-2 (Motion Picture Experts Group) systems, that is, systems that conform to MPEG-2 with respect to transport-level details. MPEG-4, for example, provides standards for video encoding, but it is generally assumed that video encoders conforming to the MPEG-4 standard will use MPEG-2 transport level systems. Accordingly, the techniques of this disclosure are applicable to video encoders that conform to MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, or any other video encoding standard that uses MPEG-2 transport streams and/or program streams.

The ISO base media file format provides for files that contain one or more tracks. The ISO base media file format standard defines a track as a timed sequence of related samples. The standard defines a sample as the data associated with a single timestamp, and gives as examples of a sample an individual frame of video, a series of video frames in decoding order, or a compressed section of audio in decoding order. Special tracks, referred to as hint tracks, do not contain media data, but instead contain instructions for packaging one or more tracks into a streaming channel. The ISO base media file format standard notes that, in a hint track, a sample defines the formation of one or more streaming packets.

The techniques of this disclosure provide for the creation of media extractor tracks. A media extractor track may generally include one or more extractors. The extractors in a media extractor track are used to identify and extract samples of another track. In this manner, a media extractor in a media extractor track may be thought of as a pointer that, when dereferenced, retrieves a sample from another track. Unlike the extractors of SVC, for example, the extractors of this disclosure can reference one or more potentially non-contiguous network access layer (NAL) units of another track. In accordance with the techniques of this disclosure, media extractor tracks, tracks that contain one or more media extractors, and other tracks that contain no media extractors may be grouped together to form an alternate group.
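The pointer analogy can be made concrete with a small Python sketch. It assumes, purely for illustration, that an extractor records a referenced track identifier plus a list of (offset, length) byte spans into that track's data; the class name `MediaExtractor`, its fields, and the `dereference` helper are hypothetical and do not reproduce the actual extractor syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MediaExtractor:
    """Hypothetical extractor modeled as a pointer into another track's data."""
    track_id: int                 # track being referenced
    spans: List[Tuple[int, int]]  # (offset, length) pairs; gaps between spans are allowed


def dereference(extractor: MediaExtractor, track_data: Dict[int, bytes]) -> bytes:
    """Follow the pointer: copy the referenced spans out of the other track."""
    data = track_data[extractor.track_id]
    return b"".join(data[offset:offset + length] for offset, length in extractor.spans)


# Track 1 holds three 4-byte stand-ins for NAL units: AAAA, BBBB, CCCC.
track_data = {1: b"AAAABBBBCCCC"}

# One extractor references the first and third units, skipping the second.
extractor = MediaExtractor(track_id=1, spans=[(0, 4), (8, 4)])
payload = dereference(extractor, track_data)
```

Here a single extractor yields `AAAA` and `CCCC` while skipping `BBBB`, which is the kind of non-contiguous reference the paragraph above contrasts with SVC extractors.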

This disclosure uses the term "consecutive" with respect to NAL units to describe two or more NAL units that occur one after another in the same track. That is, when two NAL units are consecutive, the last byte of data of one of the NAL units immediately precedes the first byte of data of the other NAL unit in the same track. Two NAL units of the same access unit are generally considered "non-consecutive" either when the two NAL units are separated by some amount of data within the same track, or when one NAL unit occurs in one track and the other NAL unit occurs in a different track. The techniques of this disclosure provide an extractor that may identify two or more non-consecutive NAL units of an access unit.
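The definition above may be expressed as a simple predicate. The following Python sketch assumes that each NAL unit is described by a track identifier, a byte offset, and a size; the type `NalLocation` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalLocation:
    track_id: int
    offset: int  # byte offset of the first byte of the NAL unit within its track
    size: int    # length of the NAL unit in bytes


def are_consecutive(first, second):
    """True when `second` starts at the byte right after `first` ends, in the same track."""
    return (first.track_id == second.track_id
            and first.offset + first.size == second.offset)
```

Two NAL units in different tracks, or separated by intervening data in one track, both fail the predicate and are therefore non-consecutive in the sense used here.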

Moreover, the extractors of this disclosure are not limited to SVC, but may be included generally in the ISO base media file format or in other extensions of the ISO base media file format, such as, for example, AVC, SVC, or MVC. The extractors of this disclosure may also be included in the Third Generation Partnership Project (3GPP) file format. This disclosure further provides for modifying the 3GPP file format to explicitly signal frame rate as an attribute of a track selection box.

Media extractor tracks may be used, for example, in the MVC file format to support extraction of operating points. A server device may provide various operating points in an MPEG-2 transport layer bitstream, each of which corresponds to a respective subset of particular views of multiview video coding video data. That is, an operating point generally corresponds to a subset of the views of a bitstream. In some examples, each view of an operating point includes video data at the same frame rate. In accordance with the techniques of this disclosure, an operating point may be represented using a media extractor track that includes one or more extractors that reference video data of other tracks, and potentially additional samples not included in the other tracks.

In this manner, each operating point may include only those NAL units needed to decode the operating point in order to output a subset of views having a common frame rate. The combination of the extractor tracks and the full representation of the MVC video may form a playlist of MVC representations. Use of the media extractor tracks of this disclosure may support operating point selection and switching, for example, for operating points whose differing bit rates result from temporal scalability.

The media extractor tracks of this disclosure may also be used to form alternate groups or switch groups. That is, in the ISO base media file format, tracks may be grouped together to form alternate groups. In the example of the ISO base media file format, the tracks of an alternate group form viable substitutes for one another, such that generally only one of the tracks of the alternate group is played or streamed at any time. A track of an alternate group should be distinguishable from the other tracks of the alternate group, for example, via attributes such as bit rate, codec, language, packet size, or other characteristics. The techniques of this disclosure provide for grouping media extractor tracks, tracks that include media extractors, and/or other ordinary video tracks to form an alternate group. In examples conforming to MVC, each track may correspond to a respective operating point. That is, each operating point in MVC may be represented by a particular one of the tracks, e.g., either a media extractor track or a track that does not include media extractors. One track of the same alternate group is typically selected for progressive download in order to adapt to the available bandwidth.
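The bandwidth-driven selection of one track from an alternate group may be sketched as follows. This Python example is illustrative only: the type `TrackInfo`, its fields, and the fallback policy (choosing the lowest bit rate when no track fits) are assumptions and are not prescribed by the ISO base media file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackInfo:
    track_id: int
    alternate_group: int
    bitrate_kbps: int


def select_track(tracks, alternate_group, available_kbps):
    """Pick the highest-bit-rate track of the group that fits the available bandwidth."""
    candidates = [t for t in tracks if t.alternate_group == alternate_group]
    if not candidates:
        raise ValueError("no tracks in the requested alternate group")
    fitting = [t for t in candidates if t.bitrate_kbps <= available_kbps]
    if fitting:
        return max(fitting, key=lambda t: t.bitrate_kbps)
    # Assumed fallback: nothing fits, so take the least demanding track.
    return min(candidates, key=lambda t: t.bitrate_kbps)
```

Because only one track of an alternate group is played at a time, the function returns a single track rather than a set of tracks.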

Similarly, media extractor tracks and other tracks may be grouped together to form switch groups in the 3GPP file format, and may be used for track selection in order to adapt to bandwidth and decoder capability in HTTP streaming applications. The 3GPP file format provides a definition of a switch group of tracks. The tracks in a switch group belong to the same alternate group. That is, according to the 3GPP file format, tracks in the same switch group are available for switching during a session, whereas tracks in different switch groups are not available for switching.
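The switching rule described above may be sketched as a predicate over two tracks. In this Python example the type `TrackInfo` is hypothetical, and treating a switch group value of 0 as "no switch group assigned" is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackInfo:
    track_id: int
    alternate_group: int
    switch_group: int  # 0 is treated here as "no switch group" (assumption)


def can_switch(current, candidate):
    """True when a session playing `current` may switch to `candidate`."""
    return (current.switch_group != 0
            and current.switch_group == candidate.switch_group
            and current.alternate_group == candidate.alternate_group)
```

A client might apply this check before switching tracks mid-session in response to a change in bandwidth.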

FIG. 1 is a block diagram illustrating an example system 10 in which an audio/video (A/V) source device 20 transports audio and video data to an A/V destination device 40. A/V source device 20 may also be referred to as a "source video device." System 10 of FIG. 1 may correspond to a video teleconferencing system, a server/client system, a broadcaster/receiver system, or any other system in which video data is sent from a source device, such as A/V source device 20, to a destination device, such as A/V destination device 40. A/V destination device 40 may also be referred to as a "destination video device" or a "client device." In some examples, A/V source device 20 and A/V destination device 40 may perform bidirectional information exchange. That is, A/V source device 20 and A/V destination device 40 may be capable of both encoding and decoding (and transmitting and receiving) audio and video data. In some examples, audio encoder 26 may comprise a voice encoder, also referred to as a vocoder.

A/V source device 20, in the example of FIG. 1, comprises audio source 22 and video source 24. Audio source 22 may comprise, for example, a microphone that produces electrical signals representative of captured audio data to be encoded by audio encoder 26. Alternatively, audio source 22 may comprise a storage medium storing previously recorded audio data, an audio data generator such as a computerized synthesizer, or any other source of audio data. Video source 24 may comprise a video camera that produces video data to be encoded by video encoder 28, a storage medium encoded with previously recorded video data, a video data generation unit, or any other source of video data.

Raw audio and video data may comprise analog or digital data. Analog data may be digitized before being encoded by audio encoder 26 and/or video encoder 28. Audio source 22 may obtain audio data from a speaking participant while the participant is speaking, and video source 24 may simultaneously obtain video data of the speaking participant. In other examples, audio source 22 may comprise a computer-readable storage medium comprising stored audio data, and video source 24 may comprise a computer-readable storage medium comprising stored video data. In this manner, the techniques described in this disclosure may be applied to live, streaming, real-time audio and video data, or to archived, pre-recorded audio and video data.

An audio frame that corresponds to a video frame is generally an audio frame containing audio data that was captured by audio source 22 contemporaneously with video data captured by video source 24 that is contained within the video frame. For example, while a speaking participant generally produces audio data by speaking, audio source 22 captures the audio data, and video source 24 captures video data of the speaking participant at the same time, that is, while audio source 22 is capturing the audio data. Hence, an audio frame may temporally correspond to one or more particular video frames. Accordingly, an audio frame corresponding to a video frame generally corresponds to a situation in which audio data and video data were captured at the same time, and for which the audio frame and the video frame comprise, respectively, the audio data and the video data that were captured at the same time.

In some examples, audio encoder 26 may encode a timestamp in each encoded audio frame that represents a time at which the audio data for the encoded audio frame was recorded, and similarly, video encoder 28 may encode a timestamp in each encoded video frame that represents a time at which the video data for the encoded video frame was recorded. In such examples, an audio frame corresponding to a video frame may comprise an audio frame comprising a timestamp and a video frame comprising the same timestamp. A/V source device 20 may include an internal clock from which audio encoder 26 and/or video encoder 28 may generate the timestamps, or which audio source 22 and video source 24 may use to associate audio and video data, respectively, with a timestamp.
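The timestamp-based correspondence described above may be sketched as follows. This Python example assumes that frames are modeled as `(timestamp, payload)` pairs; the function name `pair_by_timestamp` is hypothetical.

```python
def pair_by_timestamp(audio_frames, video_frames):
    """Match each audio frame with the video frames carrying the same timestamp.

    Both arguments are iterables of (timestamp, payload) pairs. The result is a
    list of (timestamp, audio_payload, [video_payloads]) tuples in audio order.
    """
    video_by_ts = {}
    for ts, payload in video_frames:
        video_by_ts.setdefault(ts, []).append(payload)
    return [(ts, audio, video_by_ts.get(ts, [])) for ts, audio in audio_frames]
```

An audio frame may map to more than one video frame, which is consistent with an audio frame temporally corresponding to one or more particular video frames.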

In some examples, audio source 22 may send data to audio encoder 26 corresponding to a time at which audio data was recorded, and video source 24 may send data to video encoder 28 corresponding to a time at which video data was recorded. In some examples, audio encoder 26 may encode a sequence identifier in encoded audio data to indicate a relative temporal ordering of the encoded audio data, without necessarily indicating an absolute time at which the audio data was recorded, and similarly, video encoder 28 may also use sequence identifiers to indicate a relative temporal ordering of encoded video data. Similarly, in some examples, a sequence identifier may be mapped to, or otherwise correlated with, a timestamp.

The techniques of this disclosure are generally directed to the transport of encoded multimedia (e.g., audio and video) data, and to the reception and subsequent interpretation and decoding of the transported multimedia data. The techniques of this disclosure may be applied to the transport of video data of various standards and extensions, such as, for example, scalable video coding (SVC), advanced video coding (AVC), OSI base layer, or multiview video coding (MVC) data, or other video data comprising a plurality of views. As shown in the example of FIG. 1, video source 24 may provide a plurality of views of a scene to video encoder 28. Multiple views of video data may be useful for generating three-dimensional video data to be used by a three-dimensional display, such as a stereoscopic or autostereoscopic three-dimensional display.

A/V source device 20 may provide a "service" to A/V destination device 40. A service generally corresponds to a subset of the available views of MVC data. For example, multiview video data may be available for eight views, ordered zero through seven. One service may correspond to stereo video having two views, while another service may correspond to four views, and still another service may correspond to all eight views. In general, a service corresponds to any combination (that is, any subset) of the available views. A service may also correspond to a combination of available views together with audio data.
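The eight-view example above may be sketched as follows. In this Python example, `AVAILABLE_VIEWS` and `make_service` are hypothetical names, and rejecting an empty set of views is an assumption made for the sketch.

```python
AVAILABLE_VIEWS = frozenset(range(8))  # views ordered zero through seven


def make_service(view_ids):
    """Return a service as a frozen set of view identifiers drawn from the available views."""
    views = frozenset(view_ids)
    if not views:
        raise ValueError("a service needs at least one view (assumption of this sketch)")
    if not views <= AVAILABLE_VIEWS:
        raise ValueError("service references a view that is not available")
    return views


stereo_service = make_service([0, 1])
four_view_service = make_service([0, 2, 4, 6])
full_service = make_service(range(8))
```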

A/V source device 20, in accordance with the techniques of this disclosure, is able to provide services that correspond to a subset of views. In general, a view is represented by a view identifier, also referred to as a "view_id." View identifiers generally comprise syntax elements that may be used to identify a view. An MVC encoder provides the view_id of a view when the view is encoded. The view_id may be used by an MVC decoder for inter-view prediction, or by other units for other purposes, e.g., for rendering.

Inter-view prediction is a technique for encoding MVC video data of a frame with reference to one or more encoded frames of different views at a common temporal location. FIG. 7, which is discussed in greater detail below, provides an example coding scheme for inter-view prediction. In general, an encoded frame of MVC video data may be predictively encoded spatially, temporally, and/or with reference to frames of other views at a common temporal location. Accordingly, reference views, from which other views are predicted, are generally decoded before the views for which they serve as a reference, so that the decoded reference views can be used for reference when decoding the referring views. The decoding order does not necessarily correspond to the order of the view_ids. Therefore, the decoding order of views is described using view order indexes. View order indexes are indexes that indicate the decoding order of corresponding view components in an access unit.
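The distinction between view_id order and decoding order may be illustrated as follows. This Python sketch assumes that the view order index of each view is available as a mapping from view_id to index; the function names and the example dependency table are hypothetical.

```python
def decoding_order(view_order_index):
    """Return view_ids sorted by their view order index (that is, in decoding order)."""
    return sorted(view_order_index, key=view_order_index.get)


def references_decoded_first(view_order_index, references):
    """Check that every reference view precedes the views predicted from it."""
    return all(view_order_index[ref] < view_order_index[view]
               for view, refs in references.items()
               for ref in refs)


# Hypothetical example: view 2 is decoded before view 1, although 1 < 2 as view_ids.
view_order_index = {0: 0, 2: 1, 1: 2}
references = {2: [0], 1: [0, 2]}  # view -> views it is predicted from
order = decoding_order(view_order_index)
```

Here view 1 is predicted from views 0 and 2, so both must be decoded first, which is why the decoding order differs from the numeric order of the view_ids.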

Each individual stream of data (whether audio or video) is referred to as an elementary stream. An elementary stream is a single, digitally coded (possibly compressed) component of a program. For example, the coded video or audio part of the program can be an elementary stream. An elementary stream may be converted into a packetized elementary stream (PES) before being multiplexed into a program stream or a transport stream. Within the same program, a stream ID is used to distinguish the PES packets belonging to one elementary stream from those belonging to another. The basic unit of data of an elementary stream is a packetized elementary stream (PES) packet. Thus, each view of MVC video data corresponds to a respective elementary stream. Similarly, audio data corresponds to one or more respective elementary streams.

An MVC coded video sequence may be separated into several sub-bitstreams, each of which is an elementary stream. Each sub-bitstream may be identified using an MVC view_id subset. Based on the concept of each MVC view_id subset, an MVC video sub-bitstream is defined. An MVC video sub-bitstream contains the NAL units of the views listed in the MVC view_id subset. A program stream generally contains only the NAL units that come from its elementary streams. It is also designed such that no two elementary streams can contain the same view.

In the example of FIG. 1, multiplexer 30 receives elementary streams comprising video data from video encoder 28 and elementary streams comprising audio data from audio encoder 26. In some examples, video encoder 28 and audio encoder 26 may each include packetizers for forming PES packets from encoded data. In other examples, video encoder 28 and audio encoder 26 may each interface with respective packetizers for forming PES packets from encoded data. In still other examples, multiplexer 30 may include packetizers for forming PES packets from encoded audio data and encoded video data.

A "program," as used in this disclosure, may comprise a combination of audio data and video data, e.g., an audio elementary stream and a subset of available views delivered by a service of A/V source device 20. Each PES packet includes a stream_id that identifies the elementary stream to which the PES packet belongs. Multiplexer 30 may assemble elementary streams into constituent program streams or transport streams. A program stream and a transport stream are two alternative multiplexes targeting different applications.

In general, a program stream includes data for one program, while a transport stream may include data for one or more programs. Multiplexer 30 may encode either or both of a program stream and a transport stream, based on the service being provided, the medium into which the stream will be passed, the number of programs to be sent, or other considerations. For example, when the video data is to be encoded in a storage medium, multiplexer 30 may be more likely to form a program stream, whereas when the video data is to be streamed over a network, broadcast, or sent as part of video telephony, multiplexer 30 may be more likely to use a transport stream.

Multiplexer 30 may be biased in favor of using a program stream for the storage and display of a single program from a digital storage service. Because program streams are rather susceptible to errors, a program stream is intended for use in error-free environments or in environments less susceptible to errors. A program stream simply comprises the elementary streams belonging to it and usually contains packets of variable length. In a program stream, PES packets that are derived from the contributing elementary streams are organized into "packs." A pack comprises a pack header, an optional system header, and any number of PES packets taken from any of the contributing elementary streams, in any order. The system header contains a summary of the characteristics of the program stream, such as its maximum data rate, the number of contributing video and audio elementary streams, further timing information, or other information. A decoder may use the information contained in a system header to determine whether or not the decoder is capable of decoding the program stream.

Multiplexer 30 may use a transport stream for the simultaneous delivery of a plurality of programs over potentially error-prone channels. A transport stream is a multiplex devised for multi-program applications such as broadcasting, so that a single transport stream can accommodate many independent programs. A transport stream comprises a succession of transport packets, each of which is 188 bytes long. The use of short, fixed-length packets means that a transport stream is less susceptible to errors than a program stream. Further, each 188-byte-long transport packet may be given additional error protection by processing the packet through a standard error protection process, such as Reed-Solomon encoding. The improved error resilience of the transport stream means that it has a better chance of surviving the error-prone channels found, for example, in a broadcast environment.

It might seem that the transport stream is clearly the better of the two multiplexes, given its increased error resilience and its ability to carry many simultaneous programs. However, the transport stream is a more sophisticated multiplex than the program stream and is consequently more difficult to create and to demultiplex. The first byte of a transport packet is a synchronization byte having a value of 0x47 (hexadecimal 47, binary "01000111," decimal 71). A single transport stream may carry many different programs, each program comprising many packetized elementary streams. Multiplexer 30 may use a thirteen-bit packet identifier (PID) field to distinguish transport packets containing the data of one elementary stream from those carrying the data of other elementary streams. It is the responsibility of the multiplexer to ensure that each elementary stream is awarded a unique PID value. The last byte of the transport packet header carries the continuity count field. Multiplexer 30 increments the value of the continuity count field between successive transport packets belonging to the same elementary stream. This enables a decoder or other unit of a destination device, such as A/V destination device 40, to detect the loss or gain of a transport packet and, ideally, to conceal the errors that might otherwise result from such an event.
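The header fields discussed above may be read as in the following Python sketch. The sketch assumes the four-byte transport packet header layout of ISO/IEC 13818-1, in which the 13-bit PID spans the second and third bytes and the 4-bit continuity counter occupies the low bits of the fourth byte. The function names are hypothetical, and the continuity check is simplified: it ignores duplicate packets and packets without payload.

```python
SYNC_BYTE = 0x47
PACKET_SIZE = 188


def parse_ts_header(packet):
    """Return (pid, continuity_counter) from a 188-byte transport packet."""
    if len(packet) != PACKET_SIZE:
        raise ValueError("transport packets are 188 bytes long")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing 0x47 synchronization byte")
    pid = ((packet[1] & 0x1F) << 8) | packet[2]  # low 5 bits of byte 1, then byte 2
    continuity_counter = packet[3] & 0x0F        # low 4 bits of byte 3
    return pid, continuity_counter


def continuity_ok(previous, current):
    """Simplified check: the 4-bit counter advances by one and wraps from 15 to 0."""
    return current == (previous + 1) % 16


# Example packet: PID 0x100, continuity counter 7, zero-filled remainder.
example_packet = bytes([0x47, 0x41, 0x00, 0x17]) + bytes(184)
example_fields = parse_ts_header(example_packet)
```

A receiver might keep the last counter value seen for each PID and flag a possible packet loss whenever `continuity_ok` returns false.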

Multiplexer 30 receives PES packets for the elementary streams of a program from audio encoder 26 and video encoder 28 and forms corresponding network abstraction layer (NAL) units from the PES packets. In the example of H.264/AVC (Advanced Video Coding), coded video segments are organized into NAL units, which provide a "network-friendly" video representation addressing applications such as video telephony, storage, broadcast, or streaming. NAL units can be categorized into Video Coding Layer (VCL) NAL units and non-VCL NAL units. VCL units contain data from the core compression engine and may comprise block, macroblock, and/or slice level data. Other NAL units are non-VCL NAL units.

Multiplexer 30 may form NAL units comprising a header that identifies the program to which the NAL unit belongs, as well as a payload, e.g., audio data, video data, or data that describes the transport or program stream to which the NAL unit corresponds. For example, in H.264/AVC, a NAL unit may include a 1-byte header and a payload of varying size. In one example, a NAL unit header comprises a priority_id element, a temporal_id element, an anchor_pic_flag element, a view_id element, a non_idr_flag element, and an inter_view_flag element. In conventional MVC, the NAL unit defined by H.264 is retained, except for prefix NAL units and MVC coded slice NAL units, which include a 4-byte MVC NAL unit header and the NAL unit payload.

The priority_id element of the NAL header may be used for a simple one-path bitstream adaptation process. The temporal_id element may be used to specify the temporal level of the corresponding NAL unit, where different temporal levels correspond to different frame rates.

The anchor_pic_flag element may indicate whether a picture is an anchor picture or a non-anchor picture. Anchor pictures and all the pictures succeeding them in output order (that is, display order) can be correctly decoded without decoding previous pictures in decoding order (that is, bitstream order), and thus can be used as random access points. Anchor pictures and non-anchor pictures can have different dependencies, both of which are signaled in the sequence parameter set. Other flags are discussed and used in the following sections of this chapter. Such an anchor picture may also be referred to as an open GOP (Group Of Pictures) access point, while a closed GOP access point is also supported when the non_idr_flag element is equal to zero. The non_idr_flag element indicates whether a picture is an instantaneous decoder refresh (IDR) or view IDR (V-IDR) picture. In general, an IDR picture, and all the pictures succeeding it in output order or bitstream order, can be correctly decoded without decoding previous pictures in either decoding order or display order.

The view_id element comprises syntax information that may be used to identify a view, which may be used for data interactivity inside an MVC decoder, e.g., for inter-view prediction, and outside a decoder, e.g., for rendering. The inter_view_flag element may specify whether the corresponding NAL unit is used by other views for inter-view prediction. To convey the 4-byte NAL unit header information for a base view, which may be compliant with AVC, a prefix NAL unit is defined in MVC. In the context of MVC, the base view access unit includes the VCL NAL units of the current time instance of the view as well as its prefix NAL unit, which contains only the NAL unit header. An H.264/AVC decoder may ignore the prefix NAL unit.
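The header elements named above may be unpacked from a 4-byte MVC NAL unit header as in the following Python sketch. The bit layout used here is an assumption based on the MVC NAL unit header extension of H.264 Annex H as commonly described (a 1-byte base header followed by 24 bits holding, in order, a 1-bit extension flag, non_idr_flag, a 6-bit priority_id, a 10-bit view_id, a 3-bit temporal_id, anchor_pic_flag, inter_view_flag, and one reserved bit), and should be checked against the standard before use. The function name is hypothetical.

```python
def parse_mvc_nal_header(header):
    """Unpack the fields of a 4-byte MVC NAL unit header (assumed bit layout)."""
    if len(header) < 4:
        raise ValueError("an MVC NAL unit header is 4 bytes long")
    ext = int.from_bytes(header[1:4], "big")  # 24 extension bits, MSB first
    return {
        "nal_unit_type": header[0] & 0x1F,
        "non_idr_flag": (ext >> 22) & 0x1,
        "priority_id": (ext >> 16) & 0x3F,
        "view_id": (ext >> 6) & 0x3FF,
        "temporal_id": (ext >> 3) & 0x7,
        "anchor_pic_flag": (ext >> 2) & 0x1,
        "inter_view_flag": (ext >> 1) & 0x1,
    }
```

A file reader might use the `view_id` and `temporal_id` values returned here to decide which NAL units belong to a given operating point.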

A NAL unit that includes video data in its payload may comprise video data at various granularity levels. For example, a NAL unit may comprise a block of video data, a macroblock, a plurality of macroblocks, a slice of video data, or an entire frame of video data.

In general, an access unit may comprise one or more NAL units for representing a frame of video data, as well as audio data corresponding to the frame when such audio data is available. An access unit generally includes all NAL units for one output time instance, e.g., all audio and video data for one time instance. In an example corresponding to H.264/AVC, an access unit may comprise a coded picture in one time instance, which may be presented as a primary coded picture. Accordingly, an access unit may comprise all video frames of a common time instance, e.g., all view components corresponding to time X.

This disclosure also refers to an encoded picture of a particular view as a "view component." That is, a view component comprises an encoded picture (or frame) for a particular view at a particular time. Accordingly, an access unit may, in some examples, comprise all view components of a common time instance. The decoding order of access units need not be the same as the output or display order. A set of consecutive access units may form a coded video sequence, which may correspond to a group of pictures (GOP) or another independently decodable unit of a NAL unit bitstream or sub-bitstream.

As with many video coding standards, H.264/AVC defines the syntax, semantics, and decoding process for error-free bitstreams, any of which conform to a certain profile or level. H.264/AVC does not specify the encoder, but the encoder is tasked with guaranteeing that the generated bitstreams are standard-compliant for a decoder. In the context of a video coding standard, a "profile" corresponds to a subset of algorithms, features, or tools and the constraints that apply to them. As defined by the H.264 standard, for example, a "profile" is a subset of the entire bitstream syntax that is specified by the H.264 standard. A "level" corresponds to limitations on decoder resource consumption, such as decoder memory and computation, which are related to picture resolution, bit rate, and macroblock (MB) processing rate.

The H.264 standard recognizes that, within the bounds imposed by the syntax of a given profile, it is still possible to require a large variation in the performance of encoders and decoders, depending on the values taken by syntax elements in the bitstream, such as the specified size of the decoded pictures. The H.264 standard further recognizes that, in many applications, it is neither practical nor economical to implement a decoder capable of dealing with all hypothetical uses of the syntax within a particular profile. Accordingly, the H.264 standard defines a "level" as a specified set of constraints imposed on values of the syntax elements in the bitstream. These constraints may be simple limits on values. Alternatively, these constraints may take the form of constraints on arithmetic combinations of values (e.g., picture width multiplied by picture height multiplied by the number of pictures decoded per second). The H.264 standard further provides that individual implementations may support a different level for each supported profile.
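The two kinds of constraint described above, a simple limit on a value and a limit on an arithmetic combination of values, can be illustrated with a small check. This is a sketch under stated assumptions: `LevelLimits`, `fits_level`, and the numbers in `EXAMPLE_LIMITS` are hypothetical and chosen for illustration only; actual limits should be taken from the level tables of the H.264 standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelLimits:
    max_frame_mbs: int        # simple limit: macroblocks per picture
    max_mbs_per_second: int   # combined limit: macroblocks per second


def macroblock_count(width: int, height: int) -> int:
    """Number of 16x16 macroblocks covering a picture (partial ones round up)."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def fits_level(width: int, height: int, fps: float, limits: LevelLimits) -> bool:
    """Check picture size and (width x height x pictures per second) against a level."""
    frame_mbs = macroblock_count(width, height)
    return (frame_mbs <= limits.max_frame_mbs
            and frame_mbs * fps <= limits.max_mbs_per_second)


# Illustrative limits only (assumption, not quoted from the source text).
EXAMPLE_LIMITS = LevelLimits(max_frame_mbs=3600, max_mbs_per_second=108000)
```

With these example limits, a 1280x720 stream at 30 pictures per second passes both checks, while the same picture size at 60 pictures per second fails only the combined constraint.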

A decoder conforming to a profile ordinarily supports all the features defined in the profile. For example, as a coding feature, B-picture coding is not supported in the baseline profile of H.264/AVC but is supported in other profiles of H.264/AVC. A decoder conforming to a level should be capable of decoding any bitstream that does not require resources beyond the limitations defined in the level. Definitions of profiles and levels may be helpful for interpretability. For example, during video transmission, a pair of profile and level definitions may be negotiated and agreed upon for a whole transmission session. More specifically, in H.264/AVC, a level may define, for example, limitations on the number of macroblocks that need to be processed, the decoded picture buffer (DPB) size, the coded picture buffer (CPB) size, the vertical motion vector range, the maximum number of motion vectors per two consecutive MBs, and whether a B-block can have sub-macroblock partitions smaller than 8×8 pixels. In this manner, a decoder may determine whether it is capable of properly decoding the bitstream.

Parameter sets generally contain sequence-layer header information in sequence parameter sets (SPS) and infrequently changing picture-layer header information in picture parameter sets (PPS). With parameter sets, this infrequently changing information need not be repeated for each sequence or picture, so coding efficiency may be improved. Furthermore, the use of parameter sets may enable out-of-band transmission of header information, avoiding the need for redundant transmissions to achieve error resilience. In out-of-band transmission, parameter set NAL units are transmitted on a different channel than the other NAL units.

The techniques of this disclosure involve including extractors in a media extractor track. An extractor of this disclosure may refer to two or more NAL units of another track in a common file. That is, a file may include a first track having a plurality of NAL units and a second track including an extractor that identifies two or more of the plurality of NAL units of the first track. In general, an extractor may act as a pointer, such that when demultiplexer 38 encounters the extractor, demultiplexer 38 may retrieve the NAL units identified by the extractor from the first track and send those NAL units to video decoder 48. A track that includes an extractor may be referred to as a media extractor track. The extractors of this disclosure may be included in files conforming to various file formats, e.g., the ISO base media file format, the Scalable Video Coding (SVC) file format, the Advanced Video Coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and/or the Multiview Video Coding (MVC) file format.

In general, the various tracks of a video file may be used as switch tracks. That is, multiplexer 30 may include various tracks to support various frame rates, display capabilities, and/or decoding capabilities. For example, when the video file conforms to the MVC file format, each track may represent a different MVC operation point. Accordingly, demultiplexer 38 may be configured to select one of the tracks from which to retrieve NAL units and to discard the data of the other tracks, other than the NAL units identified by extractors of the selected track. That is, when the selected track includes extractors that refer to NAL units of another track, demultiplexer 38 may extract the referenced NAL units while discarding the non-referenced NAL units of the other track. Demultiplexer 38 may send the extracted NAL units to video decoder 48.
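The pointer-like behavior of an extractor during track selection can be sketched in a few lines. The sketch below is illustrative only: `NalUnit`, `Extractor`, and `resolve_track` are hypothetical names, tracks are modeled as in-memory lists rather than file-format boxes, and the referenced track is assumed to contain only NAL units (no nested extractors).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class NalUnit:
    payload: bytes


@dataclass(frozen=True)
class Extractor:
    track_id: int                  # track that holds the referenced NAL units
    nal_indices: Tuple[int, ...]   # positions of two or more NAL units in that track


TrackItem = Union[NalUnit, Extractor]


def resolve_track(tracks: Dict[int, List[TrackItem]], selected: int) -> List[NalUnit]:
    """Return the NAL units to hand to the decoder for the selected track.

    Extractors are replaced by the NAL units they identify; every NAL unit
    of another track that is not referenced is simply never emitted.
    """
    out: List[NalUnit] = []
    for item in tracks[selected]:
        if isinstance(item, Extractor):
            source = tracks[item.track_id]
            out.extend(source[i] for i in item.nal_indices)
        else:
            out.append(item)
    return out
```

Selecting the second track of a file whose first track holds three NAL units, with an extractor naming the first and third, yields only those two units; the middle unit is discarded without being parsed.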

By using extractors in media extractor tracks, the techniques of this disclosure may be used to achieve temporal scalability between the various tracks of a video file. In MPEG-1 and MPEG-2, for example, B-encoded pictures provide a natural temporal scalability. A first track of a video file conforming to MPEG-1 or MPEG-2 may include a full set of I-encoded, P-encoded, and B-encoded pictures. A second track of the video file may include one or more extractors that refer only to the I-encoded and P-encoded pictures of the first track, omitting references to the B-encoded pictures. By leaving out the B-encoded pictures, the video file may achieve a conforming half-resolution video representation. MPEG-1 and MPEG-2 also provide a base layer and enhancement layer concept for coding two temporal layers, in which the enhancement layer pictures can choose, for each prediction direction, a picture from either the base layer or the enhancement layer as a reference.

As another example, H.264/AVC uses hierarchical B-encoded pictures to support temporal scalability. The first picture of a video sequence in H.264/AVC may be referred to as an Instantaneous Decoder Refresh (IDR) picture, also known as a key picture. Key pictures are generally coded at regular or irregular intervals, and are either intra-coded or inter-coded using a previous key picture as a reference for motion-compensated prediction. A group of pictures (GOP) generally includes a key picture and all pictures that are temporally located between that key picture and the previous key picture. A GOP can be divided into two parts: one is the key picture, and the other includes the non-key pictures. The non-key pictures are hierarchically predicted from two reference pictures, which are the nearest pictures of a lower temporal level from the past and the future. A temporal identifier value may be assigned to each picture to indicate the hierarchical position of the picture. Thus, pictures with temporal identifier values up to N may form a video segment with twice the frame rate of a video segment formed by pictures with temporal identifier values up to N-1. Accordingly, the techniques of this disclosure may also be used to achieve temporal scalability in H.264/AVC by having a first track that includes all NAL units with temporal identifier values up to N, and a second track that includes one or more extractors referring to the NAL units of the first track with temporal identifier values up to N-1.
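The relationship between temporal identifiers and frame rate can be checked with a short worked example. The sketch assumes a dyadic hierarchical-B GOP of eight pictures whose temporal identifiers, in display order, are 0, 3, 2, 3, 1, 3, 2, 3; the function names and the `GOP8` constant are hypothetical.

```python
from typing import List, Sequence


def extractor_targets(temporal_ids: Sequence[int], max_tid: int) -> List[int]:
    """Indices of first-track pictures an extractor track would reference
    when it keeps only temporal identifiers up to max_tid."""
    return [i for i, tid in enumerate(temporal_ids) if tid <= max_tid]


def sub_stream_frame_rate(full_rate: float,
                          temporal_ids: Sequence[int],
                          max_tid: int) -> float:
    """Frame rate of the sub-stream formed by the referenced pictures."""
    kept = len(extractor_targets(temporal_ids, max_tid))
    return full_rate * kept / len(temporal_ids)


# Assumed dyadic GOP of 8 pictures (illustrative, display order).
GOP8 = [0, 3, 2, 3, 1, 3, 2, 3]
```

For a 30 frames-per-second first track, keeping identifiers up to 2 references pictures 0, 2, 4, and 6 and gives 15 frames per second; each further step down halves the rate again.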

As noted above, the techniques of this disclosure may be applied to video files conforming to any of the ISO base media file format, the Scalable Video Coding (SVC) file format, the Advanced Video Coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and/or the Multiview Video Coding (MVC) file format. The ISO base media file format is designed to contain timed media information for a presentation in a flexible, extensible format that facilitates interchange, management, editing, and presentation of the media. The ISO base media file format (ISO/IEC 14496-12:2004) is specified in MPEG-4 Part 12, which defines a general structure for time-based media files. It is used as the basis for other file formats in the family, such as the AVC file format (ISO/IEC 14496-15), defined to support H.264/MPEG-4 AVC video compression, the 3GPP file format, the SVC file format, and the MVC file format. The 3GPP file format and the MVC file format are extensions of the AVC file format. The ISO base media file format contains the timing, structure, and media information for timed sequences of media data, such as audio-visual presentations. The file structure is object-oriented. A file can be decomposed into basic objects very simply, and the structure of the objects is implied by their type.

Files conforming to the ISO base media file format are formed as a series of objects called "boxes." Data in the ISO base media file format is contained in boxes, and there is no other data within the file. This includes any initial signature required by the specific file format. A "box" is an object-oriented building block defined by a unique type identifier and length. Typically, a presentation is contained in one file, and the media presentation is self-contained. The movie container (movie box) contains the metadata of the media, while the video and audio frames are contained in the media data container and may be in other files.
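Because every box is identified by a length and a type, a file can be walked with very little code. The following sketch assumes the usual ISO base media file format box header: a 32-bit big-endian size followed by a four-character type, where a size of 1 means a 64-bit size follows and a size of 0 means the box runs to the end of the enclosing data. The function name `iter_boxes` is hypothetical, and the sketch does not descend into container boxes.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(buf: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_offset, payload_size) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            # 64-bit "largesize" follows the type field.
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:
            # Box extends to the end of the enclosing data.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError("malformed box at offset %d" % pos)
        yield box_type.decode("latin-1"), pos + header, size - header
        pos += size
```

Calling the same function on the payload range of a container such as the movie box would list its child boxes, which is how the object structure is recovered from the types alone.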

A presentation (motion sequence) may be contained in several files. All timing and framing (position and size) information is generally in the ISO base media file, and the ancillary files may use essentially any format. The presentation may be "local" to the system containing the presentation, or may be provided via a network or other stream delivery mechanism.

A file may have a logical structure, a time structure, and a physical structure, and these structures need not be coupled. The logical structure of the file may be that of a movie, which in turn contains a set of time-parallel tracks. The time structure of the file may be that the tracks contain sequences of samples in time, and those sequences are mapped into the timeline of the overall movie by optional edit lists. The physical structure of the file may separate the data needed for logical, time, and structural decomposition from the media data samples themselves. This structural information is concentrated in the movie box, and may be extended in time by movie fragment boxes. The movie box may document the logical and timing relationships of the samples, and may also contain pointers to where they are located. Those pointers may refer to the same file or to another file, e.g., one referenced by a URL.

Each media stream may be contained in a track specialized for that media type (audio, video, etc.), and may further be parameterized by a sample entry. The sample entry may contain the "name" of the exact media type (the type of decoder needed to decode the stream) and any parameterization of that decoder that is needed. The name may also take the form of a four-character code, e.g., "moov" or "trak." Sample entry formats are defined not only for MPEG-4 media, but also for the media types used by other organizations that use this file format family.

Support for metadata generally takes two forms. First, timed metadata may be stored in an appropriate track and synchronized, as desired, with the media data it describes. Second, there may be general support for non-timed metadata attached to the movie or to an individual track. The structural support is general and, as with the media data, allows metadata resources to be stored elsewhere in the file or in another file. In addition, these resources may be named and may be protected.

In the ISO base media file format, a sample grouping is an assignment of each of the samples in a track to be a member of one sample group. Samples in a sample group need not be contiguous. For example, when presenting H.264/AVC in the AVC file format, the video samples at one temporal level may be grouped into one sample group. Sample groups may be represented by two data structures: a SampleToGroup box (sbgp) and a SampleGroupDescription box. The SampleToGroup box represents the assignment of samples to sample groups. There may be one instance of the SampleGroupDescription box for each sample group entry, to describe the properties of the corresponding group.
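The assignment carried by a SampleToGroup box can be pictured as a run-length table. The sketch below assumes the box has already been parsed into (sample_count, group_description_index) runs, with index 0 meaning that the samples belong to no group of this type; `expand_sample_to_group` and `members` are hypothetical helper names.

```python
from typing import Iterable, List, Tuple


def expand_sample_to_group(entries: Iterable[Tuple[int, int]],
                           total_samples: int) -> List[int]:
    """Expand run-length (sample_count, group_description_index) entries
    into one group index per sample; uncovered samples get index 0."""
    mapping: List[int] = []
    for run_length, group_index in entries:
        mapping.extend([group_index] * run_length)
    # Samples beyond the last run are treated as belonging to no group.
    mapping.extend([0] * (total_samples - len(mapping)))
    return mapping[:total_samples]


def members(mapping: List[int], group_index: int) -> List[int]:
    """Sample numbers (0-based) assigned to the given group."""
    return [i for i, g in enumerate(mapping) if g == group_index]
```

With runs (1, 1), (2, 2), (1, 1), group 1 contains samples 0 and 3, which shows that the members of a group need not be contiguous.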

An optional metadata track can be used to tag each track with the "interesting characteristic" that it has, for which its value may differ from that of other members of the group (e.g., its bit rate, screen size, or language). Some samples within a track may have special characteristics or may be individually identified. One example of such a characteristic is the synchronization point (often a video I-frame). These points may be identified by a special table in each track. More generally, the nature of dependencies between track samples can also be documented using metadata. The metadata can be structured as a sequence of file format samples, just like a video track. Such a track may be referred to as a metadata track. Each metadata sample may be structured as a metadata statement. There are various kinds of statements, corresponding to the various questions that might be asked about the corresponding file format sample or its constituent samples.

When media is delivered over a streaming protocol, the media may need to be transformed from the way it is represented in the file. One example of this is when media is transmitted over the Real-time Transport Protocol (RTP). In the file, for example, each frame of video is stored contiguously as a file format sample. In RTP, packetization rules specific to the codec used must be obeyed to place these frames in RTP packets. A streaming server may be configured to calculate such packetization at run time. However, there is support for assisting streaming servers: special tracks called hint tracks may be placed in the files.

Hint tracks contain general instructions for streaming servers on how to form packet streams from media tracks for a specific protocol. Because the form of these instructions is media-independent, servers may not need to be revised when new codecs are introduced. In addition, encoding and editing software can be unaware of streaming servers. Once editing of a file is finished, a piece of software called a hinter may be used to add hint tracks to the file before it is placed on a streaming server. As an example, there is a defined hint track format for RTP streams in the MP4 file format specification.

3GP (the 3GPP file format) is a multimedia container format defined by the Third Generation Partnership Project (3GPP) for 3G UMTS multimedia services. It is typically used on 3G mobile phones and other 3G-capable devices, but can also be played on some 2G and 4G phones and devices. The 3GPP file format is based on the ISO base media file format. The latest 3GP is specified in 3GPP TS 26.244, "Transparent end-to-end packet switched streaming service (PSS); 3GPP file format (3GP)." The 3GPP file format stores video streams as MPEG-4 Part 2, H.263, or MPEG-4 Part 10 (AVC/H.264). 3GPP allows use of the AMR and H.263 codecs in the ISO base media file format, because 3GPP specifies the usage of the sample entry and template fields in the ISO base media file format (MPEG-4 Part 12), and also defines new boxes to which the codecs refer. For the storage of MPEG-4 media-specific information in 3GP files, the 3GP specification refers to the MP4 and AVC file formats, which are also based on the ISO base media file format. The MP4 and AVC file format specifications describe the usage of MPEG-4 content in the ISO base media file format.

The SVC file format, as an extension of the AVC file format, has the new structures of extractor and tier. Extractors are pointers that provide information about the position and the size of video coding data in the sample with equal decoding time in another track. This allows a track hierarchy to be built directly in the coding domain. An extractor track in SVC is linked to one or more base tracks, from which it extracts data at run time. An extractor is a dereferenceable pointer with a NAL unit header carrying the SVC extensions. If the track used for extraction contains video coding data at a different frame rate, the extractor also contains a decoding time offset to ensure synchrony between tracks. At run time, the extractor must be replaced by the data to which it points before the stream is passed to the video decoder.
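The run-time replacement of an SVC-style extractor by the bytes it points to can be sketched as follows. This is a simplified model under stated assumptions: the extractor is reduced to a referenced track, a byte offset, and a byte length within the sample of equal decoding time; the decoding time offset and the NAL unit header are omitted; and `ByteRangeExtractor` and `dereference` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class ByteRangeExtractor:
    track_id: int      # base track to read from
    data_offset: int   # first byte to copy within the referenced sample
    data_length: int   # number of bytes to copy


SamplePart = Union[bytes, ByteRangeExtractor]


def dereference(parts: Sequence[SamplePart], decode_time: int,
                tracks: Dict[int, Dict[int, bytes]]) -> bytes:
    """Replace each extractor with the bytes it points to, taken from the
    sample of the referenced track that has the same decoding time."""
    out = bytearray()
    for part in parts:
        if isinstance(part, ByteRangeExtractor):
            sample = tracks[part.track_id][decode_time]
            chunk = sample[part.data_offset:part.data_offset + part.data_length]
            if len(chunk) != part.data_length:
                raise ValueError("extractor points outside the referenced sample")
            out += chunk
        else:
            out += part
    return bytes(out)
```

Only after this substitution does the sample consist purely of coded data, which is the form a video decoder expects.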

Because extractor tracks in SVC are structured like video coding tracks, they may represent the subset they need in different ways. An SVC extractor track contains only instructions on how to extract data from another track. In the SVC file format, there are also aggregators, which can aggregate the NAL units within a sample together as one NAL unit, including aggregating the NAL units of one layer into an aggregator. The extractor in SVC is designed to extract a certain range of bytes from a sample or an aggregator, or just one entire NAL unit, but not multiple NAL units, especially those that are not contiguous within a sample. In the SVC file format, there can be many video operation points. Tiers are designed to group the samples in one or more tracks for an operation point.

The MVC file format also supports an extractor track, which extracts NAL units from different views to form an operation point, that is, a subset of views at a certain frame rate. The design of the MVC extractor track is similar to that of the extractor in the SVC file format. However, using MVC extractor tracks to form an alternate group is not supported. To support track selection, the following proposal has been submitted to MPEG: P. Frojdh, A. Norkin, and C. Priddle, "File format sub-track selection and switching," ISO/IEC JTC1/SC29/WG11 MPEG M16665, London, UK. This proposal attempts to enable the alternate/switch group concept at the sub-track level.

A map sample group is an extension of the sample group. In a map sample group, each group entry (of samples) has its description of "groupID," which is actually a map to a view_id, possibly after aggregating the NAL units in a view into one NAL unit. In other words, each sample group entry has the views it contains listed in the ScalableNALUMapEntry value. The grouping_type of this sample group entry is "scnm."

Progressive download is a term used to describe the transfer of digital media files from a server to a client, typically using the HTTP protocol. When initiated from a computer, the consumer may begin playback of the media before the download is complete. The key difference between streaming media and progressive download lies in how the digital media data is received and stored by the end-user device that is accessing the digital media. A media player that is capable of progressive download playback relies on the metadata located in the header of the file being intact, and on a local buffer of the digital media file as it is downloaded from a web server. At the point at which a specified amount of data becomes available to the local playback device, the media begins to play. This specified amount of buffer is embedded in the file by the producer of the content in the encoder settings, and is reinforced by additional buffer settings imposed by the media player.
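The start-of-playback rule described above can be expressed as a simple threshold test. The sketch assumes that the buffer amount embedded in the file and the additional amount imposed by the player simply add up, which is one possible reading of the text; `playback_start_index` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def playback_start_index(chunk_sizes: Sequence[int],
                         file_buffer_bytes: int,
                         player_extra_bytes: int = 0) -> Optional[int]:
    """Index of the downloaded chunk after which playback may begin,
    or None if the threshold is never reached.

    Assumption: the player's extra buffer setting is added to the
    amount embedded in the file by the content producer.
    """
    threshold = file_buffer_bytes + player_extra_bytes
    received = 0
    for index, size in enumerate(chunk_sizes):
        received += size
        if received >= threshold:
            return index
    return None
```

For three 100-byte chunks and a 150-byte embedded threshold, playback may begin after the second chunk; a player that adds another 100 bytes of buffering waits for the third.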

In 3GPP, HTTP/TCP/IP transport is supported for 3GP files for download and progressive download. Furthermore, using HTTP for video streaming may provide some advantages, and video streaming services based on HTTP are becoming popular. One advantage of HTTP streaming is that existing Internet components and protocols may be used, so that no new effort is needed to develop new techniques for transporting video data over a network. Other transport protocols, e.g., the RTP payload format, require intermediate network devices, e.g., middle boxes, to be aware of the media format and the signaling context. Also, HTTP streaming can be client-driven, which avoids many control issues. For example, to exploit all features to obtain optimal performance, a server may keep track of the size and content of packets that are not yet acknowledged. The server may also analyze the file structure and reconstruct the state of the client buffer to make RD-optimal switching/thinning decisions. In addition, constraints on the bitstream variations may be satisfied in order to stay compliant with negotiated profiles. HTTP does not necessarily require new hardware or software implementations at a web server that has HTTP 1.1 implemented. HTTP streaming also provides TCP-friendliness and firewall traversal. The techniques of this disclosure may improve HTTP streaming of video data to overcome issues related to bandwidth, e.g., by providing bitrate adaptation.

Video compression standards such as ITU-T H.261, H.262, H.263, MPEG-1, MPEG-2, and H.264/MPEG-4 Part 10 make use of motion compensated temporal prediction to reduce temporal redundancy. The encoder uses motion compensated prediction from some previously encoded pictures (also referred to herein as frames) to predict the current coded picture according to motion vectors. There are three major picture types in typical video coding: intra-coded pictures ("I-pictures" or "I-frames"), predicted pictures ("P-pictures" or "P-frames"), and bi-directional predicted pictures ("B-pictures" or "B-frames"). Blocks of P-pictures may be intra-coded or predicted with reference to one other picture. In a B-picture, blocks may be predicted from one or two reference pictures or may be intra-coded. These reference pictures may be located before or after the current picture in temporal order.

In accordance with the H.264 coding standard, as an example, B-pictures use two lists of previously coded reference pictures, list 0 and list 1. These two lists can each contain past and/or future coded pictures in temporal order. Blocks in a B-picture may be predicted in one of several ways: motion-compensated prediction from a list 0 reference picture, motion-compensated prediction from a list 1 reference picture, or motion-compensated prediction from the combination of both list 0 and list 1 reference pictures. To get the combination of both list 0 and list 1 reference pictures, two motion compensated reference areas are obtained from a list 0 reference picture and a list 1 reference picture, respectively. Their combination is used to predict the current block.
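As a rough illustration of combining the two motion compensated reference areas, the following Python sketch forms the prediction as a rounded average of the list 0 and list 1 areas. The text does not state how the areas are combined; the rounded average is an assumption of this sketch (other weightings are possible), and the function name and sample values are hypothetical.

```python
def bi_predict(list0_area, list1_area):
    """Combine two motion-compensated reference areas by rounded averaging.

    Assumption: equal weights with round-half-up, i.e. (a + b + 1) >> 1 per pixel.
    """
    if len(list0_area) != len(list1_area):
        raise ValueError("reference areas must have the same number of rows")
    return [
        [(a + b + 1) >> 1 for a, b in zip(row0, row1)]
        for row0, row1 in zip(list0_area, list1_area)
    ]


# Hypothetical 2x2 luminance areas fetched from a list 0 and a list 1 reference picture.
ref0 = [[100, 102], [104, 106]]
ref1 = [[110, 111], [90, 107]]
prediction = bi_predict(ref0, ref1)
print(prediction)
```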

Smaller video blocks can provide better resolution and may be used for locations of a video frame that include high levels of detail. In general, macroblocks and the various partitions, sometimes referred to as sub-blocks, may be considered video blocks. In addition, a slice may be considered to be a plurality of video blocks, such as macroblocks and/or sub-blocks. Each slice may be an independently decodable unit of a video frame. Alternatively, frames themselves may be decodable units, or other portions of a frame may be defined as decodable units. The term "coded unit" or "coding unit" may refer to any independently decodable unit of a video frame such as an entire frame, a slice of a frame, a group of pictures (GOP), also referred to as a sequence, or another independently decodable unit defined according to applicable coding techniques.

マクロブロックという用語は、１６×１６ピクセルを備える２次元ピクセルアレイに従ってピクチャおよび／またはビデオデータを符号化するためのデータ構造を指す。各ピクセルはクロミナンス成分と輝度成分とを備える。したがって、マクロブロックは、各々が８×８ピクセルの２次元アレイを備える４つの輝度ブロックと、各々が１６×１６ピクセルの２次元アレイを備える２つのクロミナンスブロックと、コード化ブロックパターン（ＣＢＰ）、符号化モード（たとえば、イントラ（Ｉ）またはインター（ＰまたはＢ）符号化モード）、イントラ符号化ブロックのパーティションのパーティションサイズ（たとえば、１６×１６、１６×８、８×１６、８×８、８×４、４×８、または４×４）、あるいはインター符号化マクロブロックのための１つまたは複数の動きベクトルなど、シンタックス情報を備えるヘッダとを定義し得る。 The term macroblock refers to a data structure for encoding picture and / or video data according to a two-dimensional pixel array comprising 16 × 16 pixels. Each pixel has a chrominance component and a luminance component. Thus, a macroblock consists of four luminance blocks each comprising a two-dimensional array of 8 × 8 pixels, two chrominance blocks each comprising a two-dimensional array of 16 × 16 pixels, and a coded block pattern (CBP), Coding mode (eg, intra (I) or inter (P or B) coding mode), partition size of intra coding block partition (eg, 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, or 4 × 4), or a header with syntax information, such as one or more motion vectors for inter-coded macroblocks.

ビデオエンコーダ２８、ビデオデコーダ４８、オーディオエンコーダ２６、オーディオデコーダ４６、マルチプレクサ３０、およびデマルチプレクサ３８は、それぞれ、適用可能なとき、１つまたは複数のマイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ）、特定用途向け集積回路（ＡＳＩＣ）、フィールドプログラマブルゲートアレイ（ＦＰＧＡ）、ディスクリート論理、ソフトウェア、ハードウェア、ファームウェアなどの様々な好適なエンコーダまたはデコーダ回路のいずれか、またはそれらの任意の組合せとして実装され得る。ビデオエンコーダ２８およびビデオデコーダ４８の各々は１つまたは複数のエンコーダまたはデコーダ中に含められ得、そのいずれかは複合ビデオエンコーダ／デコーダ（ＣＯＤＥＣ）の一部として統合され得る。同様に、オーディオエンコーダ２６およびオーディオデコーダ４６の各々は１つまたは複数のエンコーダまたはデコーダ中に含められ得、そのいずれかは複合オーディオエンコーダ／デコーダ（ＣＯＤＥＣ）の一部として統合され得る。ビデオエンコーダ２８、ビデオデコーダ４８、オーディオエンコーダ２６、オーディオデコーダ４６、マルチプレクサ３０、および／またはデマルチプレクサ３８を含む装置は、集積回路、マイクロプロセッサ、および／またはセルラー電話などのワイヤレス通信デバイスを備え得る。 Video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, multiplexer 30, and demultiplexer 38 are each one or more microprocessors, digital signal processors (DSPs), application specific, as applicable. It may be implemented as any of a variety of suitable encoder or decoder circuits, such as an integrated circuit (ASIC), field programmable gate array (FPGA), discrete logic, software, hardware, firmware, or any combination thereof. Each of video encoder 28 and video decoder 48 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined video encoder / decoder (CODEC). Similarly, each of audio encoder 26 and audio decoder 46 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined audio encoder / decoder (CODEC). An apparatus that includes video encoder 28, video decoder 48, audio encoder 26, audio decoder 46, multiplexer 30, and / or demultiplexer 38 may comprise an integrated circuit, a microprocessor, and / or a wireless communication device such as a cellular telephone.

In accordance with the techniques of this disclosure, multiplexer 30 may assemble NAL units into tracks of a video file conforming to the ISO base media file format or a derivative thereof (e.g., SVC, AVC, MVC, or 3GPP), include a media extractor track that identifies one or more potentially non-consecutive NAL units of another track, and pass the video file to output interface 32. Output interface 32 may comprise, for example, a transmitter, a transceiver, a device for writing data to a computer-readable medium such as an optical drive or a magnetic media drive (e.g., a floppy drive), a universal serial bus (USB) port, a network interface, or another output interface. Output interface 32 outputs the NAL unit or access unit to computer-readable medium 34, e.g., a transitory medium such as a transmission signal or carrier wave, or a computer-readable storage medium such as magnetic media, optical media, a memory, or a flash drive.

Input interface 36 retrieves the data from computer-readable medium 34. Input interface 36 may comprise, for example, an optical drive, a magnetic media drive, a USB port, a receiver, a transceiver, or another computer-readable medium interface. Input interface 36 may provide the NAL unit or access unit to demultiplexer 38. Demultiplexer 38 may demultiplex a transport stream or program stream into constituent PES streams, depacketize the PES streams to retrieve encoded data, and send the encoded data to either audio decoder 46 or video decoder 48, depending on whether the encoded data is part of an audio or video stream, e.g., as indicated by PES packet headers of the stream. Demultiplexer 38 may initially select one of the tracks included in a received video file, and then pass only the data of the selected track and the data of other tracks referenced by extractors of the selected track to video decoder 48, discarding data of other tracks that is not referenced by an extractor of the selected track. Audio decoder 46 decodes encoded audio data and sends the decoded audio data to audio output 42, while video decoder 48 decodes encoded video data and sends the decoded video data, which may include a plurality of views of a stream, to video output 44. Video output 44 may comprise a display that uses a plurality of views of a scene, e.g., a stereoscopic or autostereoscopic display that presents each view of a scene simultaneously.

図２は、マルチプレクサ３０（図１）の構成要素の例示的な構成を示すブロック図である。図２の例では、マルチプレクサ３０は、ストリーム管理ユニット６０と、ビデオ入力インターフェース８０と、オーディオ入力インターフェース８２と、多重化ストリーム出力インターフェース８４と、プログラム固有情報テーブル８８とを含む。ストリーム管理ユニット６０は、ＮＡＬユニットコンストラクタ６２と、ストリーム識別子（ストリームＩＤ）ルックアップユニット６６と、トラック生成ユニット６４と、エクストラクタ生成ユニット６８とを含む。 FIG. 2 is a block diagram illustrating an exemplary configuration of components of multiplexer 30 (FIG. 1). In the example of FIG. 2, the multiplexer 30 includes a stream management unit 60, a video input interface 80, an audio input interface 82, a multiplexed stream output interface 84, and a program specific information table 88. The stream management unit 60 includes a NAL unit constructor 62, a stream identifier (stream ID) lookup unit 66, a track generation unit 64, and an extractor generation unit 68.

図２の例では、ビデオ入力インターフェース８０およびオーディオ入力インターフェース８２は、符号化ビデオデータおよび符号化オーディオデータからＰＥＳユニットを形成するためにそれぞれのパケッタイザを含む。他の例では、ビデオおよび／またはオーディオパケッタイザは、マルチプレクサ３０の外部に存在し得る。図２の例に関して、ビデオ入力インターフェース８０は、ビデオエンコーダ２８から受信された符号化ビデオデータからＰＥＳパケットを形成し得、オーディオ入力インターフェース８２は、オーディオエンコーダ２６から受信された符号化オーディオデータからＰＥＳパケットを形成し得る。 In the example of FIG. 2, video input interface 80 and audio input interface 82 include respective packetizers to form a PES unit from encoded video data and encoded audio data. In other examples, the video and / or audio packetizer may be external to the multiplexer 30. With respect to the example of FIG. 2, video input interface 80 may form PES packets from encoded video data received from video encoder 28, and audio input interface 82 may detect PES from encoded audio data received from audio encoder 26. Packets can be formed.

After NAL unit constructor 62 constructs NAL units, NAL unit constructor 62 sends the NAL units to track generation unit 64. Track generation unit 64 receives the NAL units and assembles a video file that includes the NAL units in one or more tracks of the video file. Track generation unit 64 may further execute extractor generation unit 68 to generate extractors for one or more media extractor tracks constructed by track generation unit 64. When one or more NAL units are determined to belong to multiple tracks, rather than duplicating the NAL units across the tracks, extractor generation unit 68 may construct an extractor for a track that references the NAL units. In this manner, multiplexer 30 may avoid duplication of data between tracks, which may reduce bandwidth consumption when transmitting the video file.

エクストラクタのためのデータ構造および構成要素の様々な例について、以下に説明する。概して、エクストラクタは、参照されるＮＡＬユニットが含まれるトラックを参照するトラック識別子値と、エクストラクタによって参照されるＮＡＬユニットを識別する１つまたは複数のＮＡＬユニット識別子とを含み得る。いくつかの例では、ＮＡＬユニット識別子は、識別されるＮＡＬユニットに対応するトラック識別子値によって参照されるトラック中のビットまたはバイト範囲を参照し得る。いくつかの例では、たとえば、非連続ＮＡＬユニットを識別するために、ＮＡＬユニット識別子は、エクストラクタによって識別される各ＮＡＬユニットを個々に参照し得る。いくつかの例では、ＮＡＬユニット識別子は、メディアエクストラクタトラック中のエクストラクタの時間または空間ロケーションからのオフセットに基づいて、ＮＡＬユニットを参照し得る。 Various examples of data structures and components for extractors are described below. In general, an extractor may include a track identifier value that references a track that includes a referenced NAL unit, and one or more NAL unit identifiers that identify the NAL unit referenced by the extractor. In some examples, the NAL unit identifier may refer to a bit or byte range in the track referenced by the track identifier value corresponding to the identified NAL unit. In some examples, for example, to identify non-contiguous NAL units, the NAL unit identifier may individually reference each NAL unit identified by the extractor. In some examples, the NAL unit identifier may reference a NAL unit based on an offset from the extractor's time or spatial location in the media extractor track.
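One of the variants described above (a track identifier plus NAL unit identifiers expressed as byte ranges) might be modeled as in the following Python sketch. The class names, field names, and the in-memory dictionary standing in for the tracks of a file are hypothetical and are not the syntax defined by the disclosure or by any file format specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class NalUnitRef:
    """Identifies one NAL unit as a byte range inside the referenced track."""
    byte_offset: int
    length: int


@dataclass
class Extractor:
    """Hypothetical extractor: a track identifier plus one reference per NAL unit."""
    track_id: int
    nal_refs: List[NalUnitRef] = field(default_factory=list)

    def resolve(self, tracks: Dict[int, bytes]) -> List[bytes]:
        """Return the referenced NAL units without copying them into this track."""
        data = tracks[self.track_id]
        return [data[r.byte_offset:r.byte_offset + r.length] for r in self.nal_refs]


# Track 1 holds four 4-byte "NAL units"; the extractor references the first and third,
# which are non-consecutive, by referencing each one individually.
tracks = {1: b"AAAABBBBCCCCDDDD"}
extractor = Extractor(track_id=1, nal_refs=[NalUnitRef(0, 4), NalUnitRef(8, 4)])
resolved = extractor.resolve(tracks)
print(resolved)
```

Referencing each NAL unit individually is what allows a single extractor in this sketch to cover non-consecutive units of the referenced track.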

トラック生成ユニット６４は、いくつかの例では、メディアエクストラクタトラック中に追加のＮＡＬユニットを含み得る。すなわち、メディアエクストラクタトラックは、ＮＡＬユニットとエクストラクタとを含み得る。したがって、いくつかの例では、トラック生成ユニット６４は、ＮＡＬユニットのみを含む第１のトラックと、第１のトラックのＮＡＬユニットのすべてまたはそのサブセットを参照する１つまたは複数のエクストラクタを含む第２のトラックとを有するビデオファイルを構築し得る。その上、いくつかの例では、トラック生成ユニット６４は、第１のトラック中に含まれない追加のＮＡＬユニットを第２のトラック中に含み得る。同様に、本開示の技法は、複数のトラックに拡張され得る。たとえば、トラック生成ユニット６４は、第１のトラックのＮＡＬユニットおよび／または第２のトラックのＮＡＬユニットを参照し得る第３のトラックを構築し得、第１または第２のトラック中に含まれないＮＡＬユニットをさらに含み得る。 Track generation unit 64 may include additional NAL units in the media extractor track in some examples. That is, the media extractor track may include a NAL unit and an extractor. Thus, in some examples, the track generation unit 64 includes a first track that includes only NAL units, and a first track that includes one or more extractors that reference all or a subset of the NAL units of the first track. A video file having two tracks can be constructed. Moreover, in some examples, the track generation unit 64 may include additional NAL units in the second track that are not included in the first track. Similarly, the techniques of this disclosure may be extended to multiple tracks. For example, the track generation unit 64 may build a third track that may refer to the NAL unit of the first track and / or the NAL unit of the second track and is not included in the first or second track. It may further include a NAL unit.

FIG. 3 is a block diagram illustrating an example file 100 that includes a first track having a set of video samples and a second track having extractors that reference a subset of the video samples of the first track. In the example of FIG. 3, file 100 includes MOOV box 102 and media data (MDAT) box 110. MOOV box 102 corresponds to a movie box, which the ISO base media file format defines as a container box whose sub-boxes define the metadata for a presentation. MDAT box 110 corresponds to a media data box, which the ISO base media file format defines as a box that can hold the actual media data for a presentation.

図３の例では、ＭＯＯＶボックス１０２は、完全なサブセットトラック１０４とメディアエクストラクタトラック１０６とを含む。ＩＳＯベースメディアファイルフォーマットは、ＩＳＯベースメディアファイル中の関連するサンプルの時限シーケンスとして「トラック」を定義している。ＩＳＯベースメディアファイルフォーマットは、さらに、メディアデータについて、トラックが一連の画像またはサンプリングされたオーディオに対応することに言及している。 In the example of FIG. 3, the MOOV box 102 includes a complete subset track 104 and a media extractor track 106. The ISO base media file format defines a “track” as a timed sequence of related samples in an ISO base media file. The ISO base media file format further mentions that for media data, a track corresponds to a series of images or sampled audio.

ＭＤＡＴボックス１１０は、図３の例では、Ｉ符号化サンプル１１２と、Ｐ符号化サンプル１１４と、Ｂ符号化サンプル１１６と、Ｂ符号化サンプル１１８とを含む。Ｂ符号化サンプル１１６およびＢ符号化サンプル１１８は、異なる階層符号化レベルであると見なされる。図３の例では、Ｂ符号化サンプル１１６は、Ｂ符号化サンプル１１８のための参照として使用され得、したがって、Ｂ符号化サンプル１１８は、Ｂ符号化サンプル１１６の階層符号化レベルよりも低い階層符号化レベルであり得る。サンプルの表示順序は、（復号順序とも呼ばれる）階層順序、およびサンプルがＭＤＡＴボックス１１０中に含まれる順序とは異なり得る。たとえば、Ｉ符号化サンプル１１２は表示順序値０と復号順序値０とを有し得、Ｐ符号化サンプル１１４は表示順序値２と復号順序値１とを有し得、Ｂ符号化サンプル１１６は表示順序値１と復号順序値２とを有し得、Ｂ符号化サンプル１１８は表示順序値４と復号順序値３とを有し得る。トラック１は、追加のサンプル、たとえば、表示順序値３と復号順序値４とをもつサンプルを含み得る。 In the example of FIG. 3, the MDAT box 110 includes an I encoded sample 112, a P encoded sample 114, a B encoded sample 116, and a B encoded sample 118. B coded sample 116 and B coded sample 118 are considered to be at different hierarchical coding levels. In the example of FIG. 3, B coded sample 116 may be used as a reference for B coded sample 118, and therefore B coded sample 118 is a hierarchy lower than the hierarchical coding level of B coded sample 116. It can be an encoding level. The display order of the samples may be different from the hierarchical order (also called decoding order) and the order in which the samples are included in the MDAT box 110. For example, the I encoded sample 112 may have a display order value 0 and a decoding order value 0, the P encoded sample 114 may have a display order value 2 and a decoding order value 1, and the B encoded sample 116 may be The display order value 1 and the decoding order value 2 may be included, and the B encoded sample 118 may have the display order value 4 and the decoding order value 3. Track 1 may include additional samples, eg, samples with display order value 3 and decoding order value 4.
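The difference between display order and decoding order in this example can be made concrete with a short Python sketch. The order values are those given in the text for samples 112 to 118; the `Sample` type and the label strings are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    name: str
    display_order: int
    decoding_order: int


# Order values as stated for the samples of FIG. 3.
samples = [
    Sample("I-112", 0, 0),
    Sample("P-114", 2, 1),
    Sample("B-116", 1, 2),
    Sample("B-118", 4, 3),
]

# The decoder consumes samples in decoding order; the display presents them in display order.
decode_sequence = [s.name for s in sorted(samples, key=lambda s: s.decoding_order)]
display_sequence = [s.name for s in sorted(samples, key=lambda s: s.display_order)]
print(decode_sequence)
print(display_sequence)
```

The P-encoded sample is decoded before the B-encoded sample that precedes it in display order, which is why the two sequences differ.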

Each of I-encoded sample 112, P-encoded sample 114, B-encoded sample 116, and B-encoded sample 118 may correspond to various NAL units or access units. The ISO base media file format defines a "sample" as all the data associated with a single timestamp, e.g., an individual frame of video, a series of video frames in decoding order, or a compressed section of audio in decoding order. Complete subset track 104, in the example of FIG. 3, includes metadata that references I-encoded sample 112, P-encoded sample 114, B-encoded sample 116, and B-encoded sample 118.

MDAT box 110 further includes extractor 120, extractor 122, and extractor 124. Thus, extractors 120-124 are included in MDAT box 110, the box that would generally contain samples of data. In the example of FIG. 3, extractor 120 references I-encoded sample 112, extractor 122 references P-encoded sample 114, and extractor 124 references B-encoded sample 118. There may be two or more NAL units corresponding to I-encoded sample 112, P-encoded sample 114, and/or B-encoded sample 118, and those NAL units may be non-consecutive. In accordance with the techniques of this disclosure, extractors 120-124 may nevertheless identify each of the NAL units of the corresponding sample, even though there may be two or more non-consecutive NAL units in the corresponding sample. Media extractor track 106, in the example of FIG. 3, includes metadata that references extractor 120, extractor 122, and extractor 124.

Each of extractors 120-124 may also include a display order value and a decoding order value. For example, extractor 120 may have a display order value of 0 and a decoding order value of 0, extractor 122 may have a display order value of 1 and a decoding order value of 1, and extractor 124 may have a display order value of 2 and a decoding order value of 2. In some examples, the display and/or decoding order values may skip certain values, e.g., to match the values of the identified samples.

Complete subset track 104 and media extractor track 106 may form an alternate group, such that demultiplexer 38 (FIG. 1) may select either complete subset track 104 or media extractor track 106 to be decoded by video decoder 48. With respect to the example of MVC, complete subset track 104 may correspond to a first operating point, while media extractor track 106 may correspond to a second operating point. With respect to the example of 3GPP, complete subset track 104 and media extractor track 106 may form a switch group. In this manner, complete subset track 104 and media extractor track 106 may be used to adapt to bandwidth availability and decoder capability, e.g., in HTTP streaming applications.

When complete subset track 104 is selected, demultiplexer 38 may send the samples corresponding to complete subset track 104 (e.g., I-encoded sample 112, P-encoded sample 114, B-encoded sample 116, and B-encoded sample 118) to video decoder 48. When media extractor track 106 is selected, demultiplexer 38 may send the samples corresponding to media extractor track 106, including the samples identified by the media extractors of media extractor track 106, to video decoder 48. Thus, when media extractor track 106 is selected, demultiplexer 38 may send I-encoded sample 112, P-encoded sample 114, and B-encoded sample 118 to video decoder 48, which demultiplexer 38 may retrieve from complete subset track 104 by dereferencing extractor 120, extractor 122, and extractor 124.
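The selection and dereferencing behavior described for FIG. 3 can be sketched in Python as follows. The dictionaries and the function name are hypothetical stand-ins for the file's track metadata; only the reference numerals and the extractor-to-sample mapping are taken from the text.

```python
# Entries of each track, by reference numeral: samples 112-118 and extractors 120-124.
TRACKS = {
    104: [112, 114, 116, 118],  # complete subset track: samples only
    106: [120, 122, 124],       # media extractor track: extractors only
}
# Which sample each extractor references.
EXTRACTORS = {120: 112, 122: 114, 124: 118}


def samples_for_decoder(track_id):
    """Return the samples a demultiplexer would pass on for the selected track."""
    passed = []
    for entry in TRACKS[track_id]:
        # Dereference extractors; ordinary samples are passed through unchanged.
        passed.append(EXTRACTORS.get(entry, entry))
    return passed


full = samples_for_decoder(104)
subset = samples_for_decoder(106)
discarded = sorted(set(full) - set(subset))
print(full, subset, discarded)
```

When the extractor track is selected in this sketch, sample 116 is never referenced and is therefore not passed to the decoder.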

図４は、２つの別個のエクストラクタトラック１４６、１４８を含む別の例示的なファイル１４０を示すブロック図である。図４の例では、２つのエクストラクタトラックが示されているが、概して、ファイルは任意の数のエクストラクタトラックを含み得る。図４の例では、ファイル１４０は、ＭＯＯＶボックス１４２とＭＤＡＴボックス１５０とを含む。ＭＯＯＶボックス１４２は、完全なサブセットトラック１４４とメディアエクストラクタトラック１４６、１４８とを含む。ＭＤＡＴボックス１５０は、様々なトラックのためのデータのサンプルおよびエクストラクタ、たとえば、Ｉ符号化サンプル１５２、Ｐ符号化サンプル１５４、Ｂ符号化サンプル１５６、Ｂ符号化サンプル１５８、およびエクストラクタ１６０〜１６８を含む。 FIG. 4 is a block diagram illustrating another exemplary file 140 that includes two separate extractor tracks 146, 148. In the example of FIG. 4, two extractor tracks are shown, but in general, a file may include any number of extractor tracks. In the example of FIG. 4, the file 140 includes a MOOV box 142 and an MDAT box 150. The MOOV box 142 includes a complete subset track 144 and media extractor tracks 146, 148. The MDAT box 150 includes data samples and extractors for various tracks, eg, I-coded samples 152, P-coded samples 154, B-coded samples 156, B-coded samples 158, and extractors 160-168. including.

In the example of FIG. 4, extractors 160-164 correspond to media extractor track 146, while extractors 166-168 correspond to media extractor track 148. In this example, extractor 160 of media extractor track 146 identifies I-encoded sample 152, extractor 162 identifies P-encoded sample 154, and extractor 164 identifies B-encoded sample 156. For media extractor track 148, extractor 166 identifies I-encoded sample 152, while extractor 168 identifies P-encoded sample 154. The example of FIG. 4 thus demonstrates that two or more extractors of different media extractor tracks may refer to the same sample of a complete subset track.

A media extractor track may be used to represent a temporal subset of a video stream that is decodable and is an alternate/switch track of the track containing the original, full temporal resolution bitstream, e.g., complete subset track 144. Complete subset track 144 may represent, for example, a 30 frames per second (FPS) video stream. In some examples, by not including B-pictures of a certain hierarchical level in a sub-bitstream, the frame rate of the sub-bitstream may be halved or reduced by some other fraction. For example, media extractor track 146 may have a frame rate that is halved relative to complete subset track 144 by not including B-encoded sample 158; that is, media extractor track 146 may have a frame rate of 15 FPS. Similarly, media extractor track 148 may have a frame rate that is halved relative to media extractor track 146 by omitting both B-encoded sample 156 and B-encoded sample 158, and thus may have a frame rate of 7.5 FPS.
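The frame rate arithmetic of this example can be checked with a small Python sketch. It assumes, as in the example, that each omitted hierarchical level of B-pictures halves the frame rate; the text notes that other reduction fractions are possible, and the function name is hypothetical.

```python
def reduced_frame_rate(full_fps: float, levels_omitted: int) -> float:
    """Frame rate after omitting the given number of hierarchical B-picture levels.

    Assumption of this sketch: every omitted level halves the frame rate.
    """
    if levels_omitted < 0:
        raise ValueError("levels_omitted must be non-negative")
    return full_fps / (2 ** levels_omitted)


# Number of B-picture levels omitted by each track of FIG. 4.
levels_omitted_by_track = {
    "complete subset track 144": 0,
    "media extractor track 146": 1,  # omits B-encoded sample 158
    "media extractor track 148": 2,  # omits B-encoded samples 156 and 158
}
rates = {
    name: reduced_frame_rate(30.0, omitted)
    for name, omitted in levels_omitted_by_track.items()
}
print(rates)
```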

図５は、サブセットトラック１８８と、２つのメディアエクストラクタトラック１８４、１８６とを含む別の例示的なファイル１８０を示すブロック図である。ファイル１８０のＭＯＯＶボックス１８２は、サブセットトラック１８８と、メディアエクストラクタトラック１８４、１８６とを含むが、ＭＤＡＴボックス１９０は、Ｉ符号化サンプル１９２と、Ｐ符号化サンプル１９４と、Ｂ符号化サンプル２０２と、Ｂ符号化サンプル２０８と、エクストラクタ１９８、２００、２０４、２０６および２１０とを含む。 FIG. 5 is a block diagram illustrating another exemplary file 180 that includes a subset track 188 and two media extractor tracks 184, 186. The MOOV box 182 of the file 180 includes a subset track 188 and media extractor tracks 184 and 186, while the MDAT box 190 includes an I encoded sample 192, a P encoded sample 194, and a B encoded sample 202. , B encoded samples 208 and extractors 198, 200, 204, 206 and 210.

上記で説明したように、メディアエクストラクタトラックは、別のトラックのサンプルを参照するエクストラクタを含み得る。さらに、メディアエクストラクタトラックは、別のトラック中に含まれない追加のビデオサンプルをさらに含み得る。図５の例では、サブセットトラック１８８は、Ｉ符号化サンプル１９２とＰ符号化サンプル１９４とを含む。メディアエクストラクタトラック１８６は、エクストラクタ１９８、２００を含み、Ｂ符号化サンプル２０２をさらに含む。同様に、メディアエクストラクタトラック１８４は、エクストラクタ２０４、２０６、２１０と、さらにＢ符号化サンプル２０８とを含む。 As explained above, a media extractor track may include an extractor that references a sample of another track. Further, the media extractor track may further include additional video samples that are not included in another track. In the example of FIG. 5, subset track 188 includes I encoded samples 192 and P encoded samples 194. Media extractor track 186 includes extractors 198, 200 and further includes B encoded samples 202. Similarly, media extractor track 184 includes extractors 204, 206, 210 and B encoded samples 208.

図５の例では、メディアエクストラクタトラック１８６は、ビデオデータの符号化サンプル（Ｂ符号化サンプル２０２）を含み、メディアエクストラクタトラック１８４は、符号化サンプルを含むメディアエクストラクタトラック１８６のサンプルを参照するエクストラクタ２１０を含む。すなわち、図５の例では、エクストラクタ２１０は、Ｂ符号化サンプル２０２を参照する。したがって、メディアエクストラクタトラック１８４は、ビットストリームの完全時間分解能を表し得るが、メディアエクストラクタトラック１８６およびサブセットトラック１８８は、完全時間分解能ビットストリームのサブセットを表し得る。すなわち、メディアエクストラクタトラック１８６およびサブセットトラック１８８は、メディアエクストラクタトラック１８４によって表される完全時間分解能よりも低い時間分解能（たとえば、より低いフレームレート）を有し得る。 In the example of FIG. 5, media extractor track 186 includes encoded samples of video data (B encoded sample 202), and media extractor track 184 refers to a sample of media extractor track 186 that includes encoded samples. Including an extractor 210. That is, in the example of FIG. 5, the extractor 210 refers to the B encoded sample 202. Thus, media extractor track 184 may represent the full time resolution of the bitstream, while media extractor track 186 and subset track 188 may represent a subset of the full time resolution bitstream. That is, media extractor track 186 and subset track 188 may have a lower temporal resolution (eg, a lower frame rate) than the full temporal resolution represented by media extractor track 184.

In accordance with the techniques of this disclosure, the H.264/AVC file format may be modified to include extractor tracks that can be extracted as any conforming temporal subset of the track containing the original full-temporal-resolution bitstream. For H.264/AVC, which supports hierarchical B (or P) picture coding, assuming there are N temporal levels, each sub-bitstream containing the samples of temporal levels 0 through k (k < N) can be extracted by defining a corresponding extractor track. Thus, for the same video, there may be N tracks (including N-1 extractor tracks) that form an alternate/switch group. An extractor may be associated with a temporal hierarchy level corresponding to the temporal hierarchy level of the sample identified by the extractor. Also, for example, a temporal identifier value specifying the temporal level of the sample may be signaled in the extractor.
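The grouping of samples into per-level tracks described above can be sketched as follows. This is a minimal illustration, not the file format's actual syntax: the `Sample` type, the function names, and the five-sample example hierarchy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    index: int        # position of the sample in the full-resolution track
    temporal_id: int  # temporal level of the sample, 0 being the lowest


def temporal_subset(samples, k):
    """Return the indices of all samples whose temporal level is at most k."""
    return [s.index for s in samples if s.temporal_id <= k]


def build_alternate_group(samples):
    """Map each highest level k to the sample indices that its track covers."""
    n_levels = max(s.temporal_id for s in samples) + 1
    return {k: temporal_subset(samples, k) for k in range(n_levels)}


# Hypothetical hierarchy with N = 3 temporal levels over five samples.
samples = [Sample(i, t) for i, t in enumerate([0, 2, 1, 2, 0])]
group = build_alternate_group(samples)
```

With N = 3 levels, the entries for k = 0 and k = 1 would be the N-1 extractor tracks, and the entry for k = 2 covers every sample of the full-resolution track.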

FIGS. 6A-6C are block diagrams illustrating an example MDAT box 220 of a file that includes examples of media extractors for various media extractor tracks. Each of FIGS. 6A-6C shows anchor samples 222, which include view 0 sample 224A, view 2 sample 226A, view 1 sample 228A, view 4 sample 230A, and view 3 sample 232A, and non-anchor samples 223, which include view 0 sample 224B, view 2 sample 226B, view 1 sample 228B, view 4 sample 230B, and view 3 sample 232B. The ellipsis next to non-anchor samples 223 indicates that additional samples may be included in MDAT box 220. The anchor samples and non-anchor samples may together form a first track of the file. In one example, in accordance with the techniques of this disclosure, the media extractor track for each set of extractors of the file shown in FIGS. 6A-6C may correspond to a separate operating point of a video file conforming to the MVC file format. In this manner, the techniques of this disclosure may be used to generate one or more media extractor tracks corresponding to operating points of a video file conforming to the MVC file format.

FIGS. 6A-6C show extractors 240, 244, 250 of various media extractor tracks. Extractors 240, 244, 250 are each included in MDAT box 220 but are shown in separate figures for clarity. That is, when fully assembled, MDAT box 220 may include each of the sets of extractors 240, 244, and 250.

FIGS. 6A-6C provide examples of a file that includes media extractors as well as a track containing actual video samples. The various samples may be placed separately in different tracks according to their temporal levels. For each temporal level, a particular track may include all of the video samples of that level as well as extractors pointing to the tracks with lower temporal levels. Although the video samples (NAL units) may be separated into different tracks, the tracks with higher frame rates may have extractors pointing to the other tracks. In this way, it is possible to have movie fragments that contain samples of only one temporal level, and such a movie fragment may, in some cases, include extractors pointing to other fragments. In this case, movie fragments of different tracks may be interleaved in ascending order of temporal level if they do not have the same time period.

FIG. 6A provides an example of extractor set 240, which includes extractors 242A-242N corresponding to a media extractor track. In this example, extractor 242A references view 0 sample 224A of anchor samples 222, and extractor 242N references view 0 sample 224B of non-anchor samples 223. In general, in the example of FIG. 6A, each extractor of extractor set 240 references the corresponding view 0 sample. Each of extractors 242A-242N corresponds to a common media extractor track, which may belong to a switch group and/or an alternate group. The media extractor track may further correspond to an individual operating point, e.g., an operating point that includes view 0.

In some examples, for stereo video coded using MVC, there may be three operating points: one operating point that supports outputting two views, and a second operating point that supports outputting only one view (e.g., only view 0 or only view 1). The third operating point may be an operating point that outputs view 1. Depending on the prediction relationship, the third operating point may include only the VCL NAL units and associated non-VCL NAL units in view 1, all NAL units of view 0 and view 1, or the NAL units in view 1 together with the anchor NAL units (that is, the NAL units of anchor view components). For such a stereo case, examples of the disclosed techniques may provide that the other two operating points can be represented by two extractor tracks. These two extractor tracks may form a switch group and, together with the original video track, the three tracks may form an alternate group.

This disclosure provides techniques for modifying the MVC file format to include MVC media extractor tracks. In general, MVC video tracks, including MVC media extractor tracks, that have the same number of views for output may be characterized as a switch group. All operating points represented by the tracks of a file may belong to one alternate group of the MVC video presentation. The views of anchor samples 222 and non-anchor samples 223 may together form a complete subset track, e.g., an operating point that includes all of the available views.

An extractor may reference a contiguous portion of a sample, as shown, for example, with respect to extractors 246A-246N in FIG. 6B. In the example of FIG. 6B, extractor 246A references view 0 sample 224A and view 2 sample 226A. The data structure representing extractor 246A may specify a byte range for the identified views, a starting view and an ending view, a starting view and a number of subsequent views, or another representation of the contiguous series of views identified by the extractor. The set of extractors 244 may correspond to another media extractor track, which in turn may correspond to a separate MVC operating point.

Two extractors may also reference two portions of a sample (e.g., two non-contiguous views), as shown, for example, with respect to extractors 254A, 256A in FIG. 6C. For example, extractor sample 252A includes extractor 254A, which references view 0 sample 224A and view 2 sample 226A, and extractor 254B, which references view 4 sample 230A. Accordingly, the sample represented by extractor sample 252A may correspond to an extractor sample that references non-contiguous view samples. Similarly, in the example of FIG. 6C, extractor sample 252N includes extractor 256A, which references view 0 sample 224B and view 2 sample 226B, and extractor 256B, which references view 4 sample 230B.
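As an illustration of the contiguous and non-contiguous cases, the sketch below represents each extractor as a hypothetical (start position, count) pair over the stored order of view samples in FIGS. 6A-6C (views 0, 2, 1, 4, 3). This pair is just one of the representations the text allows, and all names are assumptions made for the example.

```python
# Stored order of the view samples in each access unit of FIGS. 6A-6C.
STORED_VIEW_ORDER = [0, 2, 1, 4, 3]


def resolve(extractors, access_unit=STORED_VIEW_ORDER):
    """Return the view numbers referenced by a list of (start, count) extractors."""
    views = []
    for start, count in extractors:
        views.extend(access_unit[start:start + count])
    return views


fig_6a = [(0, 1)]          # one extractor referencing view 0
fig_6b = [(0, 2)]          # one extractor referencing the contiguous views 0 and 2
fig_6c = [(0, 2), (3, 1)]  # two extractors: views 0 and 2, then view 4
```

Under this layout, `resolve(fig_6c)` skips the view 1 sample stored between view 2 and view 4, which is why the non-contiguous case needs a second extractor.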

An extractor may also be defined with respect to anchor or non-anchor samples, and an extractor defined with respect to anchor samples may reference different views than an extractor defined with respect to non-anchor samples.

The MVC media extractor tracks described above, in the ISO base media file format or the MVC file format, may be implemented with similar extraction functionality and may be instances of metadata tracks that can be used to represent alternate and/or switch tracks of ordinary video tracks.

In examples that use the MVC file format, the complete bitstream may be contained in one track, and all other possible operating points may be represented by extractor tracks, each of which may signal, for example, the number of views for output, the view identifier values of the views for output, the bandwidth required for transmission, and the frame rate.

FIG. 7 is a conceptual diagram illustrating an example MVC prediction pattern. In the example of FIG. 7, eight views (having view IDs "S0" through "S7") are illustrated, and twelve temporal locations ("T0" through "T11") are illustrated for each view. That is, each row in FIG. 7 corresponds to a view, while each column indicates a temporal location.

Although MVC has a so-called base view that is decodable by H.264/AVC decoders, and a stereo view pair can also be supported by MVC, an advantage of MVC is that it can support an example that uses more than two views as a 3D video input and decodes the 3D video represented by the multiple views. A renderer of a client having an MVC decoder may expect 3D video content with multiple views. Anchor view components and non-anchor view components in a view can have different view dependencies. For example, anchor view components in view S2 depend on view components in view S0. Non-anchor view components in view S2, however, do not depend on view components in other views.

Frames in FIG. 7 are indicated for each row and each column using a shaded block containing a letter, which designates whether the corresponding frame is intra-coded (that is, an I-frame), inter-coded in one direction (that is, as a P-frame), or inter-coded in multiple directions (that is, as a B-frame). In general, predictions are indicated by arrows, where the frame an arrow points to uses the frame the arrow points from as a prediction reference. For example, the P-frame of view S2 at temporal location T0 is predicted from the I-frame of view S0 at temporal location T0.

As with single-view video encoding, frames of a multiview video coding video sequence may be predictively encoded with respect to frames at different temporal locations. For example, the b-frame of view S0 at temporal location T1 has an arrow pointing to it from the I-frame of view S0 at temporal location T0, indicating that the b-frame is predicted from the I-frame. In the context of multiview video encoding, however, frames may additionally be inter-view predicted. That is, a view component can use the view components in other views for reference. In MVC, for example, inter-view prediction is realized as if the view component in another view were an inter-prediction reference. The potential inter-view references are signaled in the sequence parameter set (SPS) MVC extension and can be modified by the reference picture list construction process, which enables flexible ordering of the inter-prediction or inter-view prediction references. Table 1 below provides an example definition of the MVC extension sequence parameter set.

FIG. 7 provides various examples of inter-view prediction. In the example of FIG. 7, frames of view S1 are illustrated as being predicted from frames at different temporal locations of view S1, as well as inter-view predicted from frames of views S0 and S2 at the same temporal locations. For example, the b-frame of view S1 at temporal location T1 is predicted from each of the B-frames of view S1 at temporal locations T0 and T2, as well as from the b-frames of views S0 and S2 at temporal location T1.

In the example of FIG. 7, the capital "B" and lowercase "b" are intended to indicate different hierarchical relationships between frames rather than different encoding methods. In general, capital "B" frames are relatively higher in the prediction hierarchy than lowercase "b" frames. That is, in the example of FIG. 7, "b" frames are encoded with reference to "B" frames. Additional hierarchical levels may be added, having additional bidirectionally encoded frames that may refer to the "b" frames of FIG. 7. FIG. 7 also illustrates variations in the prediction hierarchy using different levels of shading: frames with a greater amount of shading (that is, relatively darker frames) are higher in the prediction hierarchy than frames with less shading (that is, relatively lighter frames). For example, all I-frames in FIG. 7 are illustrated with full shading, while P-frames have a somewhat lighter shading, and B-frames (and lowercase b-frames) have various levels of shading relative to each other, but always lighter than the shading of the P-frames and the I-frames.

In general, the prediction hierarchy is related to view order indexes, in that frames relatively higher in the prediction hierarchy should be decoded before frames relatively lower in the hierarchy, so that the higher frames can be used as reference frames during decoding of the lower frames. A view order index is an index that indicates the decoding order of view components in an access unit. The view order indexes are implied in the SPS MVC extension, as specified in Annex H of H.264/AVC (the MVC amendment). In the SPS, for each index i, the corresponding view_id is signaled. The decoding of the view components follows the ascending order of the view order index. If all of the views are presented, the view order indexes are in consecutive order from 0 to num_views_minus_1.

In this manner, frames used as reference frames may be decoded before the frames that are encoded with reference to them. A view order index is an index that indicates the decoding order of view components in an access unit. For each view order index i, the corresponding view_id is signaled. The decoding of the view components follows the ascending order of the view order indexes. If all of the views are presented, the set of view order indexes comprises a consecutively ordered set from zero to one less than the total number of views.

For certain frames at equal levels of the hierarchy, the decoding order relative to each other may not matter. For example, the I-frame of view S0 at temporal location T0 is used as a reference frame for the P-frame of view S2 at temporal location T0, which in turn is used as a reference frame for the P-frame of view S4 at temporal location T0. Accordingly, the I-frame of view S0 at temporal location T0 should be decoded before the P-frame of view S2 at temporal location T0, which should be decoded before the P-frame of view S4 at temporal location T0. Between views S1 and S3, however, the decoding order does not matter, because views S1 and S3 do not rely on each other for prediction and are instead predicted only from views that are higher in the prediction hierarchy. Moreover, view S1 may be decoded before view S4, so long as view S1 is decoded after views S0 and S2.

In this manner, a hierarchical ordering may be used to describe views S0 through S7. Let the notation SA > SB mean that view SA should be decoded before view SB. Using this notation, S0 > S2 > S4 > S6 > S7 in the example of FIG. 7. Also with respect to the example of FIG. 7, S0 > S1, S2 > S1, S2 > S3, S4 > S3, S4 > S5, and S6 > S5. Any decoding order for the views that does not violate these requirements is possible. Accordingly, many different decoding orders are possible, subject to only a few limitations. Two example decoding orders are presented below, although it should be understood that many other decoding orders are possible. In one example, shown in Table 2 below, views are decoded as soon as possible.

The example of Table 2 recognizes that view S1 may be decoded immediately after views S0 and S2 have been decoded, view S3 may be decoded immediately after views S2 and S4 have been decoded, and view S5 may be decoded immediately after views S4 and S6 have been decoded.

Table 3 below provides another example decoding order, in which every view that is used as a reference for another view is decoded before the views that are not used as a reference for any other view.

The example of Table 3 recognizes that frames of views S1, S3, S5, and S7 do not act as reference frames for frames of any other views, and therefore views S1, S3, S5, and S7 may be decoded after the frames of the views that are used as reference frames, that is, views S0, S2, S4, and S6 in the example of FIG. 7. Relative to each other, views S1, S3, S5, and S7 may be decoded in any order. Accordingly, in the example of Table 3, view S7 is decoded before each of views S1, S3, and S5.
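The ordering requirements listed for FIG. 7 can be checked mechanically. The sketch below encodes each "SA > SB" relation as a pair and tests a candidate decoding order against all of them. The function name and the two candidate orders are illustrative assumptions chosen to match the two strategies described above; they are not a reproduction of Tables 2 and 3.

```python
# Each pair (a, b) means view a must be decoded before view b (a > b in the text).
CONSTRAINTS = [
    ("S0", "S2"), ("S2", "S4"), ("S4", "S6"), ("S6", "S7"),
    ("S0", "S1"), ("S2", "S1"), ("S2", "S3"), ("S4", "S3"),
    ("S4", "S5"), ("S6", "S5"),
]


def is_valid_order(order, constraints=CONSTRAINTS):
    """Return True if every constraint pair appears in the required order."""
    position = {view: i for i, view in enumerate(order)}
    return all(position[a] < position[b] for a, b in constraints)


# Hypothetical order that decodes each view as soon as its references are ready.
as_soon_as_possible = ["S0", "S2", "S1", "S4", "S3", "S6", "S5", "S7"]
# Hypothetical order that decodes all reference views first.
references_first = ["S0", "S2", "S4", "S6", "S7", "S1", "S3", "S5"]
# An order that violates S0 > S1 and S2 > S1.
invalid = ["S1", "S0", "S2", "S4", "S3", "S6", "S5", "S7"]
```

Both hypothetical orders pass the check, while the third fails because view S1 would be decoded before the views it is predicted from.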

To be clear, there may be a hierarchical relationship between frames of each view as well as between the temporal locations of the frames of each view. With respect to the example of FIG. 7, frames at temporal location T0 are either intra-predicted or inter-view predicted from frames of other views at temporal location T0. Similarly, frames at temporal location T8 are either intra-predicted or inter-view predicted from frames of other views at temporal location T8. Accordingly, with respect to the temporal hierarchy, temporal locations T0 and T8 are at the top of the temporal hierarchy.

In the example of FIG. 7, frames at temporal location T4 are lower in the temporal hierarchy than frames at temporal locations T0 and T8, because frames at temporal location T4 are B-encoded with reference to frames at temporal locations T0 and T8. Frames at temporal locations T2 and T6 are lower in the temporal hierarchy than frames at temporal location T4. Finally, frames at temporal locations T1, T3, T5, and T7 are lower in the temporal hierarchy than frames at temporal locations T2 and T6.
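The temporal hierarchy just described follows a dyadic pattern, which can be sketched as below. The function is an assumption made for illustration: it presumes a repeating group of eight temporal locations (a power of two) and numbers the levels from 0 at the top of the hierarchy, neither of which is stated in the text beyond the T0 through T8 example.

```python
def temporal_level(t, gop_size=8):
    """Return the hierarchy level of temporal location t (0 is the top).

    Assumes gop_size is a power of two and the pattern repeats every gop_size.
    """
    pos = t % gop_size
    if pos == 0:
        return 0
    level = gop_size.bit_length() - 1  # log2(gop_size), i.e. the lowest level
    while pos % 2 == 0:
        pos //= 2
        level -= 1
    return level


levels = [temporal_level(t) for t in range(9)]
```

For T0 through T8 this places T0 and T8 at level 0, T4 at level 1, T2 and T6 at level 2, and the odd locations at level 3, matching the relationships described for FIG. 7.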

In MVC, a subset of a whole bitstream can be extracted to form a sub-bitstream that still conforms to MVC. There are many possible sub-bitstreams that specific applications may require, based on, for example, a service provided by a server; the capacity, support, and capabilities of decoders of one or more clients; and/or the preference of one or more clients. For example, a client might require only three views, and there might be two scenarios. In one example, one client may require a smooth viewing experience and might prefer views with view_id values S0, S1, and S2, while another client may require view scalability and prefer views with view_id values S0, S2, and S4. If the view_ids are originally ordered with respect to the example of Table 9, the view order index values in these two examples are {0, 1, 2} and {0, 1, 4}, respectively. Note that both of these sub-bitstreams can be decoded as independent MVC bitstreams and can be supported simultaneously.

There may be many MVC sub-bitstreams that are decodable by MVC decoders. In theory, any combination of views that satisfies the following two properties can be decoded by an MVC decoder conforming to a certain profile or level: (1) the view components in each access unit are ordered in ascending order of view order index, and (2) for each view in the combination, the views it depends on are also included in the combination.
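Property (2) above is a closure check over view dependencies and can be sketched as follows. The dependency map is an assumption derived by treating each "decode before" relation in the FIG. 7 discussion as a direct dependency, and the function name is hypothetical. Property (1), the ordering within each access unit, is not covered by this sketch.

```python
# Assumed direct dependencies of each view, derived from the FIG. 7 discussion.
DEPENDS_ON = {
    "S0": set(),
    "S1": {"S0", "S2"},
    "S2": {"S0"},
    "S3": {"S2", "S4"},
    "S4": {"S2"},
    "S5": {"S4", "S6"},
    "S6": {"S4"},
    "S7": {"S6"},
}


def is_decodable_combination(views):
    """Return True if every view's dependencies are also in the combination."""
    selected = set(views)
    return all(DEPENDS_ON[view] <= selected for view in selected)
```

Because the check is applied to every selected view, indirect dependencies are covered as well: selecting S4 requires S2, and S2 in turn requires S0. Under the assumed map, both {S0, S1, S2} and {S0, S2, S4} pass, while {S0, S1} fails because S2 is missing.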

With respect to the techniques of this disclosure, various MVC sub-bitstreams may be represented using media extractor tracks and/or pure video sample tracks. Each of these tracks may correspond to an MVC operating point.

FIGS. 8-21 are block diagrams illustrating various examples of data structures for media extractors, and of other supporting data structures, that may be used in accordance with the techniques of this disclosure. The various media extractors of FIGS. 8-22 include various features, as described in detail below. In general, any of the media extractors of FIGS. 8-21 may be included in a media extractor track of a file conforming to the ISO base media file format, or to an extension of the ISO base media file format, to identify coded samples of the file. In general, a media extractor may be used to extract one or more whole samples from a referenced track. FIGS. 8-12 are examples of media extractors that are capable of identifying one video sample box of another track. As shown in FIG. 13, another way to implement an extractor is to enable sample grouping of samples from another track. To provide more specific support for temporal scalability, a temporal identifier may be signaled, as shown in FIG. 14. FIGS. 16-22 are examples of media extractors for MVC, which are capable of extracting one or more potentially non-contiguous NAL units from each video sample box (access unit). Various examples of extractors are based on an offset and a length in bytes within a file or access unit, but other examples can be based purely on the indexes of whole NAL units, so that signaling of byte ranges is not necessary. The mechanism of signaling extractors with indexes of whole NAL units may also be extended to the SVC file format.

The examples of FIGS. 8-21 may also be applied directly to the 3GPP file format, as extensions to the 3GPP file format. In addition, one or more elements and concepts of FIGS. 8-21 may be combined with elements of other figures of FIGS. 8-22 to form other extractors. Although certain ones of FIGS. 8-21 are described with respect to a particular file format, in general the examples of FIGS. 8-21 may be used with any file format having similar characteristics, e.g., the ISO base media file format or extensions of the ISO base media file format. To enable use of the extractors proposed for 3GPP, the 3GPP track selection box may be extended to include more characteristics for each of the (extracted) alternative tracks, such as the temporal identifier, the number of views to be displayed, and the number of views to be decoded, as shown in the example of FIG. 21.

FIG. 8 is a block diagram illustrating an example media extractor 300 that shows the format of a media extractor. In the example of FIG. 8, media extractor 300 includes track reference index 302 and sample offset value 304. In accordance with the techniques of this disclosure, media extractor 300 may correspond to the definition of a data structure that can be instantiated within a media extractor track. Multiplexer 30 may be configured to include extractors conforming to the example of media extractor 300 in a media extractor track of a video file in order to identify NAL units of a different track of the video file. Demultiplexer 38 may be configured to retrieve the NAL units identified by an extractor conforming to media extractor 300.

Track reference index 302 may correspond to an identifier of the track in which the identified NAL unit is present. Each track of a video file may be assigned a unique index to distinguish the tracks of the video file. Track reference index 302 may specify the index of the track reference to use to find the track from which to extract data. The sample in the track from which data is extracted may be exactly temporally aligned with the sample containing the extractor (in the media decoding timeline, using the time-to-sample table, adjusted by the offset specified by sample offset value 304). In some examples, the first track of the video file has an index value of "1", and therefore multiplexer 30 may assign the value "1" to track reference index value 302 to refer to the first track of the video file. The value "0" for the track reference index value may be reserved for future use.

サンプルオフセット値３０４は、メディアエクストラクタトラック中のメディアエクストラクタ３００の時間ロケーションから、トラック参照インデックス３０２によって参照されるトラックの識別されたＮＡＬユニットまでのオフセット値を定義する。すなわち、サンプルオフセット値３０４は、情報源として使用されるリンクされたトラック中のサンプルの相対インデックスを与える。サンプルオフセット値３０４の値０は、エクストラクタを含んでいるサンプルと同じ、または最も近接して先行する復号時間をもつサンプルを参照する。サンプル１は次のサンプルであり、サンプル−１は前のサンプルであり、以下同様である。たとえば、メディアエクストラクタ３００に準拠するメディアエクストラクタが、Ｈ．２６３またはＭＰＥＧ−４ｐａｒｔ２で使用されるとき、メディアエクストラクタは、トラック参照インデックス３０２によって参照されるビデオトラックの時間サブセットを抽出するために使用され得る。 Sample offset value 304 defines an offset from the temporal location of media extractor 300 in the media extractor track to the identified NAL unit of the track referenced by track reference index 302. That is, sample offset value 304 gives the relative index of the sample in the linked track that is used as the source of information. A value of 0 for sample offset value 304 refers to the sample with the same, or the closest preceding, decoding time as the sample containing the extractor. Sample 1 is the next sample, sample -1 is the previous sample, and so on. For example, when a media extractor conforming to media extractor 300 is used with H.263 or MPEG-4 Part 2, the media extractor can be used to extract a temporal subset of the video track referenced by track reference index 302.
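A minimal sketch of how the sample offset might be resolved, assuming the decoding times of the referenced track are available as a sorted list. The function name `resolve_sample` and its parameters are hypothetical; returning `None` for an out-of-range result is a choice made for this sketch, not something the text specifies.

```python
from bisect import bisect_right


def resolve_sample(decode_times, extractor_time, sample_offset):
    """Return the index of the referenced sample, or None if out of range.

    decode_times: sorted decoding times of the referenced track's samples.
    extractor_time: decoding time of the sample containing the extractor.
    """
    # Offset 0: the sample with the same or the closest preceding decoding time.
    base = bisect_right(decode_times, extractor_time) - 1
    if base < 0:
        return None  # no sample at or before the extractor's time
    # Offset 1 is the next sample, offset -1 the previous one, and so on.
    index = base + sample_offset
    return index if 0 <= index < len(decode_times) else None
```

With decoding times `[0, 40, 80, 120]`, an extractor at time 90 and offset 0 resolves to index 2 (the sample at time 80).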

以下の擬似コードは、メディアエクストラクタ３００と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code provides an exemplary definition of a media extractor class similar to media extractor 300.

マルチプレクサ３０およびデマルチプレクサ３８は、上記の例示的な擬似コードにおいて定義されたメディアエクストラクタを使用してメディアエクストラクタデータオブジェクトをインスタンス化し得る。したがって、デマルチプレクサ３８は、たとえば、インスタンス化されたメディアエクストラクタによって参照された別のトラックから識別されたデータを取り出すために、選択されたトラックからデータを取り出すとき、インスタンス化されたメディアエクストラクタを参照し得る。 Multiplexer 30 and demultiplexer 38 may instantiate media extractor data objects using the media extractor defined in the example pseudocode above. Thus, demultiplexer 38 may refer to an instantiated media extractor when retrieving data from a selected track, for example in order to retrieve the identified data from another track referenced by the instantiated media extractor.

例示的な擬似コードでは、クラスＭｅｄｉａＥｘｔｒａｃｔｏｒ（）がバイト整合される。すなわち、エクストラクタがＭｅｄｉａＥｘｔｒａｃｔｏｒ（）クラスからインスタンス化されたとき、エクストラクタは８バイト境界上で整合される。変数「ｔｒａｃｋ＿ｒｅｆ＿ｉｎｄｅｘ」はトラック参照インデックス値３０２に対応し、この例示的な擬似コードでは、符号なしの８バイト整数値に対応する。変数「ｓａｍｐｌｅ＿ｏｆｆｓｅｔ」は、サンプルオフセット値３０４に対応し、この例では、符号付きの８バイト整数値に対応する。 In the exemplary pseudo code, the class MediaExtractor () is byte aligned. That is, when the extractor is instantiated from the MediaExtractor () class, the extractor is aligned on an 8-byte boundary. The variable “track_ref_index” corresponds to the track reference index value 302, and in this exemplary pseudo code corresponds to an unsigned 8-byte integer value. The variable “sample_offset” corresponds to the sample offset value 304, and in this example corresponds to a signed 8-byte integer value.
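As a rough illustration of the two fields described above (not the original listing), the following sketch serializes a `MediaExtractor` with Python's `struct` module. The field widths follow the text as written here (an unsigned and a signed 8-byte integer); the big-endian byte order and the method names `to_bytes` and `from_bytes` are assumptions made for this sketch.

```python
import struct
from typing import NamedTuple

# Assumed layout: big-endian, unsigned 8-byte index, then signed 8-byte offset.
_LAYOUT = struct.Struct(">Qq")


class MediaExtractor(NamedTuple):
    track_ref_index: int  # 1-based; the value 0 is reserved
    sample_offset: int    # signed, relative to the time-aligned sample

    def to_bytes(self) -> bytes:
        """Serialize both fields with no padding between them."""
        return _LAYOUT.pack(self.track_ref_index, self.sample_offset)

    @classmethod
    def from_bytes(cls, data: bytes) -> "MediaExtractor":
        """Parse an extractor from the first 16 bytes of data."""
        return cls(*_LAYOUT.unpack(data[:_LAYOUT.size]))
```

If the fields were instead one byte each, only the format string would change (to `">Bb"`); the rest of the sketch would stay the same.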

図９は、メディアエクストラクタ３１０の別の例を示すブロック図である。メディアエクストラクタ３１０は、トラック参照インデックス３１４とサンプルオフセット値３１６とを含み、さらに、サンプルヘッダ３１２を含む。トラック参照インデックス３１４およびサンプルオフセット値３１６は、概してトラック参照インデックス３０２およびサンプルオフセット値３０４（図８）と同様のデータを含み得る。 FIG. 9 is a block diagram illustrating another example of the media extractor 310. The media extractor 310 includes a track reference index 314 and a sample offset value 316, and further includes a sample header 312. The track reference index 314 and sample offset value 316 may include data that is generally similar to the track reference index 302 and sample offset value 304 (FIG. 8).

サンプルヘッダ３１２は、Ｈ．２６４／ＡＶＣに対応する例では、メディアエクストラクタ３１０によって参照されるビデオサンプルのＮＡＬユニットヘッダに従って構築され得る。サンプルヘッダ３１２は、３つのシンタックス要素、ｆｏｒｂｉｄｄｅｎ＿ｚｅｒｏ＿ｂｉｔ、（３ビットを備え得る）ｎａｌ＿ｒｅｆ＿ｉｄｃ、（５ビットを備え得る）ｎａｌ＿ｕｎｉｔ＿ｔｙｐｅをもつデータの１バイトを含み得る。「ｎａｌ＿ｕｎｉｔ＿ｔｙｐｅ」の値は２９（または任意の他の予約済みの数）であり得、他の２つのシンタックス要素は、識別されたビデオサンプル中のそれらのシンタックス要素と同様であり得る。ＭＰＥＧ−４ｐａｒｔ−２ビジュアルに準拠する例の場合、サンプルヘッダ３１２は、スタートコードプレフィックス「０ｘ０００００１」とスタートコード「０ｘＣ５」（または任意の他の予約済み数）とを含み得る４バイトコードを備え得、「０ｘ」は、「０ｘ」に続く値が１６進値であることを示す。Ｈ．２６３の場合、サンプルヘッダ３１２はまた、通常のビデオサンプルのスタートコードとは異なるバイト整合されたスタートコードを含み得る。エクストラクタが通常のビデオサンプルと考えられ得るように、サンプルヘッダ３１２は、同期の目的でデマルチプレクサ３８によって使用され得る。 In examples corresponding to H.264/AVC, sample header 312 may be constructed according to the NAL unit header of the video sample referenced by media extractor 310. Sample header 312 may include one byte of data with three syntax elements: forbidden_zero_bit, nal_ref_idc (which may comprise 3 bits), and nal_unit_type (which may comprise 5 bits). The value of "nal_unit_type" may be 29 (or any other reserved number), and the other two syntax elements may be similar to the corresponding syntax elements in the identified video sample. For examples conforming to MPEG-4 Part 2 Visual, sample header 312 may comprise a four-byte code that may include the start code prefix "0x000001" and the start code "0xC5" (or any other reserved number), where "0x" indicates that the value following it is hexadecimal. For H.263, sample header 312 may likewise include a byte-aligned start code that differs from the start code of a normal video sample. Sample header 312 may be used by demultiplexer 38 for synchronization purposes, so that an extractor can be treated as a normal video sample.
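The header construction can be sketched as follows. This is an illustration under a stated assumption: the sketch uses the H.264 NAL unit header layout of 1 + 2 + 5 bits (forbidden_zero_bit, nal_ref_idc, nal_unit_type), which is what allows the three fields to fit in a single byte. The function and constant names are hypothetical.

```python
def h264_extractor_header(nal_ref_idc: int, nal_unit_type: int = 29) -> bytes:
    """Build a one-byte, NAL-unit-style header for an extractor.

    Assumes the H.264 layout: 1-bit forbidden_zero_bit, 2-bit nal_ref_idc,
    5-bit nal_unit_type. The default type 29 is the value named in the text.
    """
    if not 0 <= nal_ref_idc < 4:
        raise ValueError("nal_ref_idc must fit in 2 bits")
    if not 0 <= nal_unit_type < 32:
        raise ValueError("nal_unit_type must fit in 5 bits")
    forbidden_zero_bit = 0
    return bytes([(forbidden_zero_bit << 7) | (nal_ref_idc << 5) | nal_unit_type])


# MPEG-4 Part 2 variant from the text: start code prefix 0x000001 followed by
# the start code 0xC5, four bytes in total.
MPEG4_EXTRACTOR_HEADER = bytes([0x00, 0x00, 0x01, 0xC5])
```

A demultiplexer could compare the first byte or bytes of a sample against these values to recognize an extractor.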

以下の擬似コードは、メディアエクストラクタ３１０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code provides an exemplary definition of a media extractor class similar to media extractor 310.

図１０は、エクストラクタ内で識別されたＮＡＬユニットのバイト範囲をシグナリングすることによって、ＮＡＬユニットを識別する例示的なメディアエクストラクタ３２０を示すブロック図である。メディアエクストラクタ３２０は、サンプルヘッダ３１２と同様であり得るサンプルヘッダ３２２と、トラック参照インデックス３０２と同様であり得るトラック参照インデックス３２４とを含む。ただし、サンプルオフセット値ではなく、メディアエクストラクタ３２０の例はデータオフセット値３２６とデータ長値３２８とを含む。 FIG. 10 is a block diagram illustrating an example media extractor 320 that identifies a NAL unit by signaling the byte range of the NAL unit identified in the extractor. The media extractor 320 includes a sample header 322 that can be similar to the sample header 312 and a track reference index 324 that can be similar to the track reference index 302. However, instead of the sample offset value, the example of the media extractor 320 includes a data offset value 326 and a data length value 328.

データオフセット値３２６は、メディアエクストラクタ３２０によって識別されるデータの開始点を記述し得る。すなわち、データオフセット値３２６は、コピーすべき、トラックインデックス値３２４によって識別されるトラック内の第１のバイトへのオフセットを表す値を備え得る。データ長値３２８は、コピーすべきバイトの数を記述し得、したがって、参照されるサンプル（または複数のＮＡＬユニットを参照するときには複数のサンプル）の長さと等価であり得る。 Data offset value 326 may describe the starting point of the data identified by media extractor 320. That is, the data offset value 326 may comprise a value representing the offset to the first byte in the track identified by the track index value 324 to be copied. The data length value 328 may describe the number of bytes to be copied and thus may be equivalent to the length of the referenced sample (or multiple samples when referring to multiple NAL units).
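The byte-range form amounts to a bounds-checked slice. In this sketch the referenced track's data is assumed to be available as one `bytes` object; the function name `copy_byte_range` is hypothetical.

```python
def copy_byte_range(track_data: bytes, data_offset: int, data_length: int) -> bytes:
    """Copy data_length bytes starting at data_offset within the track data."""
    end = data_offset + data_length
    if data_offset < 0 or data_length < 0 or end > len(track_data):
        raise ValueError("byte range falls outside the track data")
    return track_data[data_offset:end]
```

For example, `copy_byte_range(b"ABCDEFGH", 2, 3)` copies the three bytes starting at offset 2.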

以下の擬似コードは、メディアエクストラクタ３２０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code gives an exemplary definition of a media extractor class similar to media extractor 320.

図１１は、将来の拡張性のための予約済みビットを含んでいる例示的なメディアエクストラクタ３４０を示すブロック図である。メディアエクストラクタ３４０は、トラック参照インデックス３４２とサンプルオフセット値３４６とを含み、それらは、それぞれ、メディアエクストラクタ３０２およびサンプルオフセット値３０４と同様であり得る。さらに、メディアエクストラクタ３４０は、メディアエクストラクタに対する将来の拡張のために使用される予約済みビットを備え得る予約済みビット３４４を含む。以下の擬似コードは、メディアエクストラクタ３４０と同様のメディアエクストラクタクラスの例示的なクラス定義を与える。

FIG. 11 is a block diagram illustrating an example media extractor 340 that includes reserved bits for future extensibility. Media extractor 340 includes track reference index 342 and sample offset value 346, which may be similar to track reference index 302 and sample offset value 304, respectively. In addition, media extractor 340 includes reserved bits 344, which may comprise bits reserved for future extensions to the media extractor. The following pseudocode provides an example class definition for a media extractor class similar to media extractor 340.

図１２は、トラック参照インデックス値ではなく、トラック識別子値を使用する例示的なメディアエクストラクタ３５０を示すブロック図である。トラックを識別するためのトラック識別子値の使用は、ＩＳＯベースメディアファイルフォーマットでのトラック参照ボックスのプレゼンテーションを参照し得る。メディアエクストラクタ３５０の例は、トラック識別子３５２と予約済みビット３５４とサンプルオフセット値３５６とを含む。予約済みビット３５４は、予約済みビット３５４の周りの破線で示すように随意である。すなわち、いくつかの例は予約済みビット３５４を含み得るが、他の例は予約済みビット３５４を省略し得る。サンプルオフセット値３５６は、サンプルオフセット値３０４と同様であり得る。 FIG. 12 is a block diagram illustrating an example media extractor 350 that uses track identifier values rather than track reference index values. Use of the track identifier value to identify a track may refer to a presentation of a track reference box in the ISO base media file format. An example media extractor 350 includes a track identifier 352, a reserved bit 354, and a sample offset value 356. Reserved bit 354 is optional as indicated by the dashed line around reserved bit 354. That is, some examples may include reserved bits 354, while other examples may omit reserved bits 354. Sample offset value 356 may be similar to sample offset value 304.

トラック識別子３５２は、データを抽出すべきトラックのトラックＩＤを指定する。データが抽出されるトラック中のサンプルは、メディアエクストラクタ３５０を含んでいるサンプルに正確に時間的に整合され得る（メディア復号タイムラインにおいて、時間サンプルテーブルを使用して、サンプルオフセット３５６によって指定されたオフセットだけ調整される）。第１のトラック参照には、識別子値１が割り当てられ得る。値０は、将来の使用および拡張のために予約され得る。 Track identifier 352 specifies the track ID of the track from which to extract data. The sample in the track from which data is extracted may be temporally aligned exactly with the sample containing media extractor 350 (in the media decoding timeline, using the time-to-sample table, adjusted by the offset specified by sample offset 356). The first track reference may be assigned the identifier value 1. The value 0 may be reserved for future use and extension.

以下の擬似コードは、メディアエクストラクタ３５０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code provides an exemplary definition of a media extractor class similar to media extractor 350.

図１３は、例示的なメディアエクストラクタサンプルグループ３６０を示すブロック図である。マルチプレクサ３０は、サンプルテーブルボックスコンテナにおいて（タイプ識別子「ＭＥＳＧ」を有する）メッセージタイプボックス中にメディアエクストラクタサンプルグループ３６０を含み得る。マルチプレクサ３０は、メッセージボックスにおいて０または１つのメディアエクストラクタサンプルグループ３６０オブジェクトを含むように構成され得る。図１３の例では、メディアエクストラクタサンプルグループ３６０は、トラック参照インデックス３６２と、グループタイプ３６４と、グループ数カウント３６６と、予約済みビット３６８と、グループ記述インデックス３７０とを含む。 FIG. 13 is a block diagram illustrating an exemplary media extractor sample group 360. Multiplexer 30 may include a media extractor sample group 360 in a message type box (with type identifier “MESG”) in the sample table box container. Multiplexer 30 may be configured to include zero or one media extractor sample group 360 objects in the message box. In the example of FIG. 13, the media extractor sample group 360 includes a track reference index 362, a group type 364, a group count 366, a reserved bit 368, and a group description index 370.

トラック参照インデックス３６２は、ある基準下でサンプルグループからデータを抽出すべきトラックを発見するために使用されるトラック参照のインデックスを指定する。すなわち、トラック参照インデックス３６２は、トラック参照インデックス３０２と同様の方法で、メディアエクストラクタによって識別されるデータを抽出すべきトラックを識別する。 The track reference index 362 specifies the index of the track reference used to find the track from which data is to be extracted from the sample group under certain criteria. That is, the track reference index 362 identifies the track from which the data identified by the media extractor is to be extracted in the same manner as the track reference index 302.

グループタイプ値３６４は、メディアエクストラクタサンプルグループ３６０が対応するサンプルグループのタイプを識別する。グループタイプ値３６４は、概してサンプリンググループのサンプルグループを形成するために使用される基準を識別し、トラック参照インデックス３６２によって識別されるトラック中でグループタイプの同じ値をもつサンプルグループ記述テーブルにその基準をリンクする。グループタイプ値３６４は整数値を備え得る。このようにして、メディアエクストラクタサンプルグループ３６０のグループタイプ値は、トラック参照インデックス３６２が参照するトラックのグループタイプと同様であり得る。代替的に、ビデオ時間サブセットの場合、グループタイプ値３６４は「ｖｔｓｔ」として定義され得、メディアエクストラクタサンプルグループはそのグループタイプのためにのみ定義され得、シンタックステーブルは「ｇｒｏｕｐｉｎｇ＿ｔｙｐｅ」のシンタックス要素を必要としないであろう。 Group type value 364 identifies the type of sample group to which media extractor sample group 360 corresponds. Group type value 364 generally identifies the criterion used to form the sample groups of the sample grouping, and links that criterion to the sample group description table with the same value for group type in the track identified by track reference index 362. Group type value 364 may comprise an integer value. In this manner, the group type value of media extractor sample group 360 may be similar to the group type of the track to which track reference index 362 refers. Alternatively, for a video temporal subset, group type value 364 may be defined as "vtst", the media extractor sample group may be defined only for that group type, and the syntax table would not need the "grouping_type" syntax element.

グループ数カウント値３６６は、メディアエクストラクタサンプルグループ３６０を含むメディアエクストラクタトラック中のサンプルグループの数を記述し得る。グループ数カウント値３６６の値０は、グループタイプ値３６４によって参照される基準下でのすべてのサンプルグループが、メディアエクストラクタトラックを形成するために使用されることを表し得る。グループ記述インデックス３７０は、サンプルグループ記述テーブルにおいて、メディアエクストラクタトラックを形成するために使用されるサンプルグループエントリのインデックスを定義する。 Group number count value 366 may describe the number of sample groups in the media extractor track that includes media extractor sample group 360. A value of 0 for group number count value 366 may indicate that all sample groups under the criterion referenced by group type value 364 are used to form the media extractor track. Group description index 370 defines the index, in the sample group description table, of a sample group entry used to form the media extractor track.

本開示の技法によれば、メディアエクストラクタトラック中でサンプルＢに続くサンプルＡが、トラック参照インデックス３６２によって参照されるトラック中でサンプルＡがサンプルＢに続くことを示すように、サンプルが時間的に順序付けられるようにすべてのサンプルをサンプルグループエントリ中に配置するために、アセンブルプロセスが使用され得る。 In accordance with the techniques of this disclosure, an assembling process may be used to place all of the samples in the sample group entries in temporal order, such that a sample A following a sample B in the media extractor track indicates that sample A follows sample B in the track referenced by track reference index 362.
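One way to picture the assembling process is as a filter followed by a sort on decoding time. This is a sketch only: the `(decode_time, group_index, payload)` tuple shape and the function name `assemble_track` are hypothetical, and an empty list of group description indices stands in for a group count of 0 (all sample groups).

```python
def assemble_track(samples, group_description_indices):
    """Select samples from the listed sample groups and order them in time.

    samples: iterable of (decode_time, group_index, payload) tuples.
    group_description_indices: sample group entries to use; empty means all.
    """
    wanted = set(group_description_indices)
    chosen = [s for s in samples if not wanted or s[1] in wanted]
    # Temporal ordering: if A follows B here, A also follows B in the
    # referenced track.
    return sorted(chosen, key=lambda s: s[0])
```

Sorting by decoding time keeps the relative order of the selected samples identical to their order in the referenced track.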

以下の擬似コードは、メディアエクストラクタサンプルグループ３６０と同様のメディアエクストラクタサンプルグループクラスの例示的な定義を与える。

The following pseudo code gives an exemplary definition of a media extractor sample group class similar to the media extractor sample group 360.

図１４は、ＡＶＣファイルフォーマットに準拠するビデオファイルのコンテキストにおいて使用され得る例示的なメディアエクストラクタ３８０を示すブロック図である。メディアエクストラクタ３８０の例は、トラック参照インデックス３８２と、時間識別子値３８４と、予約済みビット３８６と、サンプルオフセット値３８８とを含む。トラック参照インデックス３８２およびサンプルオフセット値３８８は、それぞれトラック参照インデックス３０２およびサンプルオフセット値３０４と同様の方法で使用され得る。予約済みビット３８６は、将来の使用のために予約され得、この時点ではセマンティック値を割り当てられない。 FIG. 14 is a block diagram illustrating an example media extractor 380 that may be used in the context of a video file that conforms to the AVC file format. An example media extractor 380 includes a track reference index 382, a time identifier value 384, a reserved bit 386, and a sample offset value 388. The track reference index 382 and sample offset value 388 may be used in a similar manner as the track reference index 302 and sample offset value 304, respectively. Reserved bit 386 may be reserved for future use and is not assigned a semantic value at this time.

時間識別子値３８４は、メディアエクストラクタ３８０によって抽出されたサンプルの時間レベルを指定する。一例では、時間レベルは０以上７以下の範囲内にある。上記で説明したように、符号化されたピクチャは時間レベルに対応し得、時間レベルは、概してフレーム間の符号化階層を記述する。たとえば、（アンカーフレームとも呼ばれる）キーフレームは最高時間レベルを割り当てられ得、参照フレームとして使用されないフレームは相対的により低い時間レベルを割り当てられ得る。このようにして、メディアエクストラクタ３８０は、サンプル自体を明示的に識別するのではなく、サンプルの時間レベルを参照することによって、トラック参照インデックス３８２によって参照されたトラックから抽出されたサンプルを識別し得る。時間識別子値３８４によって定義される値よりも高い値までのメディアエクストラクタをもつメディアエクストラクタトラックは、より高いフレームレートをもつ動作点に対応し得る。 Temporal identifier value 384 specifies the temporal level of the samples extracted by media extractor 380. In one example, the temporal level is in the range of 0 to 7, inclusive. As described above, encoded pictures may correspond to temporal levels, where a temporal level generally describes the encoding hierarchy among frames. For example, key frames (also referred to as anchor frames) may be assigned the highest temporal level, and frames that are not used as reference frames may be assigned relatively lower temporal levels. In this manner, media extractor 380 may identify the samples extracted from the track referenced by track reference index 382 by referring to the temporal level of the samples, rather than explicitly identifying the samples themselves. A media extractor track having media extractors up to a higher value of temporal identifier value 384 may correspond to an operating point with a higher frame rate.

以下の擬似コードは、メディアエクストラクタ３８０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code gives an exemplary definition of a media extractor class similar to media extractor 380.

図１５は、メディアエクストラクタトラックを含むようにＭＶＣを変更するために使用され得る例示的なＭＶＣメディアエクストラクタ４２０を示すブロック図である。メディアエクストラクタ４２０の例は、随意のＮＡＬユニットヘッダ４２２と、トラック参照インデックス４２４と、サンプルオフセット４２６と、連続バイトセットカウント４２８と、データオフセット値４３０およびデータ長値４３２を含む値のループとを含む。ＭＶＣメディアエクストラクタ４２０は、特定のトラックからビュー構成要素のサブセットのいくつかのＮＡＬユニットを抽出するために使用され得る。ＭＶＣメディアエクストラクタ４２０の例は、参照されたトラックのサンプルからデータを抽出するときにトラック中のビュー構成要素をスキップすることができる。 FIG. 15 is a block diagram illustrating an example MVC media extractor 420 that may be used to modify the MVC to include a media extractor track. An example media extractor 420 includes an optional NAL unit header 422, a track reference index 424, a sample offset 426, a continuous byte set count 428, and a loop of values that includes a data offset value 430 and a data length value 432. . The MVC media extractor 420 can be used to extract several NAL units of a subset of view components from a particular track. The example MVC media extractor 420 can skip view components in a track when extracting data from a sample of the referenced track.

存在するとき、ＮＡＬユニットヘッダ４２２は、ＭＶＣメディアエクストラクタ４２０によって識別されたＮＡＬユニットのＮＡＬユニットヘッダをミラーリングし得る。すなわち、ＮＡＬユニットヘッダ４２２のシンタックス要素は、ＭＶＣファイルフォーマットで定義されたエクストラクタまたはアグリゲータ生成プロセスにおけるＮＡＬユニットヘッダシンタックスに従って生成され得る。いくつかの例では、たとえば、関係するＮＡＬユニットヘッダを含めるために一連のエクストラクタが生成されるとき、エクストラクタはＮＡＬユニットヘッダ４２２を必要としないことがある。 When present, NAL unit header 422 may mirror the NAL unit header of the NAL unit identified by MVC media extractor 420. That is, the syntax element of the NAL unit header 422 may be generated according to the NAL unit header syntax in the extractor or aggregator generation process defined in the MVC file format. In some examples, the extractor may not require the NAL unit header 422, for example, when a series of extractors are generated to include the relevant NAL unit header.

トラック参照インデックス値４２４は、データを抽出すべきトラックを発見するために使用するトラック参照のインデックスを指定する。データが抽出されるトラック中のサンプルは、サンプルオフセット値４２６によって指定されたオフセットだけ調整された、メディア復号タイムラインにおいて、ＭＶＣメディアエクストラクタ４２０を含んでいるサンプルに時間的に整合され得る。第１のトラック参照は、インデックス値１を受信するように指定され得、トラック参照インデックス値の値０が予約され得る。 The track reference index value 424 specifies the index of the track reference used to find the track from which data is to be extracted. The samples in the track from which data is extracted can be temporally aligned with the samples containing the MVC media extractor 420 in the media decoding timeline, adjusted by the offset specified by the sample offset value 426. The first track reference may be designated to receive an index value of 1, and a track reference index value of 0 may be reserved.

サンプルオフセット値４２６は、トラック参照インデックス値４２４によって参照されたトラック中にある抽出されるべきサンプルの、ＭＶＣメディアエクストラクタ４２０の時間ロケーションに対するオフセットを定義する。サンプルオフセット値４２６の値０は、抽出すべきサンプルが同じ時間ロケーションにあることを示し、−１は前のサンプルを示し、＋１は次のサンプルを示し、以下同様である。 Sample offset value 426 defines the offset of the sample to be extracted in the track referenced by track reference index value 424 relative to the time location of MVC media extractor 420. A sample offset value 426 value of 0 indicates that the sample to be extracted is at the same time location, -1 indicates the previous sample, +1 indicates the next sample, and so on.

連続バイトセットカウント４２８は、データを抽出すべきトラックのサンプルの連続バイトセットの数を記述する。連続バイトセットカウント４２８が値０を有する場合、トラック中の参照されたサンプル全体が抽出されることになる。連続バイトセットはまた、サンプルの別々の部分として参照され得る。 Consecutive byte set count 428 describes the number of consecutive byte sets of samples of the track from which data is to be extracted. If the continuous byte set count 428 has the value 0, the entire referenced sample in the track will be extracted. A contiguous byte set can also be referred to as a separate part of the sample.

データオフセット値４３０およびデータ長値４３２はループにおいて発生する。概して、ループの反復回数、すなわち、データオフセット値４３０およびデータ長値４３２の数は、抽出されるべきサンプルの部分の数（たとえば、連続バイトセットの数）に関係する。このようにして、ＭＶＣメディアエクストラクタ４２０を使用してサンプルの２つ以上の部分が抽出され得る。抽出されるべきサンプルの部分ごとに、データオフセット値４３０のうちの対応する１つが部分の開始（たとえば、サンプルの最初のバイトに対する、部分の最初のバイト）を示し、データ長値４３２のうちの対応する１つが、コピーすべき長さ、たとえば、バイトの数を示す。いくつかの例では、データ長値４３２のうちの１つの値０は、サンプル中のすべての残りのバイトをコピーすべきであること、すなわち、部分が、データオフセット値４３０のうちの対応する１つによって示されたバイトと、サンプルの終端までのすべての他の連続バイトとに対応することを示し得る。 Data offset values 430 and data length values 432 occur in a loop. In general, the number of loop iterations, that is, the number of data offset values 430 and data length values 432, corresponds to the number of portions of the sample to be extracted (for example, the number of continuous byte sets). In this manner, two or more portions of a sample can be extracted using MVC media extractor 420. For each portion of the sample to be extracted, the corresponding one of data offset values 430 indicates the start of the portion (for example, the first byte of the portion relative to the first byte of the sample), and the corresponding one of data length values 432 indicates the length to copy, for example the number of bytes. In some examples, a value of 0 for one of data length values 432 may indicate that all remaining bytes in the sample should be copied, that is, that the portion corresponds to the byte indicated by the corresponding one of data offset values 430 and all other contiguous bytes up to the end of the sample.
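The loop over offset and length pairs, including the two zero-valued special cases, can be sketched as below. The function name `extract_parts` and the list-of-pairs representation are hypothetical; an empty `parts` list stands in for a continuous byte set count of 0.

```python
def extract_parts(sample: bytes, parts) -> bytes:
    """Copy the listed portions of a sample and concatenate them.

    parts: list of (data_offset, data_length) pairs, offsets relative to the
    first byte of the sample. An empty list means the whole sample.
    """
    if not parts:
        return sample  # continuous byte set count of 0: extract everything
    out = bytearray()
    for data_offset, data_length in parts:
        if data_length == 0:
            # Length 0: copy from the offset through the end of the sample.
            out += sample[data_offset:]
        else:
            out += sample[data_offset:data_offset + data_length]
    return bytes(out)
```

Skipping a view component then comes down to listing only the byte ranges on either side of it.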

以下の擬似コードは、ＭＶＣメディアエクストラクタ４２０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code gives an exemplary definition of a media extractor class similar to MVC media extractor 420.

図１６は、メディアエクストラクタトラックを含むようにＭＶＣを変更するために使用され得る別の例示的なＭＶＣメディアエクストラクタ４４０を示すブロック図である。ＭＶＣメディアエクストラクタ４４０の例は、図１５の例に関して説明したサンプルの固有のバイトとは反対に、抽出のための特定のＮＡＬユニットを識別する。図１６の例では、ＭＶＣメディアエクストラクタ４４０は、随意のＮＡＬユニットヘッダ４４２と、トラック参照インデックス４４４と、サンプルオフセット４４６と、連続ＮＡＬＵ（ＮＡＬユニット）セットカウント４４８と、ＮＡＬＵオフセット値４５０および連続ＮＡＬユニットの数４５２のループとを含む。ＮＡＬユニットヘッダ４４２、トラック参照インデックス４４４、およびサンプルオフセット値４４６は、概して、それぞれＮＡＬユニットヘッダ４２２、トラック参照インデックス４２４、およびサンプルオフセット値４２６と同様に定義される。 FIG. 16 is a block diagram illustrating another example MVC media extractor 440 that may be used to modify MVC to include media extractor tracks. The example of MVC media extractor 440 identifies particular NAL units for extraction, rather than particular bytes of a sample as described with respect to the example of FIG. 15. In the example of FIG. 16, MVC media extractor 440 includes optional NAL unit header 442, track reference index 444, sample offset 446, continuous NALU (NAL unit) set count 448, and a loop of NALU offset values 450 and numbers of continuous NAL units 452. NAL unit header 442, track reference index 444, and sample offset value 446 are generally defined similarly to NAL unit header 422, track reference index 424, and sample offset value 426, respectively.

連続ＮＡＬＵセットカウント４４８は、データを抽出すべきトラックのサンプルの連続ＮＡＬユニットの数を記述する。いくつかの例では、この値が０に設定された場合、トラック中の参照されたサンプル全体が抽出される。 The continuous NALU set count 448 describes the number of consecutive NAL units of the sample of the track from which data is to be extracted. In some examples, if this value is set to 0, the entire referenced sample in the track is extracted.

ＮＡＬＵオフセット値４５０および連続ＮＡＬＵの数４５２はループにおいて発生する。概して、連続ＮＡＬＵセットカウント４４８によって定義された、連続ＮＡＬＵのセットと同数のＮＡＬＵオフセット値のインスタンスおよび連続ＮＡＬＵの数がある。各ＮＡＬＵオフセット値は、データを抽出すべきトラックのサンプルにおける対応するＮＡＬユニットのオフセットを記述する。ＮＡＬユニットのうちのこのオフセットから開始するＮＡＬユニットは、このエクストラクタを使用して抽出され得る。連続ＮＡＬＵ値の各数は、ＮＡＬユニットの対応するセットのためにコピーすべき、単一の参照されたＮＡＬユニット全体の数を記述する。 NALU offset values 450 and numbers of continuous NALUs 452 occur in a loop. In general, there are as many instances of the NALU offset value and of the number of continuous NALUs as there are sets of continuous NALUs, as defined by continuous NALU set count 448. Each NALU offset value describes the offset of the corresponding NAL unit in the sample of the track from which to extract data. The NAL units starting from this offset can be extracted using this extractor. Each number of continuous NALUs describes the number of entire, single referenced NAL units to copy for the corresponding set of NAL units.
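The NAL-unit-granular variant can be sketched in the same way, operating on a list of whole NAL units instead of bytes. The sketch assumes the NALU offset is a 0-based position within the sample's NAL units, which the text does not state; `extract_nalus` and the pair representation are hypothetical.

```python
def extract_nalus(nal_units, nalu_sets):
    """Copy whole NAL units from a sample.

    nal_units: the sample's NAL units, in order.
    nalu_sets: list of (nalu_offset, num_continuous_nalus) pairs; an empty
    list stands in for a continuous NALU set count of 0 (whole sample).
    """
    if not nalu_sets:
        return list(nal_units)
    out = []
    for nalu_offset, num_continuous_nalus in nalu_sets:
        # Each set is a run of entire NAL units starting at nalu_offset.
        out.extend(nal_units[nalu_offset:nalu_offset + num_continuous_nalus])
    return out
```

Working at NAL unit granularity means the extractor never splits a NAL unit, unlike the byte-range form of FIG. 15.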

以下の擬似コードは、ＭＶＣメディアエクストラクタ４４０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code provides an exemplary definition of a media extractor class similar to MVC media extractor 440.

図１７は、ビュー構成要素のための２つ以上のＮＡＬユニットがあるとき、同じビュー構成要素中のＮＡＬユニットをアグリゲートする別の例示的なＭＶＣメディアエクストラクタ４６０を示すブロック図である。その場合、ＭＶＣメディアエクストラクタ４６０は、識別されたビュー構成要素を抽出するために使用され得る。図１７の例では、ＭＶＣメディアエクストラクタ４６０は、随意のＮＡＬユニットヘッダ４６２と、トラック参照インデックス４６４と、サンプルオフセット４６６と、連続ビューセットカウント４６８と、ビュー構成要素オフセット値４７０およびビュー構成要素カウント４７２のループとを含む。ＮＡＬユニットヘッダ４６２、トラック参照インデックス４６４、およびサンプルオフセット値４６６は、概して、それぞれＮＡＬユニットヘッダ４２２、トラック参照インデックス４２４、およびサンプルオフセット値４２６と同様に定義される。 FIG. 17 is a block diagram illustrating another example MVC media extractor 460 that aggregates the NAL units of the same view component when there are two or more NAL units for that view component. In that case, MVC media extractor 460 may be used to extract the identified view components. In the example of FIG. 17, MVC media extractor 460 includes optional NAL unit header 462, track reference index 464, sample offset 466, continuous view set count 468, and a loop of view component offset values 470 and view component counts 472. NAL unit header 462, track reference index 464, and sample offset value 466 are generally defined similarly to NAL unit header 422, track reference index 424, and sample offset value 426, respectively.

連続ビューセットカウント４６８は、データを抽出すべき、トラック参照インデックス４６４によって識別されたトラック中の識別されたサンプルの連続ビュー構成要素の数を定義する。マルチプレクサ３０は、トラック中の参照されたサンプル全体が抽出されるべきであることを示すために、連続ビューセットカウント４６８の値を０に設定し得る。 The continuous view set count 468 defines the number of consecutive view components of the identified sample in the track identified by the track reference index 464 from which data is to be extracted. Multiplexer 30 may set the continuous view set count 468 value to 0 to indicate that the entire referenced sample in the track should be extracted.

ビュー構成要素オフセット値４７０およびビュー構成要素カウント４７２はループにおいて発生する。概して、連続ビューセットカウント４６８の値と同数のループの反復があり、各ループは連続ビューセットのうちの１つに対応する。ビュー構成要素オフセット値４７０の各々は、対応する連続ビューセットのためのデータを抽出すべきトラックのサンプルにおける最初のビュー構成要素のオフセットを示す。次いで、ビュー構成要素のうちのこのオフセットから開始するビュー構成要素は、ＭＶＣメディアエクストラクタ４６０を使用して抽出され得る。ビュー構成要素カウント４７２の各々は、対応する連続ビューセットのためのコピーすべきサンプル中の参照されたビュー構成要素全体の数を記述する。 View component offset value 470 and view component count 472 occur in a loop. In general, there are as many loop iterations as the value of the continuous view set count 468, with each loop corresponding to one of the continuous view sets. Each view component offset value 470 indicates the offset of the first view component in the sample of the track from which data for the corresponding continuous view set is to be extracted. The view component starting from this offset of the view components can then be extracted using the MVC media extractor 460. Each view component count 472 describes the total number of referenced view components in the sample to be copied for the corresponding continuous view set.

以下の擬似コードは、ＭＶＣメディアエクストラクタ４６０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code gives an exemplary definition of a media extractor class similar to MVC media extractor 460.

図１８は、様々なトラックを参照するために使用され得るＭＶＣメディアエクストラクタ４８０の別の例を示すブロック図である。図１８の例では、ＭＶＣメディアエクストラクタ４８０は、随意のＮＡＬユニットヘッダ４８２と、連続ビューセットカウント４８４と、サンプルオフセット値４８６、トラック参照インデックス値４８８、ビュー構成要素オフセット値４９０、およびビュー構成要素カウント４９２のループとを含む。ＮＡＬユニットヘッダ４８２は、ＮＡＬユニットヘッダ４２２と同様に定義され得、いくつかの例では省略され得る。 FIG. 18 is a block diagram illustrating another example of an MVC media extractor 480 that may be used to reference various tracks. In the example of FIG. 18, MVC media extractor 480 includes optional NAL unit header 482, continuous view set count 484, and a loop of sample offset values 486, track reference index values 488, view component offset values 490, and view component counts 492. NAL unit header 482 may be defined similarly to NAL unit header 422 and may be omitted in some examples.

連続ビューセットカウント４８４は、ｔｒａｃｋ＿ｒｅｆ＿ｉｎｄｅｘのトラック参照インデックスをもつ、データを抽出すべきメディアエクストラクタトラックのサンプルの連続ビュー構成要素の数を与える。ｔｒａｃｋ＿ｒｅｆ＿ｉｎｄｅｘは、データを抽出すべきトラックを発見するために使用すべきトラック参照のインデックスを指定し得る。データが抽出されるトラック中のビュー構成要素は、（時間サンプル表を使用して、サンプルオフセット値４８６のうちの対応する１つによって指定されたオフセットだけ調整された、メディア復号タイムラインにおいて、）ＭｅｄｉａＥｘｔｒａｃｔｏｒＭＶＣを含んでいるサンプルに時間的に整合され得る。第１のトラック参照はインデックス値１を有し得、値０は将来の使用のために予約され得る。 Continuous view set count 484 gives the number of continuous view components of the sample of the track, having the track reference index track_ref_index, from which to extract data. track_ref_index may specify the index of the track reference to use to find the track from which to extract data. The view components in the track from which data is extracted may be temporally aligned with the sample containing the MediaExtractorMVC (in the media decoding timeline, using the time-to-sample table, adjusted by the offset specified by the corresponding one of sample offset values 486). The first track reference may have index value 1, and the value 0 may be reserved for future use.

ＭＶＣメディアエクストラクタ４８０の例は、サンプルオフセット値４８６と、トラック参照インデックス値４８８と、ビュー構成要素オフセット値４９０と、ビュー構成要素カウント４９２との各々をループ中に含む。ループの各反復は、ＭＶＣメディアエクストラクタ４８０に対応するサンプルのためのデータを抽出すべき特定のトラックに対応する。 The example MVC media extractor 480 includes a sample offset value 486, a track reference index value 488, a view component offset value 490, and a view component count 492 in the loop. Each iteration of the loop corresponds to a particular track from which data for the sample corresponding to MVC media extractor 480 is to be extracted.

サンプルオフセット値４８６は、トラック参照インデックス値４８８のうちの対応する１つによって参照された、情報源として使用され得るトラック中のサンプルの相対インデックスを定義する。サンプル０は、ＭＶＣメディアエクストラクタ４８０を含んでいるサンプルと同じか、または最も近い先行する復号時間をもつトラック参照インデックス値４８８のうちの対応する１つによって識別されたトラック中のサンプルであり、サンプル１は次のサンプルであり、サンプル−１は前のサンプルであり、以下同様である。 Sample offset value 486 defines the relative index of the sample in the track that can be used as an information source, referenced by a corresponding one of the track reference index values 488. Sample 0 is the sample in the track identified by the corresponding one of the track reference index values 488 having the same or closest preceding decoding time as the sample containing MVC media extractor 480; Sample 1 is the next sample, Sample-1 is the previous sample, and so on.

トラック参照インデックス値４８８の各々は、ループの対応する反復のためのデータを抽出すべきトラックを発見するために使用すべきトラック参照のインデックスを指定する。複数のトラック参照インデックス値を使用することによって、ＭＶＣメディアエクストラクタ４８０は、複数の異なるトラックからデータを抽出し得る。 Each of the track reference index values 488 specifies the index of the track reference that should be used to find the track from which data is to be extracted for the corresponding iteration of the loop. By using multiple track reference index values, the MVC media extractor 480 may extract data from multiple different tracks.

ビュー構成要素オフセット値４９０の各々は、ループのこの反復におけるトラック参照インデックス値４８８のうちの対応する１つに対応するトラック参照インデックスをもつ、データを抽出すべきトラックのサンプルにおける第１のビュー構成要素のオフセットを記述する。ビュー構成要素のうちのこのオフセットから開始するビュー構成要素は、ＭＶＣメディアエクストラクタ４８０を使用して抽出され得る。いくつかの例では、外側のループは、サンプルが抽出されるべきトラックにわたって反復し、内側のループは、対応するトラックから抽出されるべきサンプルにわたって反復する、ネスティングされたループ構造を有する、図１５〜図１７のメディアエクストラクタと同様のメディアエクストラクタが構築され得る。ビュー構成要素カウント４９２の各々は、ループのこの反復におけるトラック参照インデックス値４８８の現在の値に対応するトラック参照インデックスをもつトラックのサンプル中の参照されたビュー構成要素の数を記述する。 Each of view component offset values 490 describes the offset of the first view component in the sample of the track from which to extract data, where that track has the track reference index given by the corresponding one of track reference index values 488 in this iteration of the loop. The view components starting from this offset can be extracted using MVC media extractor 480. In some examples, a media extractor similar to the media extractors of FIGS. 15-17 may be constructed with a nested loop structure, in which the outer loop iterates over the tracks from which samples are to be extracted and the inner loop iterates over the samples to be extracted from the corresponding track. Each of view component counts 492 describes the number of referenced view components in the sample of the track whose track reference index corresponds to the current one of track reference index values 488 in this iteration of the loop.
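The per-iteration fields of this multi-track form can be sketched as a loop over four-field entries. Several simplifications are assumed here and are not from the text: tracks are held in a dict keyed by track reference index, each sample is a list of view components, the tracks are sample-aligned so that `current_index` is already the time-aligned sample, and view component offsets are 0-based. The function name is hypothetical.

```python
def extract_view_components(tracks, current_index, entries):
    """Gather view components from several tracks for one extractor sample.

    tracks: dict mapping track_ref_index -> list of samples, where each
    sample is a list of view components.
    entries: one (sample_offset, track_ref_index, view_comp_offset,
    view_comp_count) tuple per loop iteration.
    """
    out = []
    for sample_offset, track_ref_index, vc_offset, vc_count in entries:
        samples = tracks[track_ref_index]
        index = current_index + sample_offset
        if not 0 <= index < len(samples):
            raise IndexError("sample offset points outside the track")
        # Copy vc_count view components starting at vc_offset.
        out.extend(samples[index][vc_offset:vc_offset + vc_count])
    return out
```

Because each entry carries its own track reference index, a single extractor can pull view components from several different tracks.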

以下の擬似コードは、ＭＶＣメディアエクストラクタ４８０と同様のメディアエクストラクタクラスの例示的な定義を与える。

The following pseudo code provides an exemplary definition of a media extractor class similar to MVC media extractor 480.

図１９は、エクストラクタの持続時間をシグナリングする別の例示的なＭＶＣメディアエクストラクタ５００を示すブロック図である。メディアエクストラクタトラック中の異なるサンプルがエクストラクタの同じシンタックス要素を共有するとき、ＭＶＣメディアエクストラクタ５００は１つまたは複数の利点を与え得る。図１９の例では、ＭＶＣメディアエクストラクタ５００は、サンプルカウント５０２と、連続ビューセットカウント５０４と、サンプルオフセット値５０６と、トラック参照インデックス５０８と、ビュー構成要素オフセット５１０と、ビュー構成要素カウント５１２とを含む。 FIG. 19 is a block diagram illustrating another example MVC media extractor 500 that signals the duration of the extractor. The MVC media extractor 500 may provide one or more advantages when different samples in the media extractor track share the same syntax elements of the extractor. In the example of FIG. 19, the MVC media extractor 500 includes a sample count 502, a continuous view set count 504, a sample offset value 506, a track reference index 508, a view component offset 510, and a view component count 512. Including.

Continuous view set count 504, sample offset value 506, track reference index 508, view component offset 510, and view component count 512 may generally be defined according to the corresponding ones of continuous view set count 484, sample offset value 486, track reference index 488, view component offset 490, and view component count 492. Sample count 502 may define the number of consecutive samples, in the media extractor track containing MVC media extractor 500, that use the same media extractor.

The following pseudocode provides an example definition of a media extractor class similar to MVC media extractor 500.
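The role of the sample count can be illustrated with a short Python sketch. It assumes that extractors are stored back to back and that each one covers `sample_count` consecutive samples of the extractor track; the names `TimedExtractor` and `extractor_for_sample` are hypothetical, and `payload` merely stands in for the remaining extractor fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedExtractor:
    sample_count: int   # consecutive extractor-track samples sharing this extractor
    payload: str        # placeholder for the other extractor fields


def extractor_for_sample(extractors: List[TimedExtractor],
                         sample_index: int) -> TimedExtractor:
    """Return the extractor whose run of samples covers sample_index."""
    first = 0
    for ex in extractors:
        if first <= sample_index < first + ex.sample_count:
            return ex
        first += ex.sample_count
    raise IndexError("sample index beyond the last extractor run")
```

With such a run-length style layout, the extractor syntax elements are stored once per run rather than once per sample, which is the kind of saving the sample count is meant to enable.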

FIG. 20 is a block diagram illustrating another example MVC media extractor 520 that defines a set of different extractors. Each sample in a media extractor track can use either one or more of the set of extractors or references to the extractors. That is, a set of media extractors similar to MVC media extractor 520 may be defined, and each sample may use either one or more of the set of extractors, or references to the extractors, to identify samples of another track.

The example of MVC media extractor 520 includes extractor identifier value 522, sample offset value 524, track reference index value 526, continuous view set count 528, and a loop including view component offset 530 and view component count 532. Sample offset value 524, continuous view set count 528, view component offset 530, and view component count 532 may be defined according to the corresponding ones of sample offset value 486, continuous view set count 484, view component offset 490, and view component count 492. Track reference index value 526 may be defined, for example, according to track reference index 464.

Extractor identifier value 522 defines the identifier of the extractor, that is, of MVC media extractor 520. Extractors in the same media extractor track are assigned different extractor identifier values, so that a sample in the media extractor track can refer to an extractor identifier value in order to use the corresponding media extractor. A reference extractor box may also be defined to include a number of extractors and reference extractor identifiers. The value of the number of extractors may give the number of extractors used to copy data for a sample in the extractor track. When the value of the number of extractors is equal to 0, an extractor having a predetermined extractor identifier, for example, an extractor identifier equal to 0, may be used. The reference extractor identifier may give the extractor identifier of an extractor used to copy data for the sample in the extractor track. This box may be included in a sample of a media extractor track.

The following pseudocode provides an example definition of a media extractor class similar to MVC media extractor 520.

The following pseudocode provides an example definition of a reference extractor box class for the reference extractor box described above.
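The lookup implied by the reference extractor box can be sketched in Python as follows. The sketch assumes that the extractors of a track are held in a table keyed by extractor identifier and that the predetermined identifier used when the number of extractors is 0 is itself 0, as in the example given in the text; `DEFAULT_EXTRACTOR_ID` and `extractors_for_sample` are hypothetical names.

```python
from typing import Dict, List

DEFAULT_EXTRACTOR_ID = 0  # assumed predetermined identifier (the text's example value)


def extractors_for_sample(extractor_table: Dict[int, str],
                          ref_extractor_ids: List[int]) -> List[str]:
    """Return the extractors a sample uses, given its reference extractor box.

    ref_extractor_ids plays the role of the list of reference extractor
    identifiers; its length plays the role of the number of extractors.
    """
    if not ref_extractor_ids:  # number of extractors equal to 0
        return [extractor_table[DEFAULT_EXTRACTOR_ID]]
    return [extractor_table[i] for i in ref_extractor_ids]
```

Because each extractor in a track has a distinct identifier, a plain dictionary lookup is enough in this model; a real file reader would also need to handle identifiers that are absent from the table.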

FIG. 21 is a block diagram illustrating an example MVC media extractor 550 that may be formed using map sample groups. The example of MVC media extractor 550 specifies NAL unit groups from a series of sample entries, each of which gives consecutive NAL units in a map sample group. In the example of FIG. 21, MVC media extractor 550 includes NALU group count 552 and a loop including track reference index values 554, group description indexes 556, NALU start map samples 558, and NALU view counts 560.

NALU group count 552 specifies the number of NAL unit groups from map sample group entries in the referenced tracks. Each of track reference index values 554 specifies the index of the track reference to be used to locate the track from which data is to be extracted for the corresponding iteration of the loop. Each of group description indexes 556 specifies the index of the map sample group entry used to form the NAL unit group for the corresponding iteration of the loop. Each of NALU start map samples 558 specifies the offset of a NAL unit in the map sample group whose map sample entry index is given by the corresponding one of group description indexes 556 in the corresponding iteration of the loop. Each of NALU view counts 560 specifies the number of consecutive NAL units to be extracted into the media extractor from the map sample group whose map sample entry index is given by the corresponding one of group description indexes 556 in the corresponding iteration of the loop.

The following pseudocode provides an example definition of a media extractor class similar to MVC media extractor 550.
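The map sample group based extraction can be sketched along the same lines. In the Python sketch below, `NaluGroupRef` and `extract_nal_units` are hypothetical names, and the map sample groups are assumed to be available as a dict keyed by the pair (track reference index, group description index), each value being the ordered list of NAL units in that group.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NaluGroupRef:
    """One iteration of the loop in the extractor (hypothetical model)."""
    track_ref_index: int          # which referenced track holds the group
    group_description_index: int  # which map sample group entry to use
    nalu_start_map_sample: int    # offset of the first NAL unit in the group
    nalu_view_count: int          # number of consecutive NAL units to extract


def extract_nal_units(groups: List[NaluGroupRef],
                      map_groups: Dict[Tuple[int, int], List[str]]) -> List[str]:
    """Concatenate the NAL unit runs selected by each loop iteration."""
    out: List[str] = []
    for g in groups:
        nalus = map_groups[(g.track_ref_index, g.group_description_index)]
        start = g.nalu_start_map_sample
        out.extend(nalus[start:start + g.nalu_view_count])
    return out
```

In this model the NALU group count is simply the length of `groups`.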

The techniques of this disclosure may include an assembling process for arranging the view components of the samples in a sample group. The view components in the samples of a sample group entry are ordered in time such that: if sample A follows sample B in the original track (the track having the index given by the track reference index), the view components in sample A follow the view components in sample B in the media extractor track; if sample A has an earlier decoding time than sample B, the view components in sample A follow the view components in sample B in the media extractor track; two view components in the same sample of a track follow the order of presentation in the syntax table of the media extractor map sample group; two view components in the same sample of a track follow their original order if they belong to the same group of NAL units, that is, if they were extracted by the syntax elements of the same loop in the media extractor map sample group; and two view components that were extracted from samples in different tracks but have the same timestamp follow the order of the order indexes specified in the view identifier box.

FIG. 22 is a block diagram illustrating an example modified 3GPP track selection box 390 that signals additional attributes for track selection. As of this writing, the most recent 3GPP standard specifies an AttributeList that includes attributes describing language, bandwidth, codec, screen size, maximum packet size, and media type. Attribute list 392 of 3GPP track selection box 390 includes language value 394, bandwidth value 396, codec value 398, and screen size value 400, and signals these attributes in accordance with the existing 3GPP standard. In addition, the techniques of this disclosure may modify the existing 3GPP track selection box to include frame rate value 406, temporal identifier value 408, and, in some cases, display view number value 410 and output view list value 412.

Language value 394 defines the value of group type LANG of the "alt-group" attribute in the session-level SDP, as defined in clause 5.3.3.4 of the existing 3GPP standard. Bandwidth value 396 defines the value of the "b=AS" attribute in the media-level SDP. Codec value 398 defines the SampleEntry value in the sample description box of the media track. Screen size value 400 defines the width and height fields of the MP4VisualSampleEntry and H263SampleEntry values in the media track. Maximum packet size value 402 defines the value of the MaxPacketSize field in an RTPHintSampleEntry, for example, in an RTP hint track. Media type value 404 describes the HandlerType in the handler box of the media track. In general, these values correspond to the existing 3GPP standard.

Frame rate value 406 describes the frame rate of the video track or media extractor track corresponding to 3GPP track selection box 390. Temporal identifier value 408 corresponds to the temporal identifier of the video track corresponding to 3GPP track selection box 390, which may depend on tracks having lower temporal identifier values. In some examples, multiplexer 30 may indicate that the value of temporal identifier value 408 is unspecified by setting it to a preconfigured "unspecified" value, for example, 8. In general, multiplexer 30 may indicate that the value of temporal identifier value 408 is unspecified for non-video tracks. In some examples, multiplexer 30 may also indicate that the value of temporal identifier value 408 is unspecified when the corresponding video track does not contain media extractors and/or is not referenced by other tracks as a temporal subset.

In examples where MVC is considered in 3GPP, multiplexer 30 may include the additional attributes of display view number value 410 and output view list value 412. In such examples, multiplexer 30 may omit temporal identifier value 408. Display view number value 410 describes the number of views to be output for the corresponding track. The number of views to be output and the number of views to be decoded are not necessarily the same, for example, when a view to be displayed is encoded with reference to a view that is not displayed. Output view list value 412 may define a list of N view identifiers that identify the N views to be output.
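To make the use of these additional attributes concrete, the Python sketch below shows one way a client could compare candidate tracks. The record `TrackAttributes`, the function `select_track`, and the selection policy (prefer more output views, then a higher frame rate, within the client's limits) are illustrative assumptions and are not prescribed by this disclosure; only the example "unspecified" value of 8 is taken from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNSPECIFIED_TEMPORAL_ID = 8  # example "unspecified" value mentioned in the text


@dataclass
class TrackAttributes:
    track_id: int
    frame_rate: float
    temporal_id: int = UNSPECIFIED_TEMPORAL_ID
    output_view_ids: List[int] = field(default_factory=list)

    @property
    def num_display_views(self) -> int:
        # The display view number equals the length of the output view list.
        return len(self.output_view_ids)


def select_track(candidates: List[TrackAttributes],
                 max_frame_rate: float,
                 max_views: int) -> Optional[TrackAttributes]:
    """Pick the richest track within the client's limits (assumed policy)."""
    usable = [t for t in candidates
              if t.frame_rate <= max_frame_rate
              and t.num_display_views <= max_views]
    return max(usable,
               key=lambda t: (t.num_display_views, t.frame_rate),
               default=None)
```

A stereo-capable client would pass `max_views=2` and could then be offered a two-view track, while a single-view client with the same frame rate limit would fall back to a one-view track.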

FIG. 23 is a flowchart illustrating an example method for using media extractors in accordance with the techniques of this disclosure. Initially, a source device, such as A/V source device 20 (FIG. 1), constructs a video track for a file conforming to a file format in accordance with the techniques of this disclosure. That is, multiplexer 30 assembles encoded video data in a track such that the video track includes encoded video samples comprising one or more NAL units (600). Multiplexer 30 also constructs an extractor that references some or all of the one or more NAL units of the video track (602), and constructs an extractor track that includes the extractor (604). In addition, multiplexer 30 may include encoded video samples in the media extractor track, as well as in additional tracks that include encoded video samples and/or media extractors.

Multiplexer 30 then outputs the file (606). The file may be output to a signal via a transmitter, transceiver, network interface, modem, or other signal output means, or the file may be output to a storage medium via a hardware interface such as a USB interface, a magnetic media recorder, an optical recorder, or another hardware interface.

A/V destination device 40 ultimately receives the file (608), for example, by receiving the signal or reading the storage medium. Demultiplexer 38 selects one of the two (or more) tracks to be decoded (610). Demultiplexer 38 may select one of the tracks based on the decoding capabilities of video decoder 48, the rendering capabilities of video output 44, or other criteria. When an extractor track is selected, demultiplexer 38 may retrieve the NAL units referenced by the extractors in the extractor track from the tracks in which the encoded video samples identified by the extractors are stored.

Demultiplexer 38 may discard encoded video samples (or other NAL units) that are not in the selected track and are not identified by at least one extractor in the selected track. That is, demultiplexer 38 may avoid sending such encoded video samples to video decoder 48, so that video decoder 48 need not be tasked with decoding video data that will not be used.
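The receiving side of this flow can be sketched in Python as follows. The sketch assumes a simplified in-memory file in which each track is a list whose entries are either NAL unit payloads (`bytes`) or extractors, modeled here as `(source track id, NAL unit index)` tuples, and it assumes that an extractor always points at a plain NAL unit rather than at another extractor. The function name `split_for_decoder` and this data layout are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Extractor = Tuple[str, int]        # (source track id, NAL unit index): assumed form
Entry = Union[bytes, Extractor]


def split_for_decoder(tracks: Dict[str, List[Entry]],
                      selected: str) -> Tuple[List[bytes], List[bytes]]:
    """Return (NAL units to send to the decoder, NAL units to discard)."""
    forward: List[bytes] = []
    referenced = set()
    for entry in tracks[selected]:
        if isinstance(entry, tuple):       # extractor: fetch the referenced NAL unit
            src, idx = entry
            forward.append(tracks[src][idx])
            referenced.add((src, idx))
        else:                              # NAL unit stored in the selected track itself
            forward.append(entry)
    # Anything outside the selected track that no extractor referenced is dropped.
    discarded = [nal
                 for tid, entries in tracks.items() if tid != selected
                 for i, nal in enumerate(entries)
                 if not isinstance(nal, tuple) and (tid, i) not in referenced]
    return forward, discarded
```

In this model, only the `forward` list would be handed to the video decoder, which matches the intent that the decoder is never given data it does not need to decode.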

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media may include computer-readable storage media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in this disclosure. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray (registered trademark) disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Instructions encoded in a computer-readable medium may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (for example, a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperating hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

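Various examples have been described. These and other examples are within the scope of the following claims.
The inventions recited in the claims as originally filed in the present application are appended below.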
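[1]
A method for encoding video data, the method comprising:
constructing, by a source video device, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit;
constructing, by the source video device, a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-consecutive;
including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
outputting the video file.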
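[2]
The method of [1], wherein the video file conforms to the ISO base media file format.
[3]
The method of [1], wherein the video file conforms to at least one of the scalable video coding (SVC) file format, the advanced video coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and the multiview video coding (MVC) file format.
[4]
The method of [1], wherein constructing the second track further comprises including in the second track, based on the encoded data, one or more additional NAL units that are not included in the plurality of NAL units of the first track.
[5]
The method of [4], further comprising constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[6]
The method of [5], wherein constructing the third track further comprises including in the third track one or more NAL units that are not included in the first track and the second track.
[7]
The method of [1], wherein constructing the second track comprises constructing the extractor to identify each of the plurality of NAL units of the video sample of the first track, such that the extractor causes a destination device to extract each of the plurality of NAL units of the video sample in its entirety.
[8]
The method of [1], wherein constructing the second track comprises constructing the extractor to identify the one or more of the plurality of NAL units of the video sample by specifying a byte range of the one or more of the plurality of NAL units of the video sample in the first track of the video file.
[9]
The method of [1], wherein the plurality of NAL units of the video sample in the first track comprises at least one of a slice of a common picture, a non-video coding layer (VCL) NAL unit, a supplemental enhancement information (SEI) message NAL unit, a video layer of the access unit, different view components of the access unit, and a NAL unit aggregated from a plurality of NAL units.
[10]
The method of [1], wherein the plurality of NAL units comprises a first plurality of NAL units, the method further comprising constructing, based on the encoded video data, a third track including a second plurality of NAL units, wherein the second plurality of NAL units forms part of the access unit, and wherein the second plurality of NAL units comprises the second identified NAL unit identified by the extractor.
[11]
The method of [1], wherein the video sample comprises a first video sample, the plurality of NAL units comprises a first plurality of NAL units, the first track further comprises a second sample comprising a second plurality of NAL units, the access unit comprises the second sample, and the second plurality of NAL units comprises the second NAL unit identified by the extractor.
[12]
The method of [1], wherein the second NAL unit comprises a second NAL unit of the plurality of NAL units of the video sample of the first track that is separated from the first identified NAL unit in the video sample by at least one byte of data.
[13]
The method of [1], wherein the first track and the second track form a switch group, such that either the first track or the second track is selectable for decoding by a destination device based on characteristics of each track.
[14]
The method of [13], wherein constructing the second track comprises:
signaling a frame rate of the second track; and
signaling a temporal identifier of the video sample of the first track for the second track,
and wherein, when the second track comprises two or more views, constructing the second track further comprises:
signaling a value representing a number of views to be displayed after decoding the second track;
signaling one or more view identifier values representing the views to be displayed for the second track; and
signaling a value representing a number of views to be decoded for the second track.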
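[15]
An apparatus for encoding video data, the apparatus comprising:
an encoder configured to encode video data;
a multiplexer configured to: construct, based on the encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit; construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-consecutive; and include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
an output interface configured to output the video file.
[16]
The apparatus of [15], wherein the video file conforms to at least one of the ISO base media file format, the scalable video coding (SVC) file format, the advanced video coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and the multiview video coding (MVC) file format.
[17]
The apparatus of [15], wherein the multiplexer is configured to include in the second track, based on the encoded video data, one or more NAL units that are not included in the first track.
[18]
The apparatus of [17], wherein the multiplexer is configured to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies one or more of the plurality of NAL units of the second track.
[19]
The apparatus of [15], wherein the extractor comprises a first extractor, the multiplexer is configured to construct, based on the encoded video data, a third extractor track including a plurality of NAL units, and the multiplexer is configured to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.
[20]
The apparatus of [16], wherein the apparatus comprises at least one of:
an integrated circuit;
a microprocessor; and
a wireless communication device that includes the video encoder and the multiplexer.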
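[21]
An apparatus for encoding video data, the apparatus comprising:
means for constructing, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit;
means for constructing a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-consecutive;
means for including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
means for outputting the video file.
[22]
The apparatus of [21], wherein the video file conforms to at least one of the ISO base media file format, the scalable video coding (SVC) file format, the advanced video coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and the multiview video coding (MVC) file format.
[23]
The apparatus of [21], further comprising means for including in the second track, based on the encoded data, one or more NAL units that are not included in the first track.
[24]
The apparatus of [23], further comprising means for constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[25]
The apparatus of [21], wherein the extractor comprises a first extractor, the apparatus further comprises means for constructing, based on the encoded video data, a third extractor track including a plurality of NAL units, and the means for constructing the second track comprises means for constructing the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.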
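[26]
A computer-readable storage medium comprising instructions that, when executed, cause a processor to:
construct, based on encoded video data, a first track including a video sample comprising a plurality of network access layer (NAL) units, wherein the video sample is included in an access unit;
construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, the at least one of the plurality of NAL units comprising a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-consecutive;
include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
output the video file.
[27]
The computer-readable storage medium of [26], wherein the video file conforms to at least one of the ISO base media file format, the scalable video coding (SVC) file format, the advanced video coding (AVC) file format, the Third Generation Partnership Project (3GPP) file format, and the multiview video coding (MVC) file format.
[28]
The computer-readable storage medium of [26], further comprising instructions that cause the processor to include in the second track, based on the encoded data, one or more NAL units that are not included in the first track.
[29]
The computer-readable storage medium of [28], further comprising instructions that cause the processor to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[30]
The computer-readable storage medium of [26], wherein the extractor comprises a first extractor, the computer-readable storage medium further comprises instructions that cause the processor to construct, based on the encoded video data, a third extractor track including a plurality of NAL units, and the instructions that cause the processor to construct the second track comprise instructions that cause the processor to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.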
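[31]
A method for decoding video data, the method comprising:
receiving, by a demultiplexer of a destination device, a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-consecutive;
selecting the second track to be decoded; and
sending encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the destination device.
[32]
The method of [31], further comprising discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[33]
The method of [31], wherein the second track further comprises one or more NAL units that are not included in the first track, the method further comprising sending encoded video data of the one or more NAL units of the second track to the video decoder.
[34]
The method of [31], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, the method further comprising sending encoded video data of the plurality of NAL units of the third track to the video decoder.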
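[35]
An apparatus for decoding video data, the apparatus comprising:
a video decoder configured to decode video data; and
a demultiplexer configured to: receive a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-consecutive; select the second track to be decoded; and send encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to the video decoder.
[36]
The apparatus of [35], wherein the demultiplexer is configured to discard each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[37]
The apparatus of [35], wherein the second track further comprises one or more NAL units that are not included in the first track, and the demultiplexer is configured to send encoded video data of the one or more NAL units of the second track to the video decoder.
[38]
The apparatus of [35], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, and the demultiplexer is configured to send encoded video data of the plurality of NAL units of the third track to the video decoder.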
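[39]
An apparatus for decoding video data, the apparatus comprising:
means for receiving a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-consecutive;
means for selecting the second track to be decoded; and
means for sending encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the apparatus.
[40]
The apparatus of [39], further comprising means for discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[41]
The apparatus of [39], wherein the second track further comprises one or more NAL units that are not included in the first track, the apparatus further comprising means for sending encoded video data of the one or more NAL units of the second track to the video decoder.
[42]
The apparatus of [39], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, the apparatus further comprising means for sending encoded video data of the plurality of NAL units of the third track to the video decoder.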
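[43]
A computer-readable storage medium comprising instructions that, when executed, cause a processor to:
select, upon receiving a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, a second track to be decoded, wherein the video file comprises a first track and the second track, the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, the video sample is included in an access unit, the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, the at least one of the plurality of NAL units comprises a first identified NAL unit, the extractor identifies a second NAL unit of the access unit, and the first identified NAL unit and the second identified NAL unit are non-consecutive; and
send encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder.
[44]
The computer-readable storage medium of [43], further comprising discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[45]
The computer-readable storage medium of [43], wherein the second track further comprises one or more NAL units that are not included in the first track, and the method further comprises sending encoded video data of the one or more NAL units of the second track to the video decoder.
[46]
The computer-readable storage medium of [43], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, and the method further comprises sending encoded video data of the plurality of NAL units of the third track to the video decoder.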
Various examples have been described. These and other examples are within the scope of the following claims.
The inventions recited in the claims of the present application as originally filed are appended below.
[1]
A method for encoding video data, the method comprising:
constructing, by a source video device, a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
constructing, by the source video device, a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous;
including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
outputting the video file.
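The track construction recited in [1] can be pictured with a small data model. The following is only a minimal sketch under stated assumptions: the class names `NalUnit`, `Sample`, `Extractor`, and `Track`, and the helpers `build_extractor_track` and `identifies_non_contiguous`, are hypothetical illustrations and are not structures defined by the ISO base media file format or by this application. Non-contiguity is simplified here to "not adjacent within the same video sample".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NalUnit:
    payload: bytes  # coded data carried by one NAL unit


@dataclass
class Sample:
    nal_units: List[NalUnit]  # ordered NAL units of one video sample


@dataclass
class Extractor:
    # Hypothetical reference: source track, sample, and NAL unit positions.
    source_track_id: int
    sample_index: int
    nal_indices: List[int]


@dataclass
class Track:
    track_id: int
    samples: List[Sample] = field(default_factory=list)
    extractors: List[Extractor] = field(default_factory=list)


def build_extractor_track(track_id: int, source: Track,
                          sample_index: int, nal_indices: List[int]) -> Track:
    """Build a second track whose extractor points at NAL units of `source`."""
    sample = source.samples[sample_index]
    for i in nal_indices:
        if not 0 <= i < len(sample.nal_units):
            raise IndexError(f"NAL unit index {i} is outside the sample")
    extractor = Extractor(source.track_id, sample_index, list(nal_indices))
    return Track(track_id=track_id, extractors=[extractor])


def identifies_non_contiguous(extractor: Extractor) -> bool:
    """True if at least two identified NAL units are not adjacent in the sample."""
    idx = sorted(extractor.nal_indices)
    return any(b - a > 1 for a, b in zip(idx, idx[1:]))
```

In this sketch the second track stores no coded data of its own; it only records which NAL units of the first track belong to it, which is what allows one stored sample to serve several operation points.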
[2]
The method according to [1], wherein the video file conforms to the ISO base media file format.
[3]
The method according to [1], wherein the video file conforms to at least one of a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.
[4]
The method according to [1], wherein constructing the second track further comprises including in the second track, based on the encoded data, one or more additional NAL units not included in the plurality of NAL units of the first track.
[5]
The method according to [4], further comprising constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[6]
The method according to [5], wherein constructing the third track further comprises including in the third track one or more NAL units not included in the first track and the second track.
[7]
The method according to [1], wherein constructing the second track comprises constructing the extractor to identify each of the plurality of NAL units of the video sample of the first track, and wherein the extractor causes a destination device to extract each of the plurality of NAL units of the video sample in its entirety.
[8]
The method according to [1], wherein constructing the second track comprises constructing the extractor to identify the one or more of the plurality of NAL units of the video sample by specifying one or more byte ranges of the one or more of the plurality of NAL units of the video sample in the first track of the video file.
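Identification by byte range, as recited in [8], can be sketched as follows. This is an illustrative sketch only: the function name `extract_byte_ranges` and the `(offset, length)` tuple layout are assumptions made for the example and are not the extractor syntax of any file format specification.

```python
from typing import List, Tuple


def extract_byte_ranges(sample: bytes, ranges: List[Tuple[int, int]]) -> bytes:
    """Concatenate the (offset, length) byte ranges of `sample`, in order."""
    out = bytearray()
    for offset, length in ranges:
        # Reject ranges that fall outside the stored sample.
        if offset < 0 or length < 0 or offset + length > len(sample):
            raise ValueError(f"byte range ({offset}, {length}) is out of bounds")
        out += sample[offset:offset + length]
    return bytes(out)
```

For a 12-byte sample holding three 4-byte NAL units, the ranges `[(0, 4), (8, 4)]` would select the first and third units and skip the second, which is one way two non-contiguous NAL units could be identified.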
[9]
The method according to [1], wherein the plurality of NAL units of the video sample in the first track comprises at least one of a slice of a common picture, a non-video coding layer (VCL) NAL unit, a supplemental enhancement information (SEI) message NAL unit, a video layer of the access unit, a different view component of the access unit, and a NAL unit aggregated from a plurality of NAL units.
[10]
The method according to [1], wherein the plurality of NAL units comprises a first plurality of NAL units, the method further comprising constructing a third track including a second plurality of NAL units based on the encoded video data, wherein the second plurality of NAL units forms part of the access unit, and wherein the second plurality of NAL units comprises the second identified NAL unit identified by the extractor.
[11]
The method according to [1], wherein the video sample comprises a first video sample, wherein the plurality of NAL units comprises a first plurality of NAL units, wherein the first track further comprises a second sample comprising a second plurality of NAL units, wherein the access unit comprises the second sample, and wherein the second plurality of NAL units comprises the second NAL unit identified by the extractor.
[12]
The method according to [1], wherein the plurality of NAL units of the video sample of the first track comprises the second NAL unit, and wherein the second NAL unit is separated from the first identified NAL unit in the video sample by at least one byte of data.
[13]
The method according to [1], wherein the first track and the second track form a switch group such that either the first track or the second track can be selected for decoding by a destination device based on the characteristics of each track.
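Selection within a switch group, as described in [13], might look like the sketch below. It assumes, purely for illustration, that each track advertises a frame rate and a number of views and that the destination device knows its own limits; `TrackInfo`, `select_track`, and the tie-breaking rule (prefer more views, then higher frame rate) are hypothetical and are not prescribed by the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrackInfo:
    track_id: int
    switch_group: int   # tracks sharing this value are alternatives
    frame_rate: float   # signaled frame rate of the track
    num_views: int      # signaled number of views to be displayed


def select_track(tracks: List[TrackInfo], switch_group: int,
                 max_frame_rate: float, max_views: int) -> Optional[TrackInfo]:
    """Pick the richest track of a switch group that the device can handle."""
    candidates = [
        t for t in tracks
        if t.switch_group == switch_group
        and t.frame_rate <= max_frame_rate
        and t.num_views <= max_views
    ]
    if not candidates:
        return None
    # Assumed preference: more views first, then higher frame rate.
    return max(candidates, key=lambda t: (t.num_views, t.frame_rate))
```

A device limited to a single view would, under these assumptions, pick the one-view track of the group, while a device that can display two views would pick the two-view track.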
[14]
The method according to [13], wherein constructing the second track further comprises:
signaling a frame rate of the second track;
signaling a temporal identifier of the video sample of the first track for the second track; and
when the second track comprises more than one view:
signaling a value representing the number of views to be displayed after decoding the second track;
signaling one or more view identifier values representing the views to be displayed for the second track; and
signaling a value representing the number of views to be decoded for the second track.
[15]
An apparatus for encoding video data, the apparatus comprising:
an encoder configured to encode video data;
a multiplexer configured to construct a first track including a video sample comprising a plurality of network access layer (NAL) units based on the encoded video data, wherein the video sample is included in an access unit, to construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and to include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
an output interface configured to output the video file.
[16]
The apparatus according to [15], wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.
[17]
The apparatus according to [15], wherein the multiplexer is configured to include in the second track, based on the encoded video data, one or more NAL units not included in the first track.
[18]
The apparatus according to [17], wherein the apparatus is configured to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies one or more of the plurality of NAL units of the second track.
[19]
The apparatus according to [15], wherein the extractor comprises a first extractor, wherein the multiplexer is configured to construct a third extractor track including a plurality of NAL units based on the encoded video data, and wherein the multiplexer is configured to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.
[20]
The apparatus according to [16], wherein the apparatus comprises at least one of:
an integrated circuit;
a microprocessor; and
a wireless communication device including the video encoder and the multiplexer.
[21]
An apparatus for encoding video data, the apparatus comprising:
means for constructing a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
means for constructing a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous;
means for including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
means for outputting the video file.
[22]
The apparatus according to [21], wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.
[23]
The apparatus according to [21], further comprising means for including in the second track, based on the encoded data, one or more NAL units not included in the first track.
[24]
The apparatus according to [23], further comprising means for constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[25]
The apparatus according to [21], wherein the extractor comprises a first extractor, the apparatus further comprising means for constructing a third extractor track including a plurality of NAL units based on the encoded video data, wherein the means for constructing the second track comprises means for constructing the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.
[26]
A computer-readable storage medium comprising instructions that, when executed, cause a processor to:
construct a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
construct a second track including an extractor that identifies at least one of the plurality of NAL units in the video sample of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous;
include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
output the video file.
[27]
The computer-readable storage medium according to [26], wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.
[28]
The computer-readable storage medium according to [26], further comprising instructions that cause the processor to include in the second track, based on the encoded data, one or more NAL units not included in the first track.
[29]
The computer-readable storage medium according to [28], further comprising instructions that cause the processor to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.
[30]
The computer-readable storage medium according to [26], wherein the extractor comprises a first extractor, wherein the computer-readable storage medium further comprises instructions that cause the processor to construct a third extractor track including a plurality of NAL units based on the encoded video data, and wherein the instructions that cause the processor to construct the second track comprise instructions that cause the processor to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.
[31]
A method for decoding video data, the method comprising:
receiving, by a demultiplexer of a destination device, a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, wherein the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, wherein the video sample is included in an access unit, wherein the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous;
selecting the second track to be decoded; and
sending encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the destination device.
[32]
The method of [31], further comprising discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
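The decoding-side behavior of [31] and [32] can be sketched as a demultiplexer that forwards only the NAL units named by the selected track's extractor and drops the rest of the first track. This is a simplified sketch under stated assumptions: the nested-list representation of tracks, the `Ref` tuple `(track_id, sample_index, nal_index)`, and the function `demultiplex` are hypothetical and do not reflect an actual file parser.

```python
from typing import Dict, List, Tuple

# Hypothetical extractor reference: (track_id, sample_index, nal_index).
Ref = Tuple[int, int, int]
# Each track is a list of samples; each sample is a list of NAL unit payloads.
Tracks = Dict[int, List[List[bytes]]]


def demultiplex(tracks: Tracks, refs: List[Ref],
                source_track_id: int) -> Tuple[List[bytes], List[bytes]]:
    """Return (NAL units sent to the decoder, NAL units discarded)."""
    # Resolve each reference, in extractor order, to the stored payload.
    to_decoder = [tracks[t][s][n] for (t, s, n) in refs]
    identified = set(refs)
    # Every NAL unit of the source track that no reference names is discarded.
    discarded = [
        nal
        for s, sample in enumerate(tracks[source_track_id])
        for n, nal in enumerate(sample)
        if (source_track_id, s, n) not in identified
    ]
    return to_decoder, discarded
```

With a first track holding one sample of three NAL units and references to the first and third, the decoder would receive those two units and the middle one would be discarded, so the decoder never sees data outside the selected operation point.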
[33]
The method according to [31], wherein the second track further comprises one or more NAL units not included in the first track, the method further comprising sending encoded video data of the one or more NAL units of the second track to the video decoder.
[34]
The method according to [31], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, the method further comprising sending encoded video data of the plurality of NAL units of the third track to the video decoder.
[35]
An apparatus for decoding video data, the apparatus comprising:
a video decoder configured to decode video data; and
a demultiplexer configured to receive a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, wherein the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, wherein the video sample is included in an access unit, wherein the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, to select the second track to be decoded, and to send encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to the video decoder.
[36]
The apparatus of [35], wherein the demultiplexer is configured to discard each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[37]
The apparatus according to [35], wherein the second track further comprises one or more NAL units not included in the first track, and wherein the demultiplexer is configured to send encoded video data of the one or more NAL units of the second track to the video decoder.
[38]
The apparatus according to [35], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, and wherein the demultiplexer is configured to send encoded video data of the plurality of NAL units of the third track to the video decoder.
[39]
An apparatus for decoding video data, the apparatus comprising:
means for receiving a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, wherein the video file comprises a first track and a second track, wherein the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, wherein the video sample is included in an access unit, wherein the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous;
means for selecting the second track to be decoded; and
means for sending encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the apparatus.
[40]
The apparatus of [39], further comprising means for discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[41]
The apparatus according to [39], wherein the second track further comprises one or more NAL units not included in the first track, the apparatus further comprising means for sending encoded video data of the one or more NAL units of the second track to the video decoder.
[42]
The apparatus according to [39], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, the apparatus further comprising means for sending encoded video data of the plurality of NAL units of the third track to the video decoder.
[43]
A computer-readable storage medium comprising instructions that, when executed, cause a processor to:
upon receiving a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format, select a second track to be decoded, wherein the video file comprises a first track and the second track, wherein the first track includes a video sample comprising a plurality of network access layer (NAL) units corresponding to encoded video data, wherein the video sample is included in an access unit, wherein the second track includes an extractor that identifies at least one of the plurality of NAL units of the first track, wherein the at least one of the plurality of NAL units comprises a first identified NAL unit, wherein the extractor identifies a second NAL unit of the access unit, and wherein the first identified NAL unit and the second identified NAL unit are non-contiguous; and
send encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder.
[44]
The computer-readable storage medium according to [43], further comprising discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.
[45]
The computer-readable storage medium according to [43], wherein the second track further comprises one or more NAL units not included in the first track, and wherein the method further comprises sending encoded video data of the one or more NAL units of the second track to the video decoder.
[46]
The computer-readable storage medium according to [43], wherein the video file further comprises a third track including a plurality of NAL units corresponding to encoded video data, and wherein the method further comprises sending encoded video data of the plurality of NAL units of the third track to the video decoder.

Claims

A method for encoding video data, the method comprising:
constructing, by a source video device, a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
constructing, by the source video device, a second track including a plurality of extractors, wherein the plurality of extractors includes an extractor that identifies a plurality of NAL units of the first track, wherein the identified plurality of NAL units includes a first identified NAL unit among the NAL units in the video sample of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit;
including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
outputting the video file.

The method of claim 1, wherein the video file is compliant with the ISO base media file format.

The method of claim 1, wherein the video file conforms to at least one of a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.

The method of claim 1, wherein constructing the second track further comprises including in the second track, based on the encoded data, one or more additional NAL units not included in the plurality of NAL units of the first track.

The method of claim 4, further comprising constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.

The method of claim 5, wherein constructing the third track further comprises including in the third track one or more NAL units not included in the first track and the second track.

The method of claim 1, wherein constructing the second track comprises constructing the extractor to identify each of the plurality of NAL units of the video sample of the first track, and wherein the extractor causes a destination device to extract each of the plurality of NAL units of the video sample in its entirety.

The method of claim 1, wherein constructing the second track comprises constructing the extractor to identify the one or more of the plurality of NAL units of the video sample by specifying one or more byte ranges of the one or more of the plurality of NAL units of the video sample in the first track of the video file.

The method of claim 1, wherein the plurality of NAL units of the video sample in the first track comprises at least one of a slice of a common picture, a non-video coding layer (VCL) NAL unit, a supplemental enhancement information (SEI) message NAL unit, a video layer of the access unit, a different view component of the access unit, and a NAL unit aggregated from a plurality of NAL units.

The method of claim 1, wherein the plurality of NAL units comprises a first plurality of NAL units, the method further comprising constructing a third track including a second plurality of NAL units based on the encoded video data, wherein the second plurality of NAL units forms part of the access unit, and wherein the second plurality of NAL units comprises the second identified NAL unit identified by the extractor.

The method of claim 1, wherein the video sample comprises a first video sample, wherein the plurality of NAL units comprises a first plurality of NAL units, wherein the first track further comprises a second sample comprising a second plurality of NAL units, wherein the access unit comprises the second sample, and wherein the second plurality of NAL units comprises the second NAL unit identified by the extractor.

The method of claim 1, wherein the plurality of NAL units of the video sample of the first track comprises the second NAL unit, and wherein the second NAL unit is separated from the first identified NAL unit in the video sample by at least one byte of data.

The method of claim 1, wherein the first track and the second track form a switch group such that either the first track or the second track can be selected for decoding by a destination device based on the characteristics of each track.

The method of claim 13, wherein constructing the second track further comprises:
signaling a frame rate of the second track;
signaling a temporal identifier of the video sample of the first track for the second track; and
when the second track comprises more than one view:
signaling a value representing the number of views to be displayed after decoding the second track;
signaling one or more view identifier values for the views to be displayed for the second track; and
signaling a value representing the number of views to be decoded for the second track.

An apparatus for encoding video data, the apparatus comprising:
an encoder configured to encode video data;
a multiplexer configured to construct a first track including a video sample comprising a plurality of network access layer (NAL) units based on the encoded video data, wherein the video sample is included in an access unit, and to construct a second track including a plurality of extractors, wherein the plurality of extractors includes an extractor that identifies a plurality of NAL units of the first track, wherein the identified plurality of NAL units includes a first identified NAL unit among the NAL units in the video sample of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit; and
an output interface configured to output a video file, wherein the multiplexer is further configured to include the first track and the second track in the video file, the video file conforming at least in part to the International Organization for Standardization (ISO) base media file format.

The apparatus of claim 15, wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.

The apparatus of claim 15, wherein the multiplexer is configured to include in the second track, based on the encoded video data, one or more NAL units not included in the first track.

The apparatus of claim 17, wherein the apparatus is configured to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies one or more of the plurality of NAL units of the second track.

The apparatus of claim 15, wherein the extractor comprises a first extractor, wherein the multiplexer is configured to construct a third extractor track including a plurality of NAL units based on the encoded video data, and wherein the multiplexer is configured to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.

The apparatus of claim 16, wherein the apparatus comprises at least one of:
an integrated circuit;
a microprocessor; and
a wireless communication device including the video encoder and the multiplexer.

An apparatus for encoding video data, the apparatus comprising:
means for constructing a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
means for constructing a second track including a plurality of extractors, wherein the plurality of extractors includes an extractor that identifies a plurality of NAL units of the first track, wherein the identified plurality of NAL units includes a first identified NAL unit among the NAL units in the video sample of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit;
means for including the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
means for outputting the video file.

The apparatus of claim 21, wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.

The apparatus of claim 21, further comprising means for including in the second track, based on the encoded data, one or more NAL units not included in the first track.

The apparatus of claim 23, further comprising means for constructing a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.

The apparatus of claim 21, wherein the extractor comprises a first extractor, the apparatus further comprising means for constructing a third extractor track including a plurality of NAL units based on the encoded video data, wherein the means for constructing the second track comprises means for constructing the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.

A computer-readable storage medium comprising instructions that, when executed, cause a processor to:
construct a first track including a video sample comprising a plurality of network access layer (NAL) units based on encoded video data, wherein the video sample is included in an access unit;
construct a second track including a plurality of extractors, wherein the plurality of extractors includes an extractor that identifies a plurality of NAL units of the first track, wherein the identified plurality of NAL units includes a first identified NAL unit among the NAL units in the video sample of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit;
include the first track and the second track in a video file that conforms at least in part to the International Organization for Standardization (ISO) base media file format; and
output the video file.

The computer-readable storage medium of claim 26, wherein the video file conforms to at least one of the ISO base media file format, a scalable video coding (SVC) file format, an advanced video coding (AVC) file format, a Third Generation Partnership Project (3GPP) file format, and a multiview video coding (MVC) file format.

The computer-readable storage medium of claim 26, further comprising instructions that cause the processor to include in the second track, based on the encoded data, one or more NAL units not included in the first track.

The computer-readable storage medium of claim 28, further comprising instructions that cause the processor to construct a third track including a first extractor that identifies one or more of the plurality of NAL units of the first track and a second extractor that identifies at least one of the one or more NAL units of the second track.

30. The computer readable storage medium of claim 26, wherein the extractor comprises a first extractor, the computer readable storage medium further comprising instructions that cause the processor to construct a third track including a plurality of NAL units based on the encoded video data, wherein the instructions that cause the processor to construct the second track comprise instructions that cause the processor to construct the second track to include a second extractor that identifies one or more of the plurality of NAL units of the third track.

A method for decoding video data, said method comprising:
Receiving, by a demultiplexer of a destination device, a video file that conforms at least in part to an International Organization for Standardization (ISO) base media file format, the video file comprising a first track and a second track, wherein the first track includes a video sample comprising a plurality of network abstraction layer (NAL) units corresponding to encoded video data, the video sample being included in an access unit, and the second track includes a plurality of extractors, the plurality of extractors including an extractor that identifies a plurality of NAL units of the first track, the identified plurality of NAL units including a first identified NAL unit of the NAL units of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit;
Selecting the second track to be decoded;
Sending the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the destination device.

32. The method of claim 31, further comprising discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.

33. The method of claim 31, wherein the second track further comprises one or more NAL units not included in the first track, the method further comprising sending encoded video data of the one or more NAL units of the second track to the video decoder.
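
The decoding-side steps recited above (selecting the second track, resolving its extractor, sending the identified NAL units to the decoder, and discarding the rest) can be sketched as follows. This is an illustration under stated assumptions, not the ISO base media file format syntax: `NalUnit`, `Extractor`, `nal_indices`, and `resolve_track` are hypothetical names, and the sketch assumes that every entry an extractor points at is a plain NAL unit rather than another extractor.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class NalUnit:
    """One NAL unit, reduced here to its raw payload (illustrative only)."""
    payload: bytes


@dataclass(frozen=True)
class Extractor:
    """Hypothetical reference to NAL units stored in a sample of another track."""
    track_id: int
    sample_index: int
    nal_indices: Tuple[int, ...]


Sample = List[Union[NalUnit, Extractor]]


def resolve_track(tracks: Dict[int, List[Sample]], selected: int) -> List[bytes]:
    """Return, in order, the payloads that would be sent to the video decoder
    when `selected` is the track chosen for decoding."""
    to_decoder: List[bytes] = []
    for sample in tracks[selected]:
        for entry in sample:
            if isinstance(entry, Extractor):
                # Follow the reference into the other track's sample.
                referenced = tracks[entry.track_id][entry.sample_index]
                for i in entry.nal_indices:
                    to_decoder.append(referenced[i].payload)
            else:
                # NAL unit stored directly in the selected track.
                to_decoder.append(entry.payload)
    return to_decoder


tracks = {
    1: [[NalUnit(b"nal0"), NalUnit(b"nal1"), NalUnit(b"nal2")]],
    2: [[Extractor(track_id=1, sample_index=0, nal_indices=(0, 2)), NalUnit(b"own")]],
}
sent = resolve_track(tracks, selected=2)
```

Here `sent` holds the two NAL units identified by the extractor followed by the NAL unit stored directly in the second track. The NAL unit `b"nal1"` of the first track is never forwarded, which corresponds to discarding NAL units that the extractor does not identify.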

34. The method of claim 31, wherein the video file further comprises a third track including a plurality of NAL units corresponding to the encoded video data, the method further comprising sending the encoded video data of the plurality of NAL units of the third track to the video decoder.

An apparatus for decoding video data, the apparatus comprising:
A video decoder configured to decode the video data;
A demultiplexer configured to receive a video file that conforms at least in part to an International Organization for Standardization (ISO) base media file format, the video file comprising a first track and a second track, wherein the first track includes a video sample comprising a plurality of network abstraction layer (NAL) units corresponding to encoded video data, the video sample being included in an access unit, and the second track includes a plurality of extractors, the plurality of extractors including an extractor that identifies a plurality of NAL units of the first track, the identified plurality of NAL units including a first identified NAL unit of the NAL units of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit, wherein the demultiplexer is further configured to select the second track to be decoded and to send the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to the video decoder.

36. The apparatus of claim 35, wherein the demultiplexer is configured to discard each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.

37. The apparatus of claim 35, wherein the second track further comprises one or more NAL units not included in the first track, and wherein the demultiplexer is configured to send encoded video data of the one or more NAL units of the second track to the video decoder.

38. The apparatus of claim 35, wherein the video file further comprises a third track including a plurality of NAL units corresponding to the encoded video data, and wherein the demultiplexer is configured to send the encoded video data of the plurality of NAL units of the third track to the video decoder.

An apparatus for decoding video data, the apparatus comprising:
Means for receiving a video file that conforms at least in part to an International Organization for Standardization (ISO) base media file format, the video file comprising a first track and a second track, wherein the first track includes a video sample comprising a plurality of network abstraction layer (NAL) units corresponding to encoded video data, the video sample being included in an access unit, and the second track includes a plurality of extractors, the plurality of extractors including an extractor that identifies a plurality of NAL units of the first track, the identified plurality of NAL units including a first identified NAL unit of the NAL units of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit;
Means for selecting the second track to be decoded;
Means for sending the encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder of the apparatus.

40. The apparatus of claim 39, further comprising means for discarding each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.

41. The apparatus of claim 39, wherein the second track further comprises one or more NAL units not included in the first track, the apparatus further comprising means for sending encoded video data of the one or more NAL units of the second track to the video decoder.

42. The apparatus of claim 39, wherein the video file further comprises a third track including a plurality of NAL units corresponding to the encoded video data, the apparatus further comprising means for sending the encoded video data of the plurality of NAL units of the third track to the video decoder.

A computer readable storage medium comprising instructions that, when executed, cause a processor to:
Upon receiving a video file that conforms at least in part to an International Organization for Standardization (ISO) base media file format, select a second track of the video file to be decoded, wherein the video file comprises a first track and the second track, wherein the first track includes a video sample comprising a plurality of network abstraction layer (NAL) units corresponding to encoded video data, the video sample being included in an access unit, and the second track includes a plurality of extractors, the plurality of extractors including an extractor that identifies a plurality of NAL units of the first track, the identified plurality of NAL units including a first identified NAL unit of the NAL units of the first track and a second identified NAL unit of the access unit, wherein the first identified NAL unit and the second identified NAL unit are non-contiguous, and wherein the extractor identifies the first NAL unit and the second NAL unit; and
Send encoded video data of the first NAL unit and the second NAL unit identified by the extractor of the second track to a video decoder.

44. The computer readable storage medium of claim 43, further comprising instructions that cause the processor to discard each of the plurality of NAL units of the first track that is not identified by the extractor of the second track.

45. The computer readable storage medium of claim 43, wherein the second track further comprises one or more NAL units not included in the first track, the computer readable storage medium further comprising instructions that cause the processor to send encoded video data of the one or more NAL units of the second track to the video decoder.

46. The computer readable storage medium of claim 43, wherein the video file further comprises a third track including a plurality of NAL units corresponding to the encoded video data, the computer readable storage medium further comprising instructions that cause the processor to send the encoded video data of the plurality of NAL units of the third track to the video decoder.