JP6309463B2

JP6309463B2 - System and method for providing error resilience, random access, and rate control in scalable video communication

Info

Publication number: JP6309463B2
Application number: JP2015000263A
Authority: JP
Inventors: エレフセリアディス，アレクサンドロス; ホン，ダニー; シャピロ，オファー; ヴィーガント，トーマス
Original assignee: ヴィドヨ，インコーポレーテッド
Priority date: 2006-03-03
Filing date: 2015-01-05
Publication date: 2018-04-11
Anticipated expiration: 2027-03-05
Also published as: CN101421936A; CA2644753A1; JP2009540629A; CN101421936B; JP5753341B2; JP2015097416A

Description

Cross-reference to related applications
This application claims the benefit of U.S. Provisional Application No. 60/778,760, filed March 3, 2006, U.S. Provisional Application No. 60/787,031, filed March 29, 2006, and U.S. Provisional Application No. 60/862,510, filed October 23, 2006. This application further claims the benefit of related International Patent Application Nos. PCT/US06/28365, PCT/US06/028366, PCT/US06/028367, PCT/US06/028368, PCT/US06/061815, PCT/US06/62569, and PCT/US07/62357, and of U.S. Provisional Application Nos. 60/884,148, 60/786,997, and 60/829,609. All of the aforementioned priority and related applications, which are assigned to the assignee of the present application, are hereby incorporated by reference in their entirety.

The present invention relates to video data communication systems. Specifically, the invention relates to simultaneously providing error resilience, random access, and rate control capabilities in video communication systems that use scalable video coding techniques.

Transmitting digital video over packet-based networks, such as networks based on the Internet Protocol (IP), is extremely challenging, not least because data transport is typically performed on a best-effort basis. In modern packet-based communication systems, errors typically manifest themselves as packet losses rather than bit errors. Furthermore, such packet losses are typically the result of congestion in intermediate routers and not of physical-layer errors (wireless and cellular networks being one exception). When an error occurs in the transmission or reception of a video signal, it is important to ensure that the receiver can recover from the error quickly and return to an error-free display of the incoming video signal. In typical digital video communication systems, however, the receiver's robustness is reduced by the fact that the incoming data is heavily compressed to conserve bandwidth. Moreover, the video compression techniques used in communication systems (e.g., the state-of-the-art codecs ITU-T H.264 and H.263, or the ISO MPEG-2 and MPEG-4 codecs) can create very strong temporal dependencies between sequential video packets or frames. Specifically, the use of motion-compensated prediction codecs (e.g., involving P or B frames) creates a chain of frame dependencies in which a displayed frame depends on one or more past frames. The chain of dependencies can extend all the way back to the beginning of the video sequence. As a result of the dependency chain, the loss of a given packet can affect the decoding of a number of subsequent packets at the receiver. Error propagation caused by the loss of a given packet terminates only at an "intra" (I) refresh point, or at a frame that uses no temporal prediction at all.

Error resilience in digital video communication systems requires at least some level of redundancy in the transmitted signal. This requirement, however, runs counter to the goal of video compression techniques, which strive to eliminate or minimize redundancy in the transmitted signal.

On a network that offers differentiated services (e.g., DiffServ IP-based networks, private networks over leased lines, etc.), a video data communication application can exploit network features to deliver some or all of the video signal data to the receiver in a lossless or nearly lossless manner. On an arbitrary best-effort network that provides no differentiated services (such as the Internet), however, a data communication application has to rely on its own features to achieve error resilience. Well-known techniques that are useful in general data communication (e.g., the Transmission Control Protocol, TCP) are not suitable for video or audio communication, which has the added constraint of low end-to-end delay arising from human interface requirements. For example, TCP techniques can be used for error resilience in data transport using the File Transfer Protocol. TCP keeps retransmitting data until it receives confirmation that all data has been received, even if this involves a delay of several seconds. TCP is nevertheless inappropriate for video data transport in live or interactive videoconferencing applications, because an unbounded end-to-end delay would be unacceptable to participants.

A related problem is that of random access. Assume that a receiver joins an existing transmission of a video signal. Typical examples are a user who joins a videoconference or a user who tunes in to a broadcast. Such a user has to find a point in the incoming bitstream at which decoding can start and the decoder can be synchronized with the encoder. Providing such random access points, however, has a considerable impact on compression efficiency. Note that a random access point is, by definition, an error resilience feature, since any error propagation terminates at that point (i.e., it is an error recovery point). Hence, the better the random access support provided by a particular coding scheme, the faster the error recovery that the scheme can achieve. The converse is not always true; it depends on the assumptions made about the duration and extent of the errors that the error resilience technique is designed to address. For error resilience, it may be assumed that some state information is available at the receiver at the time the error occurs.

As an example, in MPEG-2 video codecs for digital television systems (digital cable TV or satellite TV), I pictures are used at periodic intervals (typically 0.5 seconds) to enable fast switching into a stream. I pictures, however, are considerably larger than their P or B counterparts (typically by a factor of 3 to 6) and are therefore to be avoided, especially in low-bandwidth and/or low-delay applications.
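The cost of periodic I pictures can be estimated with simple arithmetic. The sketch below is only illustrative: it assumes a 30 fps stream (so a 0.5-second I-picture interval is 15 frames) and, as a simplification not made in the text, treats every non-I picture as costing the same as one P picture. The function name is hypothetical.

```python
def i_picture_overhead(period_frames: int, i_to_p_ratio: float) -> float:
    """Bit rate relative to a stream in which every picture costs one P picture.

    Simplifying assumption: one I picture per period, all other pictures
    cost exactly one "P-sized" unit.
    """
    if period_frames < 1:
        raise ValueError("period must contain at least one frame")
    return (period_frames - 1 + i_to_p_ratio) / period_frames


# One I picture every 0.5 s at 30 fps (assumed) is one I picture per 15 frames.
for ratio in (3.0, 6.0):
    factor = i_picture_overhead(15, ratio)
    print(f"I picture {ratio:g}x a P picture: bit rate x {factor:.3f}")
```

Under these assumptions the bit rate grows by roughly 13% to 33%, and the extra bits arrive in bursts, which is what makes periodic I pictures unattractive for low-delay links.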

In interactive applications such as videoconferencing, the concept of requesting an intra update is often used for error resilience. In operation, the update involves a request from the receiver to the sender for the transmission of an intra picture that enables the decoder to resynchronize. The bandwidth overhead of this operation is significant. Moreover, this overhead is incurred precisely when packet errors occur. If the packet losses are caused by congestion, the use of intra pictures only exacerbates the congestion problem.

Another traditional technique for error resilience, used in the past to mitigate drift caused by mismatch in IDCT implementations (e.g., in the H.261 standard), is to periodically code each macroblock in intra mode. The H.261 standard requires forced intra coding every 132 times a macroblock is transmitted.

Coding efficiency decreases as the percentage of macroblocks that are forced to be coded as intra in a given frame increases. Conversely, when this percentage is low, the time needed to recover from a packet loss increases. The forced intra coding process also requires extra care to avoid motion-related drift, which further limits the encoder's performance, because some motion vector values have to be avoided even when they would be the most effective.
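The trade-off between the forced-intra percentage and recovery time can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: macroblocks are refreshed cyclically in a fixed order, the picture is CIF (352x288, i.e. 396 macroblocks of 16x16), the frame rate is 30 fps, and motion vectors never re-import errors from areas that have not yet been refreshed. The function name is hypothetical.

```python
import math


def refresh_frames(total_mbs: int, intra_mbs_per_frame: int) -> int:
    """Frames needed to intra-code every macroblock once, cycling in order."""
    if intra_mbs_per_frame <= 0:
        raise ValueError("at least one macroblock must be refreshed per frame")
    return math.ceil(total_mbs / intra_mbs_per_frame)


CIF_MBS = (352 // 16) * (288 // 16)  # 22 x 18 = 396 macroblocks

for per_frame in (3, 12, 36):
    frames = refresh_frames(CIF_MBS, per_frame)
    print(f"{per_frame:2d} intra MBs/frame ({per_frame / CIF_MBS:.1%}): "
          f"full refresh in {frames} frames = {frames / 30:.2f} s at 30 fps")
```

A smaller intra share per frame costs fewer bits but stretches the worst-case recovery time proportionally, which is the tension the paragraph above describes.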

In addition to conventional single-layer codecs, layered or scalable coding is a well-known technique in multimedia data encoding. Scalable coding is used to generate two or more "scaled" bitstreams that collectively represent a given medium in a bandwidth-efficient manner. Scalability can be provided in several different dimensions, namely temporal, spatial, and quality (also referred to as SNR, "signal-to-noise ratio", scalability or fidelity scalability). For example, a video signal may be scalably coded in different layers at CIF and QCIF resolutions and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolution and frame rate may be obtainable from the codec bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer), or they can be multiplexed together into one or more bitstreams. For convenience of description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream. Codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently developed SVC (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC). Scalable coding techniques specifically designed for video communication are described in commonly assigned International Patent Application PCT/US06/028365, "SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING". Note that even codecs that are not specifically designed to be scalable can exhibit scalability characteristics in the temporal dimension. Consider, for example, an MPEG-2 Main Profile codec, a non-scalable codec used in DVD and digital TV environments. Assume further that the codec operates at 30 fps and that a GOP structure of IBBPBBPBBPBBPBB (period N = 15 frames) is used. By sequentially eliminating the B pictures, followed by the P pictures, it is possible to derive a total of three temporal resolutions: 30 fps (all picture types included), 10 fps (I and P only), and 2 fps (I only). The sequential elimination process yields a decodable bitstream because the MPEG-2 Main Profile codec is designed so that the coding of P pictures does not rely on B pictures and, similarly, the coding of I pictures does not rely on other P or B pictures. In the following, single-layer codecs with temporal scalability features are considered a special case of scalable video coding and are thus included in the term scalable video coding, unless explicitly indicated otherwise.
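The three temporal resolutions in the GOP example follow directly from counting picture types. The short sketch below reproduces that arithmetic; the function name and the dictionary labels are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def temporal_rates(gop: str, full_fps: int) -> dict[str, Fraction]:
    """Frame rate that remains after dropping B pictures, then P pictures."""
    n = len(gop)
    # Picture types retained at each temporal resolution.
    keep_sets = {"I+P+B": "IPB", "I+P": "IP", "I": "I"}
    return {
        name: Fraction(full_fps * sum(gop.count(t) for t in kinds), n)
        for name, kinds in keep_sets.items()
    }


rates = temporal_rates("IBBPBBPBBPBBPBB", 30)
for name, fps in rates.items():
    print(f"{name}: {float(fps):g} fps")
```

With one I, four P, and ten B pictures per 15-frame period, the retained pictures give 30, 10, and 2 fps respectively, matching the figures stated above.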

Scalable codecs typically have a pyramidal bitstream structure in which one of the constituent bitstreams (called the "base layer") is essential for recovering the original medium at some basic quality. Using one or more of the remaining bitstreams (called "enhancement layers") together with the base layer increases the quality of the recovered medium. Data losses in the enhancement layers may be tolerable, but data losses in the base layer can cause significant distortion or even complete loss of the recovered medium.

Scalable codecs pose challenges similar to those posed by single-layer codecs with respect to error resilience and random access. The coding structures of scalable codecs, however, have unique characteristics that are not present in single-layer video codecs. Further, unlike single-layer coding, scalable coding may involve switching from one scalability layer to another (e.g., switching back and forth between CIF and QCIF resolutions). Instantaneous layer switching, with very little bit rate overhead when switching between different resolutions, is desirable for random access in scalable coding systems in which multiple signal resolutions (spatial/temporal/quality) may be available from the encoder.

A problem related to those of error resilience and random access is that of rate control. The output of a typical video encoder has a variable bit rate, owing to the extensive use of prediction, transform, and entropy coding techniques. To construct a constant bit rate stream, buffer-constrained rate control is typically employed in video communication systems. In such a system, an output buffer is assumed at the encoder; this buffer is emptied at a constant rate (the channel rate), and the encoder monitors the buffer's occupancy and makes parameter selections (e.g., the quantizer step size) to prevent buffer overflow or underflow. Such a rate control mechanism, however, can only be applied at the encoder, and it further assumes that the desired output rate is known. In some video communication applications, including videoconferencing, it is desirable for such rate control decisions to be made at an intermediate gateway (e.g., a Multipoint Control Unit, or MCU) located between the sender and the receiver. Bitstream-level manipulation or transcoding can be used at the gateway, but at considerable cost in processing and complexity. It is therefore desirable to use a technique that achieves rate control without requiring any additional processing at the intermediate gateway.
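The buffer-constrained rate control loop described above can be sketched as a toy simulation. This is not the control law of any particular encoder: the inverse-proportional rate model (coded bits = complexity / step size), the linear mapping from buffer occupancy to quantizer step, the step range 1 to 31, and all names are assumptions made only for illustration.

```python
def simulate_rate_control(complexities, channel_bits, buffer_bits,
                          q_min=1, q_max=31):
    """Toy encoder-side loop: pick a quantizer step from buffer occupancy.

    complexities : per-frame "bits at step size 1" (hypothetical rate model)
    channel_bits : bits drained from the buffer per frame interval (constant)
    buffer_bits  : size of the assumed encoder output buffer
    Returns a list of (step, coded_bits, occupancy_after_frame) tuples.
    """
    fullness = buffer_bits / 2  # start half full
    trace = []
    for c in complexities:
        # Fuller buffer: coarser quantizer (fewer bits). Emptier: finer.
        ratio = fullness / buffer_bits
        q = round(q_min + ratio * (q_max - q_min))
        bits = c / q  # assumed rate model, not a real codec's behaviour
        # Constant-rate drain; clamp to the physical limits of the buffer.
        fullness = min(buffer_bits, max(0.0, fullness + bits - channel_bits))
        trace.append((q, bits, fullness))
    return trace


# A scene that doubles in complexity halfway through.
trace = simulate_rate_control([400_000] * 50 + [800_000] * 50,
                              channel_bits=40_000, buffer_bits=200_000)
print("step before change:", trace[49][0], "after change:", trace[99][0])
```

The loop needs both the encoder's internal parameters and the target channel rate, which is why, as noted above, this style of control cannot simply be moved to an intermediate gateway.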

Consideration is now being given to improving error resilience and random access to coded bitstreams, as well as rate control, in video communication systems. Attention is directed to developing error resilience, rate control, and random access techniques that have minimal impact on end-to-end delay and on the bandwidth used by the system.

The present invention provides systems and methods for increasing error resilience, and for providing random access and rate control capabilities, in video communication systems that use scalable video coding. The systems and methods also allow the derivation of an output signal at a resolution different from the coded resolutions, with excellent rate-distortion performance.

In one embodiment, the present invention provides a mechanism for recovering from the loss of packets of a high-resolution spatially scalable layer by using information from the low-resolution spatial layer. In another embodiment, the present invention provides a mechanism for switching from a low spatial or SNR resolution to a high spatial or SNR resolution with little or no delay. In another embodiment, the present invention provides a mechanism for performing rate control in which the encoder or an intermediate gateway (e.g., an MCU) selectively eliminates packets of the high-resolution spatial layer, anticipating the use of an appropriate error recovery mechanism at the receiver that minimizes the impact of the lost packets on the quality of the received signal. In another embodiment, the encoder or an intermediate gateway selectively replaces packets from the high-resolution spatial layer with information that effectively instructs the decoder to reconstruct an approximation of the high-resolution data being replaced, using information from the base layer and past frames of the enhancement layer. In another embodiment, the present invention describes a mechanism for deriving an output video signal at a resolution different from the coded resolutions, specifically at an intermediate resolution between those used for spatially scalable coding. These embodiments, alone or in combination, allow the construction of video communication systems with significant rate control and resolution flexibility, as well as error resilience and random access.
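The rate control embodiment, in which a gateway thins the stream by dropping high-resolution layer packets rather than transcoding, can be illustrated with a minimal sketch. The `Packet` container, the per-frame bit budget, and the rule "always forward the base layer, forward enhancement packets only while the budget allows" are assumptions for illustration; the text does not specify the selection policy. The sketch also assumes that each frame's base-layer packet arrives before its enhancement packets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; not an actual RTP or SVC structure."""
    frame: int
    layer: str  # "base" or "enh"
    bits: int


def thin_stream(packets, budget_bits_per_frame):
    """Forward every base-layer packet; drop enhancement packets over budget."""
    used = {}  # bits already forwarded for each frame
    kept = []
    for p in packets:
        spent = used.get(p.frame, 0)
        if p.layer == "base" or spent + p.bits <= budget_bits_per_frame:
            kept.append(p)
            used[p.frame] = spent + p.bits
        # Otherwise the packet is dropped; the receiver is expected to
        # conceal the missing enhancement data from the base layer.
    return kept


stream = [
    Packet(0, "base", 1000), Packet(0, "enh", 3000),
    Packet(1, "base", 1000), Packet(1, "enh", 1500),
]
for p in thin_stream(stream, budget_bits_per_frame=3000):
    print(p)
```

The gateway only inspects a layer tag and a size, so no decoding or re-encoding is needed, which is the property the embodiments above rely on.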

The inventive systems and methods are based on "error concealment" techniques used in conjunction with scalable coding techniques. These techniques simultaneously achieve error resilience and rate control for a particular family of video encoders referred to as scalable video encoders. The rate-distortion performance of the error concealment techniques matches or exceeds that of coding at the effective transfer rate (the total transmission rate minus the rate of the lost packets). With an appropriate choice of picture coding structure and transport mode, the techniques allow nearly instantaneous layer switching with very little bit rate overhead.

Further, the techniques can be used to derive a decoded version of the received signal at a resolution different from the coded resolution(s). This allows, for example, the creation of a 1/2 CIF (HCIF) signal from a spatially scalable coded signal at QCIF and CIF resolutions. With ordinary scalable coding, by contrast, the receiver would either have to use the QCIF signal and upsample it (with poor quality) or use the CIF signal and downsample it (with good quality but high bit rate utilization). The same problem exists when QCIF and CIF are simulcast as single-layer streams.

The techniques also provide rate control with minimal processing of the encoded video bitstream and without adversely affecting picture quality.

Further features, the nature, and various advantages of the invention will be more apparent from the following detailed description of the preferred embodiments and the accompanying drawings.

Throughout the drawings, the same reference numerals and characters are used, unless otherwise stated, to denote like features, elements, components, or portions of the illustrated embodiments. Moreover, while the present invention will now be described in detail with reference to the drawings, this is done in connection with the illustrative embodiments.

FIG. 1 is a block diagram illustrating the overall architecture of a videoconferencing system in accordance with the principles of the present invention.
FIG. 2 is a block diagram illustrating an exemplary end-user terminal.
FIG. 3 is a block diagram illustrating an exemplary end-user terminal in accordance with the principles of the present invention.
FIG. 4 is a diagram illustrating an exemplary picture coding structure in accordance with the principles of the present invention.
FIG. 5 is a diagram illustrating an example of an alternative picture coding structure in accordance with the principles of the present invention.
FIG. 6 is a block diagram illustrating an exemplary architecture of a video encoder for a spatial enhancement layer in accordance with the principles of the present invention.
FIG. 7 is a diagram illustrating an exemplary picture coding structure when spatial scalability is used, in accordance with the principles of the present invention.
FIG. 8 is a diagram illustrating an exemplary decoding process with concealment of enhancement layer pictures, in accordance with the principles of the present invention.
FIG. 9 is a diagram illustrating exemplary R-D curves of the concealment process when applied to the "Foreman" sequence, in accordance with the principles of the present invention.
FIG. 10 is a diagram illustrating an exemplary picture coding structure when spatial scalability with SR pictures is used, in accordance with the principles of the present invention.

Systems and methods are provided for error-resilient transmission, random access, and rate control in video communication systems. The systems and methods exploit error concealment techniques based on features of the scalable video coding that may be used in video communication systems.

In a preferred embodiment, the exemplary video communication system may be a multipoint videoconferencing system 10 operated over a packet-based network (see, e.g., FIG. 1). The multipoint videoconferencing system may include optional bridges 120a and 120b (e.g., Multipoint Control Units (MCUs) or Scalable Video Communication Servers (SVCSs)) to mediate scalable multilayer or single-layer video communications between endpoints (e.g., users 1-k and 1-m) over the network. The operation of the exemplary video communication system is the same, and equally advantageous, for a point-to-point connection with or without the use of the optional bridges 120a and 120b. The techniques described herein can be applied directly to all other video communication applications, including point-to-point streaming, broadcasting, multicasting, and the like.

Detailed descriptions of scalable video coding techniques and of videoconferencing systems based on scalable video coding are provided, for example, in commonly assigned International Patent Application Nos. PCT/US06/28365 and PCT/US06/28366. Further descriptions of scalable video coding techniques and of videoconferencing systems based on scalable video coding are provided in commonly assigned International Patent Application Nos. PCT/US06/62569 and PCT/US06/061815.

FIG. 1 shows the general structure of videoconferencing system 10. Videoconferencing system 10 includes a plurality of end-user terminals (e.g., users 1-k and users 1-m) that are linked over network 100 via LANs 1 and 2 and servers 120a and 120b. The servers may be conventional MCUs, or they may be Scalable Video Coding Servers (SVCS) or Compositing Scalable Video Coding Servers (CSVCS). The latter servers serve the same purpose as conventional MCUs, but with significantly reduced complexity and improved functionality (see, e.g., International Patent Applications PCT/US06/28366 and PCT/US06/62569). In the description herein, the term "server" may be used generically to refer to either an SVCS or a CSVCS.

FIG. 3 shows the architecture of an end-user terminal 140 that is designed for use with videoconferencing systems (e.g., system 100) based on multilayer coding. Terminal 140 includes human interface input/output devices (e.g., a camera 210A, a microphone 210B, a video display 250C, and a speaker 250D) and one or more network interface controller cards (NICs) 230 coupled to input and output signal multiplexer and demultiplexer units (e.g., packet MUX 220A and packet DMUX 220B). NIC 230 may be a standard hardware component, such as an Ethernet LAN adapter, any other suitable network interface device, or a combination thereof.

Camera 210A and microphone 210B are designed to capture participant video and audio signals, respectively, for transmission to other conference participants. Conversely, video display 250C and speaker 250D are designed to display and play back, respectively, the video and audio signals received from other participants. Video display 250C may also be configured to optionally display the participant's/terminal 140's own video. The outputs of camera 210A and microphone 210B are coupled to video encoder 210G and audio encoder 210H via analog-to-digital converters 210E and 210F, respectively. Video encoder 210G and audio encoder 210H are designed to compress the input digital video and audio signals in order to reduce the bandwidth necessary for transmitting the signals over the electronic communication network. The input video signal may be a live video signal or a pre-recorded, stored video signal.

In an exemplary embodiment of the invention, the audio signal may be encoded using any suitable technique known in the art (e.g., G.711, G.729, G.729EV, MPEG-1, etc.). In a preferred embodiment of the invention, the scalable audio codec G.729EV is used by audio encoder 210H to encode the audio signal. The output of audio encoder 210H is sent to multiplexer MUX 220A and transmitted over network 100 through NIC 230.

Packet MUX 220A may perform conventional multiplexing using the RTP protocol. Packet MUX 220A may also perform any related quality-of-service (QoS) processing that may be offered by network 100 or directly by the video communication application (see, e.g., International Patent Application No. PCT/US06/061815). Each stream of data from terminal 140 is transmitted on its own virtual channel, or "port number" in IP terminology.

Video encoder 210G is a scalable video encoder that has multiple outputs corresponding to the various layers (labeled "base" and "enhancement" in the figure). Note that simulcasting is a special case of scalable coding in which no inter-layer prediction takes place. In the following, the term scalable coding, when used, includes the simulcast case. The operation of the video encoder and the nature of its multiple outputs are described in more detail below.

The H.264 standard specification makes it possible to combine the views of multiple participants into a single coded picture by using the flexible macroblock ordering (FMO) scheme. In this scheme, each participant occupies a portion of the coded picture corresponding to one of its slices. Conceptually, a single decoder can be used to decode all participant signals. From a practical point of view, however, the receiver/terminal must decode several smaller, independently coded slices. Thus, terminal 140 shown in FIG. 3 with decoders 230A can be used in applications of the H.264 specification. Note that the server that forwards the slices is a CSVCS.

In terminal 140, demultiplexer DMUX 220B receives packets from NIC 230 and redirects them to the appropriate decoder unit 230A.

The SERVER CONTROL block in terminal 140 coordinates the interaction between the server (SVCS/CSVCS) and the end-user terminal, as described in International Patent Application Nos. PCT/US06/028366 and PCT/US06/62569. In a point-to-point communication system without an intermediate server, the SERVER CONTROL block is not needed. Similarly, in non-conferencing applications, in point-to-point conferencing applications, or when a CSVCS is used, only a single decoder may be needed at the receiving end-user terminal. For applications involving stored video (e.g., broadcast of pre-recorded, pre-coded material), the transmitting end-user terminal may not require the full functionality of the audio and video encoding blocks, or any of the blocks that precede them (camera, microphone, etc.). Specifically, only the portions related to the selective transmission of video packets, described below, need to be provided.

Although the word "terminal" is used in this context, the various components of the terminal may be separate devices interconnected with one another, may be integrated in a personal computer as software or hardware, or may be a combination thereof.

Base layer video encoder 300, which illustrates an exemplary architecture of a video encoder (base layer and temporal enhancement layers), is now described. Encoder 300 includes a FRAME BUFFERS block 310 and an encoder reference control (ENC REF CONTROL) block 320, in addition to a conventional "textbook" variety video coding processing block 330 for motion estimation (ME), motion compensation (MC), and other encoding functions. Video encoder 300 may be designed, for example, according to H.264/MPEG-4 AVC (ITU-T and ISO/IEC JTC 1, "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG4-AVC)) or SVC (J. Reichel, H. Schwarz, and M. Wien, eds., "Joint Scalable Video Model JSVM 4," JVT-Q202, Document of Joint Video Team (JVT) of ITU-T SG16/Q.6 and ISO/IEC JTC 1/SC 29/WG 11, October 2005). It should be understood that any other suitable codec or design may be used for the video encoder, including, for example, the designs disclosed in International Patent Application Nos. PCT/US06/28365 and PCT/US06/62569. If spatial scalability is used, a DOWNSAMPLER is optionally used at the input to reduce the input resolution (e.g., from CIF to QCIF).

ENC REF CONTROL block 320 is used to create a "threaded" coding structure (see, e.g., International Patent Application No. PCT/US06/28365). Standard block-based motion-compensated codecs have a regular structure of I, P, and B frames. For example, in a picture sequence such as IBBPBBP (in display order), a P frame is predicted from the previous P or I frame in the sequence, whereas a B picture is predicted using both the previous and the next P or I frame. The number of B pictures between successive I or P pictures can vary, as can the rate at which I pictures appear, but a P picture cannot, for example, use as its prediction reference another P picture that temporally precedes the most recent P picture. The H.264 coding standard advantageously provides an exception: two reference picture lists are maintained by the encoder and the decoder, respectively, together with appropriate signaling information that provides for reordering and selective use of pictures from within these lists. This exception can be exploited to select which pictures are used as references, and also which references are used for a particular picture to be coded. In exemplary base layer video encoder 300, FRAME BUFFERS block 310 represents the memory that stores the reference picture list(s). ENC REF CONTROL block 320 is designed to determine, at the encoder side, which reference picture is to be used for the current picture.

The operation of ENC REF CONTROL block 320 is placed in further context with reference to the exemplary layered picture coding "threading" or "prediction chain" structure 400 shown in FIG. 4, in which the letter "L" denotes an arbitrary scalability layer and is followed by a number indicating the temporal layer (0 being the lowest, or coarsest). The arrows indicate the direction, source, and target of prediction. L0 is simply a series of regular P pictures spaced four pictures apart. L1 has the same frame rate as L0, but its pictures may be predicted only from the previous L0 frame. L2 frames are predicted from the most recent L0 or L1 frame. L0 provides one quarter (1:4) of the full temporal resolution, L1 doubles the L0 frame rate (1:2), and L2 doubles the L0+L1 frame rate (1:1).
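The reference selection rule described for structure 400 can be sketched in a few lines of Python. This is only an illustration, not the patent's implementation: the function names are hypothetical, and the picture ordering within each four-picture period (L0, L2, L1, L2) is an assumption inferred from the stated 1:4, 1:2, and 1:1 frame rates, since FIG. 4 itself is not reproduced here.

```python
def temporal_layer(n: int) -> str:
    """Return the temporal layer label for picture index n (display order).

    Assumes the period-4 ordering L0, L2, L1, L2 (an assumption, see lead-in).
    """
    if n % 4 == 0:
        return "L0"
    if n % 4 == 2:
        return "L1"
    return "L2"


def reference_index(n: int):
    """Return the index of the picture that picture n is predicted from.

    Returns None for picture 0, which has no earlier picture to predict from.
    """
    if n == 0:
        return None
    layer = temporal_layer(n)
    if layer == "L0":
        return n - 4  # previous L0 picture, four pictures back
    if layer == "L1":
        return n - 2  # previous L0 picture only
    return n - 1      # most recent L0 or L1 picture


# Layers and prediction references for the first eight pictures.
layers = [temporal_layer(n) for n in range(8)]
refs = [reference_index(n) for n in range(8)]
print(layers)
print(refs)
```

Under these assumptions, dropping every L2 picture leaves a sequence at half the frame rate whose references are all still present, which is the property the threaded structure is meant to provide.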

Depending on the requirements of a particular implementation of the invention, additional or fewer layers may be constructed in the same way to address different bit rate/scalability requirements. A simple example is shown in FIG. 5, in which a traditional prediction series of IPPP... frames has been converted into two layers.

The codec 300 used in embodiments of the invention may be configured to generate a set of separate picture "threads" (e.g., a set of three threads 410-430) in order to enable multiple levels of temporal scalability resolution (e.g., L0-L2) and other enhancement resolutions (e.g., S0-S2). A thread, or prediction chain, is defined as a sequence of pictures that are motion-compensated using either pictures from the same thread or pictures from a lower-level thread. The arrows in FIG. 4 indicate the direction, source, and target of prediction for the three threads 410-430. Threads 410-420 have a common source L0 but different targets and paths (e.g., targets L2, L2, and L0, respectively). The use of threads enables temporal scalability, because any number of top-level threads can be removed without affecting the decoding process of the remaining threads.

Note that, within encoder 300, the ENC REF CONTROL block can use only P pictures as reference pictures. The use of B pictures with both forward and backward prediction increases the coding delay by the time required to capture and encode the reference pictures used for the B pictures. In traditional interactive communication, the use of B pictures with prediction from future pictures increases the coding delay and is therefore avoided. B pictures can nevertheless be used, since they improve overall compression efficiency. Even using a single B picture within a set of threads (e.g., by coding L2 as a B picture) can improve compression efficiency. For applications that are not delay-sensitive, some or all of the pictures (with the possible exception of L0) can be B pictures with bidirectional prediction. Note that, for the H.264 standard in particular, B pictures can be used without incurring extra delay, because the standard allows the use of two motion vectors that both refer to reference pictures in the past in display order. Such B pictures can be used without any increase in coding delay compared to P picture coding. Similarly, the L0 pictures can be I pictures forming a traditional group of pictures (GOP).

Referring again to exemplary base layer video encoder 300, the base layer encoder can be augmented to create spatial and/or quality enhancement layers, as described, for example, in the H.264 SVC draft standard and in International Patent Application No. PCT/US06/28365. FIG. 6 shows the structure of an exemplary encoder 600 that creates a spatial enhancement layer. The structure of encoder 600 is similar to that of base layer codec 300, with the additional feature that the base layer information is also made available to encoder 600. This information may include motion vector data, macroblock mode data, coded prediction error data, and reconstructed pixel data. Encoder 600 can reuse some or all of this information to make coding decisions for the enhancement layer. For this purpose, the base layer data must be scaled to the target resolution of the enhancement layer (e.g., by a factor of two if the base layer is QCIF and the enhancement layer is CIF). Spatial scalability usually requires that two coding loops be maintained, but single-loop decoding is possible (e.g., under the H.264 SVC draft standard) if the base layer data used for enhancement layer coding is restricted to values that can be computed from the information encoded in the base layer of the current picture. For example, if a base layer macroblock is inter-coded, the enhancement layer cannot use the reconstructed pixels of that macroblock as a basis for prediction. It can, however, use the macroblock's motion vectors and prediction error values, since these are obtainable by decoding only the information contained in the current base layer picture. Single-loop decoding is desirable because it significantly reduces decoder complexity.

The threading structure can be used for enhancement layer frames in the same way as for base layer frames. FIG. 7 shows an exemplary threading structure 700 for enhancement layer frames that follows the design shown in FIG. 4. In FIG. 7, the enhancement layer blocks in structure 700 are denoted by the letter "S". Note that the threading structures of the enhancement layer frames and of the base layer can differ, as described in International Patent Application No. PCT/US06/28365.

Furthermore, a similar enhancement layer codec for quality scalability can be constructed, as described, for example, in the SVC draft standard and in International Patent Application No. PCT/US06/28365. In such a codec for quality scalability, rather than creating the enhancement layer from a higher-resolution version of the input, the enhancement layer is created by coding the residual prediction error at the same spatial resolution as the input. As with spatial scalability, all of the macroblock data of the base layer can be reused in the enhancement layer for quality scalability, in either a single-loop or a dual-loop coding configuration.

For brevity, the following description is limited to spatial scalability, but it should be understood that the techniques described are also applicable to quality or fidelity scalability.

Note that, because of the inherent temporal dependencies that arise from motion-compensated prediction in modern video codecs, packet loss in a particular picture affects not only the quality of that picture but also, directly or indirectly, all future pictures for which that picture serves as a reference. This is because the reference frame that the decoder can construct for future prediction is no longer identical to the reference frame used at the encoder. The resulting difference, or drift, can have a severe impact on the visual quality of the decoded video signal. However, as described in International Patent Application Nos. PCT/US06/28365 and PCT/US06/061815, structure 400 (FIG. 4) has distinct advantages in terms of robustness in the presence of transmission errors.

As shown in FIG. 4, threading structure 400 creates three self-contained dependency chains. A packet loss that occurs in an L2 picture affects only L2 pictures; the L0 and L1 pictures can still be decoded and displayed. Similarly, a packet loss that occurs in an L1 picture affects only L1 and L2 pictures; the L0 pictures can still be decoded and displayed.
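The containment behavior of the dependency chains can be checked with a small simulation. The sketch below is illustrative only: it assumes the same period-4 ordering (L0, L2, L1, L2) and single-reference prediction rule used for structure 400 above, and the function names are hypothetical.

```python
def reference_index(n: int):
    """Reference picture index for picture n under the assumed period-4 structure."""
    if n == 0:
        return None       # first picture has no reference
    if n % 4 == 0:
        return n - 4      # L0 predicted from the previous L0
    if n % 4 == 2:
        return n - 2      # L1 predicted from the previous L0
    return n - 1          # L2 predicted from the most recent L0 or L1


def affected_pictures(lost: int, count: int) -> list:
    """Return the indices in range(count) that are damaged when picture `lost` is dropped.

    A picture is damaged if it is the lost picture or if its reference is damaged.
    """
    damaged = {lost}
    for n in range(lost + 1, count):
        if reference_index(n) in damaged:
            damaged.add(n)
    return sorted(damaged)


print(affected_pictures(1, 12))  # loss in an L2 picture
print(affected_pictures(2, 12))  # loss in an L1 picture
print(affected_pictures(4, 12))  # loss in an L0 picture
```

In this model a lost L2 picture damages only itself, a lost L1 picture damages itself and the following L2 picture, and only a lost L0 picture propagates indefinitely, which is why the lowest layer is the one that most needs reliable delivery.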

The same error containment property of the threads extends to S packets. For example, with structure 700 (FIG. 7), a loss that occurs in an S2 picture affects only that particular picture, whereas a loss in an S1 picture also affects the S2 picture that follows. In either case, the drift ends when the next S0 picture is decoded.

With the threaded structure, if the base layer pictures and some of the enhancement layer pictures are transmitted in a way that guarantees their delivery, the remaining layers can be transmitted on a best-effort basis without catastrophic consequences in the event of packet loss. The required guaranteed transmission can be performed using DiffServ, FEC techniques, or other suitable techniques known in the art. In the description herein, guaranteed and best-effort transmission are assumed to take place over two actual or virtual channels that provide such differentiated quality of service, for example a High Reliability Channel (HRC) and a Low Reliability Channel (LRC), respectively (see, e.g., International Patent Application Nos. PCT/US06/028366 and PCT/US06/061815).

Consider, for example, that layers L0-L2 and S0 are transmitted on the HRC, and that S1 and S2 are transmitted on the LRC. Although the loss of an S1 or S2 packet causes only limited drift, it is still desirable to conceal the loss of information as far as possible. Concealment of a lost S1 or S2 picture can use only the information available to the decoder, namely past S2 pictures and the coded information of the base layer of the current picture.

An exemplary concealment technique according to the invention takes the base layer information of the lost enhancement layer frame and uses it in the decoding loop of the enhancement layer. The base layer information that can be used includes motion vector data (appropriately scaled for the target layer resolution), coded prediction error differences (upsampled to the enhancement layer resolution if necessary), and intra data (upsampled to the enhancement layer resolution if necessary). Prediction references from previous pictures are taken, when needed, from the enhancement layer resolution pictures rather than from the corresponding base layer pictures. This data allows the decoder to reconstruct a very good approximation of the missing frame, thereby minimizing the actual and perceived distortion for that frame. Moreover, because a good approximation of the missing frame is now available, any dependent frames can also be decoded.

FIG. 8 shows exemplary steps 810-840 of a concealment decoding process 800, using the example of a two-layer spatial scalability coded signal with QCIF and CIF resolutions and two prediction threads (L0/S0 and L1/S1). It should be understood that process 800 is applicable to other resolutions and to numbers of threads different from those shown. In this example, it is assumed that, at coded data arrival step 810, the coded data for L0, S0, and L1 arrives intact at the receiving terminal, but that the coded data for S1 is lost. It is further assumed that all coded data for the pictures preceding the picture corresponding to time t0 has also been received at the receiving terminal. The decoder can therefore correctly decode both the QCIF and the CIF picture at time t0. The decoder can further use the information contained in L0 and L1 to reconstruct a correctly decoded L1 picture corresponding to time t1.

FIG. 8 shows a specific example in which, at base layer decoding step 820, block LB1 of the L1 picture at time t1 has been encoded using motion-compensated prediction with motion vector LMV1 and a residual LRES1 that must be added to the motion-compensated prediction. The LMV1 and LRES1 data is contained in the L1 data received by the receiving terminal. The decoding process requires block LB0 from the previous base layer picture (the L0 picture), which is available at the decoder as a result of the normal decoding process. Because the S1 data is assumed to be lost in this example, the decoder cannot use the corresponding information to decode the enhancement layer picture.

Concealment decoding process 800 constructs an approximation of enhancement layer block SB1. At concealment data generation step 830, process 800 generates concealment data by obtaining the coded data of the corresponding base layer block LB1, in this example LMV1 and LRES1. Process 800 then scales the motion vector to the resolution of the enhancement layer to construct the enhancement layer motion vector SMV1. In the two-layer video signal example under consideration, SMV1 equals twice LMV1, because the ratio of the resolutions of the scalable signal is 2. In addition, concealment decoding process 800 upsamples the base layer residual signal to the resolution of the enhancement layer, by a factor of two in each dimension, and then optionally low-pass filters the result with filter LPF, in accordance with well-known principles of sample rate conversion. The further result of concealment data generation step 830 is the residual signal SRES1. The next step 840 (enhancement layer decoding with concealment) uses the constructed concealment data SMV1 and SRES1 to approximate block SB1. Note that this approximation requires block SB0 from the previous enhancement layer picture, which is assumed to be available at the decoder as a result of the normal decoding process of the enhancement layer. Other coding modes can be handled in the same or a similar manner.
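Steps 830 and 840 can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not the patent's implementation: the function names are hypothetical, the residual is upsampled by plain sample repetition with no low-pass filter (the text leaves the interpolation and the optional LPF unspecified), motion vectors are whole-pixel (dx, dy) pairs, and no clipping or picture-boundary handling is done.

```python
def scale_motion_vector(lmv, ratio=2):
    """Scale a base layer motion vector (dx, dy) to enhancement layer resolution."""
    dx, dy = lmv
    return (dx * ratio, dy * ratio)


def upsample_residual(block, ratio=2):
    """Upsample a 2-D residual block by `ratio` in each dimension.

    Uses sample repetition (an assumption made for brevity).
    """
    out = []
    for row in block:
        wide = [v for v in row for _ in range(ratio)]
        out.extend(list(wide) for _ in range(ratio))
    return out


def conceal_block(prev_enh, top, left, smv, sres):
    """Approximate SB1 as the motion-compensated block from the previous
    enhancement layer picture (containing SB0) plus the upsampled residual."""
    dx, dy = smv
    return [
        [prev_enh[top + dy + r][left + dx + c] + sres[r][c]
         for c in range(len(sres[0]))]
        for r in range(len(sres))
    ]


# Hypothetical data: a 2x2 base layer block and an 8x8 previous enhancement picture.
lmv1 = (1, -1)
lres1 = [[1, 2], [3, 4]]
prev_picture = [[10 * r + c for c in range(8)] for r in range(8)]

smv1 = scale_motion_vector(lmv1)    # step 830: SMV1 = 2 * LMV1
sres1 = upsample_residual(lres1)    # step 830: SRES1 from LRES1
sb1 = conceal_block(prev_picture, top=2, left=0, smv=smv1, sres=sres1)  # step 840
print(smv1)
print(sb1)
```

A practical decoder would also handle sub-pixel motion vectors, clipping to the valid sample range, and the other coding modes mentioned in the text.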

A further exemplary application of the inventive concealment technique concerns high-resolution images. For high-resolution images (e.g., greater than CIF), more than one MTU (maximum transmission unit) is often needed to transmit an enhancement layer frame. If the probability of successful transmission of a single MTU-sized packet is p, then the probability of successful transmission of a frame consisting of n MTUs is pⁿ. Conventionally, all n packets must be delivered successfully in order to display such a frame.
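To make the effect of frame size concrete, the pⁿ relationship can be evaluated directly. The sketch below assumes independent packet losses, which is what the pⁿ expression implies; the function name and the sample values of p and n are illustrative only.

```python
def frame_success_probability(p: float, n: int) -> float:
    """Probability that all n MTU-sized packets of a frame arrive,
    assuming each packet independently succeeds with probability p."""
    return p ** n


# Illustrative values: 95% per-packet success, frames of 1, 4, and 8 packets.
for n in (1, 4, 8):
    print(n, round(frame_success_probability(0.95, n), 3))
```

With a 95% per-packet success rate, a four-packet frame arrives complete only about 81% of the time, which is the motivation for decoding whatever slices do arrive rather than discarding the whole frame.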

In an application of the inventive concealment technique, S layer frames are broken up at the encoder into MTU-sized slices for transmission. At the decoder side, whatever slices of a received S picture are available are used. Missing slices are compensated for using the concealment method (e.g., process 800), thereby reducing the overall distortion.

In laboratory experiments, this concealment technique provided similar or better performance when compared with direct coding at the effective communication rate (total rate minus loss rate). For the experiments, it was assumed that layers L0-L2 are transmitted reliably on the HRC, while layers S1 and S2 are transmitted on the LRC. The actual quality loss in terms of Y-PSNR was in the range of 0.2-0.3 dB per 5% of packet loss, clearly outperforming other known concealment techniques such as frame copy or motion-compensated frame copy. (See, e.g., S. Bandyopadhyay, Z. Wu, P. Pandit, and J. Boyce, "Frame Loss Error Concealment for H.264/AVC," Doc. JVT-P072, Poznan, Poland, July 2005, who report losses of several dB even at a loss rate of 5% in an evaluation of single-layer AVC coding using an IPP...PI structure with an I period of 1 second.) These laboratory results demonstrate that the technique is effective in providing error resilience in scalable codecs.

FIG. 9 shows rate-distortion curves obtained using the standard "Foreman" video test sequence with different QPs. For each QP, rate-distortion values were obtained by dropping different amounts of S1 and S2 frames while applying the inventive error concealment technique described above. As can be seen in FIG. 9, the rightmost point of each QP curve corresponds to no loss, and the subsequent points (from right to left) correspond to dropping 50% of S2, 100% of S2, 100% of S2 and 50% of S1, and 100% of both S1 and S2. The R-D curve of the codec, obtained by connecting the zero-loss points for the different QPs, is overlaid. FIG. 9 shows that the various curves, particularly for QPs below 30, are close to the R-D curve and in some cases lie above it. This difference is expected to be eliminated by further optimization of the underlying codec used.

The laboratory results indicate that the Y-PSNR matches that of the same encoder operating at the effective transmission rate, which implies that the concealment technique can be advantageously used for rate control. The effective transmission rate is defined as the transmission rate minus the loss rate, that is, the rate computed from the packets that actually arrive at the destination. The bit rate corresponding to the S1 and S2 frames is typically 30% of the total for the particular coding structure, so any bit rate between 70% and 100% of the total can be achieved for rate control by selecting the number of S2 frames, or of S1 and S2 frames, that are dropped within a given time period.

An even wider range for rate control can be obtained with picture coding structures that use LR/SR pictures, as described, for example, in International Patent Application No. PCT/US06/061815. With such a picture structure, instead of transmitting S0 on the HRC, it is possible to include only the SR pictures, which have lower temporal resolution, on the HRC.

Table 1 summarizes the rate percentages of the different frame types for a typical video sequence (e.g., spatial scalability, QCIF-CIF resolution, three-layer threading, 380 Kbps).

By combining different frame types, the concealment technique can achieve practically any desired rate. For example, when all L0-L2 and S0 pictures are included and only one out of 10 S1 pictures is dropped, a rate of about 72 + 1.8 = 73.8% of the total can be achieved. Alternative techniques known in the art, such as Fine Granularity Scalability (FGS), attempt to achieve similar rate flexibility, but with very poor rate-distortion performance and considerable computational overhead. The concealment technique of the present invention offers the rate scalability associated with FGS without the coding efficiency penalty associated with such techniques.
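The arithmetic of this example can be sketched as follows. Table 1 is not reproduced in this text, so the per-layer shares below are assumptions: only the 72% figure for L0-L2 plus S0 comes from the example, while the 18% and 10% shares for S1 and S2 are hypothetical values chosen so that the total is 100% and one tenth of the S1 share equals the 1.8% term. The function names are likewise illustrative.

```python
# Hypothetical rate shares (percent of the total bit rate). Only the 72%
# figure comes from the text; the S1/S2 split is assumed for illustration.
BASE_SHARE = 72.0   # L0-L2 and S0, always transmitted
S1_SHARE = 18.0     # assumed
S2_SHARE = 10.0     # assumed


def achieved_rate(s1_kept: float, s2_kept: float) -> float:
    """Return the percentage of the full bit rate that is transmitted when
    the given fractions (0.0 to 1.0) of S1 and S2 pictures are kept."""
    if not (0.0 <= s1_kept <= 1.0 and 0.0 <= s2_kept <= 1.0):
        raise ValueError("kept fractions must lie in [0, 1]")
    return BASE_SHARE + S1_SHARE * s1_kept + S2_SHARE * s2_kept


def plan_drops(target: float) -> tuple:
    """Choose kept fractions (s1_kept, s2_kept) that meet a target rate,
    dropping S2 pictures before S1 pictures (the order of the points on
    the curves of FIG. 9)."""
    if not (BASE_SHARE <= target <= 100.0):
        raise ValueError("target outside the reachable range")
    extra = target - BASE_SHARE
    s1_kept = min(1.0, extra / S1_SHARE)
    s2_kept = max(0.0, (extra - S1_SHARE) / S2_SHARE)
    return s1_kept, s2_kept
```

Under these assumed shares, `achieved_rate(0.1, 0.0)` reproduces the 73.8% figure, and `plan_drops` returns a pair of kept fractions for any target between 72% and 100%.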

The intentional removal of S1 and S2 frames from a video transmission can be performed either at the encoder or at an available intermediate gateway (e.g., an SVCS/CSVCS).

Further, it will be understood that the application of the inventive concealment technique to rate control has been described herein, for purposes of illustration only, with reference to the loss of S1 frames in a two-layer structure. In practice, the technique is not limited to a particular threading structure, but can be applied to any spatially scalable codec that uses a pyramidal temporal structure (e.g., structures that include more than one quality or spatial level, different temporal structures, etc.).

A further use of the inventive concealment technique is to display the video signal at a resolution that lies between two coded resolutions. For example, assume that a video signal is coded at QCIF and CIF resolutions using a spatially scalable codec. If a user wishes to display the output at 1/2 CIF resolution (HCIF), a conventional decoder would follow one of two approaches: 1) decode the QCIF signal and upsample it to HCIF, or 2) decode the CIF signal and downsample it to HCIF. In the first case, the HCIF picture quality is poor, although the bit rate used is low. In the second case, the quality can be very good, but the bit rate used is nearly double that required by the first approach. These shortcomings of conventional decoders are overcome by the inventive error concealment technique.

For example, intentionally discarding all S1 and S2 frames can yield a significant bandwidth reduction with very little quality degradation when the S1/S2 error concealment technique described herein is applied. Downsampling the resulting decoded CIF signal yields a very good rendition of the HCIF signal. It is noted that conventional simulcast techniques, in which separate single-layer streams are transmitted at QCIF and CIF resolutions, do not allow such derivation of an intermediate-resolution signal at a usable bit rate unless the frame rate is also reduced. The inventive concealment technique exploits spatial scalable coding to derive the intermediate-resolution signal at a usable bit rate.

In practice, applying the inventive concealment technique to derive an intermediate resolution requires operation of the enhancement layer decoding loop for S0 at full resolution. This decoding involves both the generation of the decoded prediction error and the application of motion compensation at full resolution. To reduce computational requirements, only the decoded prediction error may be generated at full resolution and then downsampled to the target resolution (e.g., HCIF). The reduced-resolution signal can then be motion compensated using appropriately scaled motion vectors and residual information. This technique can be used for any portion of the "S" layers that is retained for transmission to the receiver. Because drift is introduced in the enhancement layer decoding loop, a mechanism for periodically eliminating the drift may be needed. In addition to standard techniques such as I frames, periodic use of the INTRA_BL mode of spatial scalability can be employed for each enhancement layer macroblock; in this mode only information from the base layer is used for prediction (see, e.g., PCT/US06/28365). Since no temporal information is used, the drift for that particular macroblock is eliminated. If SR pictures are used, drift can also be eliminated by decoding all SR pictures at full resolution. Since SR pictures are far apart, there can still be a considerable gain in computational complexity. In some cases, the technique for deriving the intermediate-resolution signal may be modified by operating the enhancement layer decoder loop at reduced resolution. If CPU resources are not a limiting factor and switching faster than the SR separation is needed or desired, the same technique (i.e., operating the decoder loop at full resolution) can be applied to higher temporal levels (e.g., S0) as needed.
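The motion vector scaling step mentioned above can be illustrated with a minimal sketch. It assumes that motion vectors are stored as integer sub-pel units at full resolution and that scaling by the ratio of picture dimensions, with rounding half away from zero, is acceptable; the text does not specify the rounding rule, and the function name and picture sizes are hypothetical.

```python
from fractions import Fraction


def scale_motion_vector(mv, full_size, target_size):
    """Scale a motion vector (mvx, mvy), given in integer sub-pel units at
    full resolution, to the target resolution. Sizes are (width, height).
    Rounding is half away from zero (an assumption of this sketch)."""
    def scale(component, full, target):
        exact = Fraction(component * target, full)   # exact rational value
        sign = -1 if exact < 0 else 1
        return sign * int(abs(exact) + Fraction(1, 2))
    return (scale(mv[0], full_size[0], target_size[0]),
            scale(mv[1], full_size[1], target_size[1]))
```

For instance, halving both picture dimensions maps the vector (8, -6) to (4, -3). Exact rational arithmetic is used so that the rounding behavior does not depend on floating-point error.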

Another exemplary application of the inventive concealment technique is to video conferencing systems in which spatial or quality levels are achieved via simulcasting. In this case, concealment is performed using base layer information as described above. Enhancement layer drift can be eliminated via any one of a) threading, b) standard SVC temporal scalability, c) periodic I frames, and d) periodic intra macroblocks.

An SVCS/CSVCS that uses simulcasting to provide spatial scalability, and that transmits only the higher-resolution information of a particular stream to a particular destination (e.g., when it assumes that there are no or almost no errors), can replace a missing high-resolution frame with a low-resolution frame, anticipating such an error concealment mechanism at the decoder and relying on temporal scalability to eliminate drift as discussed above. It will be understood that the concealment process described can be readily adapted to provide effective rate control in such a system.

In the event that the SVCS, CSVCS, or encoder responsible for discarding higher-resolution frames, or for detecting their loss, cannot assume that the decoders receiving such frames are equipped with the concealment method described herein, such an entity can create a replacement high-resolution frame that achieves similar functionality by one of the following methods:

a) for error resilience in spatial scalability coding, create a synthetic frame, based on parsing of the lower-resolution frame, that includes only the appropriate signaling to use upsampled base layer information, without any additional residual or motion vector refinement;
b) for rate control in a system using spatial scalability, use the method described in (a) with the addition that some macroblocks (MBs) containing significant information from the original high-resolution frame are retained;
c) for an error-resilient system using simulcasting for spatial scalability, create a replacement high-resolution frame that includes synthetic MBs containing upsampled motion vectors and residual information;
d) for rate control in a system using simulcasting for spatial scalability, use the method described in (c) with the addition that some MBs containing significant information from the original high-resolution frame are retained.

In cases a) and b) above, the signaling to use only the upsampled version of the base layer picture can be performed either in-band, through the coded video bitstream, or through out-of-band information sent from the encoder or the SVCS/CSVCS to the receiving terminal. For in-band signaling, specific syntax elements must be present in the coded video bitstream to instruct the decoder to use only base layer information for some or all of the enhancement layer MBs. In the exemplary codec of the present invention described in U.S. Provisional Patent Application No. 60/862,510, which is based on the JD7 version of the SVC specification (see T. Wiegand, G. Sullivan, J. Reichel, H. Schwarz, and M. Wien, eds., "Joint Draft 7, Rev. 2: Scalable Video Coding," Joint Video Team, Doc. JVT-T201, Klagenfurt, July 2006, incorporated herein by reference in its entirety), a set of flags can be introduced in the slice header to indicate that, when macroblocks are not coded, a specific prediction mode that utilizes base layer data is to be used. By skipping all enhancement layer macroblocks, the encoder or the SVCS/CSVCS in effect removes the S1 or S2 frames, but replaces them with very small data packets that contain only the few bytes needed to indicate the default prediction mode and the fact that all macroblocks are skipped. Similarly, to perform rate control, the encoder or the SVCS/CSVCS can selectively remove some information from the enhancement layer MBs. For example, the encoder or the SVCS/CSVCS may selectively retain motion vector refinements but remove residual prediction, or retain residual prediction but remove motion vector refinements.

With continued reference to the SVC JD7 specification, the MB layer (in the scalable extension) contains several flags that are used to predict information from the base layer, if one exists: base_mode_flag, motion_prediction_flag, and residual_prediction_flag. Similarly, the slice header already contains a flag, adaptive_prediction_flag, that indicates the presence of base_mode_flag in the MB layer. To trigger the concealment operation, base_mode_flag needs to be set to 1 for every MB, which can be done using the already existing adaptive_prediction_flag. By setting the slice header flag adaptive_prediction_flag to 0, and taking into account that the default value of residual_prediction_flag for inter MBs is 1, it is possible to indicate that all MBs in a slice are skipped (using mb_skip_run or mb_skip_flag signaling) and thereby instruct the decoder to perform, in essence, the concealment operation disclosed herein.
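The signaling described in this paragraph can be summarized with a small sketch. It does not write an actual bitstream; `ConcealmentSlice` and `make_concealment_slice` are hypothetical names, and the sketch assumes one slice per picture, 16x16 macroblocks, and run-length skip signaling (mb_skip_run).

```python
from dataclasses import dataclass


@dataclass
class ConcealmentSlice:
    """Hypothetical description of a replacement enhancement layer slice."""
    adaptive_prediction_flag: int  # set to 0, as described in the text
    mb_skip_run: int               # number of consecutive skipped MBs


def make_concealment_slice(width: int, height: int) -> ConcealmentSlice:
    """Describe a slice in which every macroblock of the picture is skipped,
    so that the decoder falls back on base layer data for the whole picture."""
    mbs_wide = (width + 15) // 16   # 16x16 macroblocks, rounded up
    mbs_high = (height + 15) // 16
    return ConcealmentSlice(adaptive_prediction_flag=0,
                            mb_skip_run=mbs_wide * mbs_high)
```

For a CIF picture (352x288), the sketch yields a run of 396 skipped macroblocks, which is why the replacement packet needs only a few bytes.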

A potential drawback of the concealment technique is that, because S0 frames are typically very large (e.g., 45% of the total bandwidth), the bit rate of the coded stream without S1 and S2 frames can be very uneven, or "bursty." To mitigate this behavior, in a modification (hereinafter "progressive concealment"), the S0 packets may be transmitted by splitting them into smaller packets and/or slices and spreading their transmission over the time interval between successive S0 pictures. Although the entire S0 picture will not be available for the first S2 picture, the information received by the time of the first S2 picture (i.e., portions of S0 and all of L0 and L2) can be used for concealment purposes. In this manner, the decoder can also recover a proper reference frame in time to display the L1/S1 picture, which further helps in creating the decoded versions of both the L1/S1 picture and the second L2/S2 picture. Otherwise, being farther away from the L0 frame, these pictures may exhibit more concealment artifacts due to motion.
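The spreading of an S0 picture over the interval between successive S0 pictures can be sketched as a simple pacing schedule. The even spacing, the byte-based fragmentation, and the names `schedule_fragments`, `max_payload`, and `interval_ms` are assumptions of this sketch; the text only requires that the picture be split into smaller packets and/or slices whose transmission is spread over the interval.

```python
def schedule_fragments(picture_bytes: int, max_payload: int, interval_ms: float):
    """Split one S0 picture into fragments of at most max_payload bytes and
    spread their send times evenly over interval_ms.
    Returns a list of (send_time_ms, fragment_bytes) tuples."""
    if picture_bytes <= 0 or max_payload <= 0 or interval_ms <= 0:
        raise ValueError("all arguments must be positive")
    count = -(-picture_bytes // max_payload)  # ceiling division
    spacing = interval_ms / count             # time between fragments
    schedule = []
    remaining = picture_bytes
    for i in range(count):
        size = min(max_payload, remaining)
        schedule.append((i * spacing, size))
        remaining -= size
    return schedule
```

With a 5000-byte picture, a 1200-byte payload limit, and a 400 ms interval, the sketch produces five fragments sent 80 ms apart, instead of one burst at the start of the interval.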

Another alternative solution for mitigating the effects of bursty S0 transmission is to smooth the variable bit rate (VBR) traffic through additional buffering, at the cost of increased end-to-end delay. It is noted that in multipoint conferencing applications there is inherent statistical multiplexing at the server. The VBR behavior of the traffic originating from the server is therefore naturally smoothed.

International Patent Application No. PCT/US06/061815 describes the problems of error resilience and random access, and provides solutions suited to different application scenarios.

The progressive concealment technique described above provides a further solution for performing video switching. An exemplary switching application is to a single-loop, spatially scalable signal coded at QCIF and CIF resolutions with the three-layer threading structure shown in FIG. 7. As described in International Patent Application No. PCT/US06/061815, increased error resilience can be achieved by ensuring reliable transmission of some of the L0 pictures. The L0 pictures that are reliably transmitted are referred to as LR pictures. The same threading pattern can be extended to the S pictures, as shown in FIG. 10. The temporal prediction paths of the S pictures are identical to those of the L pictures. For purposes of illustration, FIG. 10 shows an exemplary SR period of 1/3 (one out of every three S0 pictures is an SR picture). In practice, different periods and different threading patterns can be used in accordance with the principles of the present invention. Further, different paths in the S and L pictures could also be used, but with a reduction in coding efficiency for the S pictures. As with the LR pictures, the SR pictures are assumed to be transmitted reliably. As described in International Patent Application No. PCT/US06/061815, this can be accomplished using a number of techniques, such as DiffServ coding (where LR and SR are in the HRC), FEC, or ARQ.

In an exemplary switching application of the progressive concealment technique, an end user at a terminal receiving the QCIF signal may wish to switch to the CIF signal. To be able to start decoding the enhancement layer CIF signal, the terminal must acquire at least one correct CIF reference picture. A technique disclosed in International Patent Application No. PCT/US06/061815 involves the use of periodic intra macroblocks, so that within a certain period of time all macroblocks of the CIF picture will have been intra coded. The drawback is that this takes a considerable amount of time if the percentage of intra macroblocks is kept low (to minimize their impact on the total bandwidth). In contrast, the switching application of the progressive concealment technique exploits the reliable transmission of the SR pictures in order to be able to start decoding the enhancement layer CIF signal.

The SR pictures can be transmitted to the receiver and decoded even when the receiver operates at the QCIF level. Since SR pictures are infrequent, their overall effect on the bit rate can be kept minimal. When the user switches to CIF resolution, the decoder can utilize the most recent SR frame and proceed as if the intermediate S pictures, up to the first S picture received, had been lost. If additional bit rate is available, the sender or server can also forward cached copies of all intervening S0 pictures to further help the receiver construct a reference picture that is as close as possible to the starting frame of CIF playback. The rate-distortion performance of the S1/S2 concealment technique ensures that the impact on quality is minimized.
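The switching step can be illustrated with a minimal sketch of the bookkeeping involved. It assumes that pictures are identified by consecutive integer indices and that the receiver has recorded which indices were SR pictures; `plan_switch` and its parameters are hypothetical names, not part of any described interface.

```python
def plan_switch(sr_indices, first_received_s):
    """Given the indices of the SR pictures already decoded and the index of
    the first S picture received after the switch, return a tuple of
    (index of the most recent SR picture, indices of the intermediate
    pictures to be treated as lost and concealed)."""
    earlier = [i for i in sr_indices if i <= first_received_s]
    if not earlier:
        raise ValueError("no SR picture available before the switch")
    anchor = max(earlier)  # most recent reliably received SR picture
    concealed = list(range(anchor + 1, first_received_s))
    return anchor, concealed
```

For example, with SR pictures at indices 0, 12, and 24 and a first received S picture at index 17, the decoder would start from picture 12 and conceal pictures 13 through 16.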

The inventive technique can also be used advantageously when an end user wishes to decode at an intermediate output resolution, e.g., HCIF, and then switch to CIF. The HCIF signal can be effectively derived from L0-L2 and a portion of the S0-S2 pictures (e.g., only S0), coupled with concealment for the dropped S frames. In this case, the decoder, which receives at least a portion of the S0 pictures, can switch immediately to CIF resolution with a very small PSNR penalty. Further, this penalty is eliminated as soon as the next S0/SR picture arrives. Thus, in this case there is practically no overhead and almost instantaneous switching can be achieved.

It is noted that typical spatial coding structures use a 1:4 picture area ratio, but that some users are more comfortable with resolution changes of 1:2. In practice, therefore, an HCIF-to-CIF switching transition is much more likely than a QCIF-to-CIF switching transition, for example in desktop communication applications. A common scenario in video conferencing is for the screen real estate to be divided into a larger picture of the active speaker surrounded by smaller pictures of the other participants, with the image of the active speaker automatically occupying the larger picture. When the smaller images are created using the rate control methods described herein, the participant images can be switched frequently in such an "active" layout without any overhead. This feature is desirable for accommodating both conference participants who prefer to view such an active layout and other conference participants who prefer a static view. Because the switching-by-concealment method does not require the encoder to send any additional information, the layout chosen by one receiver does not affect the bandwidth received by the others.

The foregoing description refers to creating efficient renditions at intermediate resolutions and bit rates that span the range between the resolutions/bit rates directly provided by the encoder. It will be understood that other known methods for reducing the bit rate (e.g., while introducing drift), such as data partitioning or requantization, can be used by the SVCS/CSVCS in conjunction with the inventive methods described herein to achieve finer-grained manipulation of the bitstream. For example, assume that a resolution of 1/3 CIF is desired when only QCIF and CIF are available, and that an SR, S0-S2 coding structure is used. Eliminating S1 and S2 may result in a bit rate that is too high to be used effectively for 1/3 CIF. Further, eliminating S0 may result in a bit rate that is too low and/or may be visually unacceptable because of motion-related artifacts. In that case, reducing the number of bits in the S0 frames using known methods such as data partitioning or requantization, in conjunction with SR transmission (either in VBR mode or using progressive concealment), may be useful for providing a more optimized result. It will be understood that these methods can also be applied to the S1 and S2 levels to achieve more finely tuned rate control.

Although the preferred embodiments described herein use the H.264 SVC draft standard, it will be apparent to those skilled in the art that the technique can be applied directly to any coding structure that allows multiple spatial/quality levels and temporal levels.

It will also be understood that, in accordance with the present invention, the scalable codecs and concealment techniques described herein can be implemented using any suitable combination of hardware and software. The software (i.e., instructions) for implementing and operating the aforementioned scalable codecs can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, online downloadable media, and other available media.

Claims

A video communication system comprising:
a communication network;
a conference server disposed in the network and linked to at least one receiving endpoint and at least one transmitting endpoint, each by at least one communication channel over the communication network;
at least one transmitting endpoint that transmits digital video coded using a scalable video coding format; and
at least one receiving endpoint capable of decoding a digital video signal coded in a scalable video coding format that supports temporal scalability and at least one of spatial scalability and quality scalability, wherein the scalable video coding format includes a base spatial layer and at least one spatial enhancement layer for spatial scalability, a base quality layer and at least one quality enhancement layer for quality scalability, and a base temporal layer and at least one temporal enhancement layer for temporal scalability, the base temporal layer and the temporal enhancement layer being linked to each other by a threaded picture prediction structure for at least one of the spatial scalability layers or the quality scalability layers,
wherein the conference server is configured to selectively remove or modify, before creating an output video signal to be forwarded to the at least one receiving endpoint, portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer, such that the use of lower spatial layer data or quality layer data in decoding pictures at a resolution higher than that of the base spatial layer or base quality layer is signaled or explicitly coded in the output video signal.

The scalable video coding format is based on hybrid coding such as H.264 standard, VC-1 standard, or AVS standard, for use within an output video signal transferred to the at least one receiving endpoint. The lower spatial layer data or quality layer data that is signaled or explicitly coded is
Motion vector data,
The difference between the coded prediction errors and
Intra data and
Including at least one of a reference picture indicator and
The lower spatial layer data or quality layer data is further appropriately scaled to a desired target resolution when explicitly coded in an output video signal transmitted to one or more receiving endpoints. The system described in.

The conference server further includes:
A transcoding Multipoint Control Unit using cascaded decoding and encoding; and
A switching multipoint communication device by selecting an input to be transmitted as output;
A scalable video communication server using selective multiplexing;
2. The output of the output video signal forwarded to the at least one receiving endpoint as one of a combined scalable video communication server using selective multiplexing and bitstream level combining. The system described.

The system of claim 1, wherein an encoder of the at least one transmitting endpoint is configured to encode the transmitted media as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames ("R" frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that, after a packet loss or error, a decoder of the at least one receiving endpoint can decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder, and wherein the conference server, before creating the output video signal that is forwarded to the at least one receiving endpoint, selectively removes the portions corresponding to layers higher than the base spatial layer or base quality layer only for non-R frames of the input video signal received from the transmitting endpoint.

The system of claim 1, wherein the conference server is further configured to control the transmission rate of the output video signal transferred to the at least one receiving endpoint, so that the retained portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The system of claim 1, wherein the selectively removing or modifying by the conference server is performed according to a desired output bit rate requirement.
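
One possible way to drive the removal from a desired output bit rate is sketched below. The greedy policy, the per-interval bit budget, and the `(size_bits, droppable)` tuple layout are all assumptions of this sketch rather than anything recited in the claims.

```python
def select_for_bit_budget(packets, budget_bits):
    """packets: list of (size_bits, droppable) in transmission order.
    Non-droppable packets (base layer, R frames) are always kept; droppable
    packets (enhancement data of non-R frames) are kept only while they fit."""
    mandatory = sum(size for size, droppable in packets if not droppable)
    remaining = budget_bits - mandatory  # may be negative: then nothing droppable fits
    kept = []
    for size, droppable in packets:
        if not droppable:
            kept.append((size, droppable))
        elif size <= remaining:
            kept.append((size, droppable))
            remaining -= size
    return kept


interval = [(1000, False), (400, True), (300, False), (500, True), (200, True)]
kept = select_for_bit_budget(interval, budget_bits=1900)
print(sum(size for size, _ in kept))
```

A smoother output rate could be obtained by choosing the budget per short interval instead of per picture, which is again a design assumption.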

The system of claim 1, wherein the at least one receiving endpoint is configured to display an output picture decoded at a desired spatial resolution that lies between the immediately lower spatial layer and the immediately higher spatial layer provided by the received coded video signal.
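
As a rough sketch of how a receiver might locate the two coded spatial layers that bracket a desired display resolution, consider the following. Describing each layer by its picture width alone, and the function name, are simplifying assumptions of this sketch.

```python
import bisect


def bracketing_layers(layer_widths, desired_width):
    """Return (lower, upper) coded layer widths that enclose desired_width.
    If the desired width matches a coded layer, both entries are that layer."""
    widths = sorted(layer_widths)
    i = bisect.bisect_left(widths, desired_width)
    if i < len(widths) and widths[i] == desired_width:
        return (widths[i], widths[i])
    if i == 0 or i == len(widths):
        raise ValueError("desired width is outside the coded layer range")
    return (widths[i - 1], widths[i])


# A 264-pixel-wide window falls between the QCIF and CIF layers.
print(bracketing_layers([176, 352, 704], 264))
```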

The system of claim 7, wherein the at least one receiving endpoint is further configured to operate the decoding loop of the immediately higher spatial layer at the desired spatial resolution by scaling all the coded data of the immediately higher spatial layer to the desired spatial resolution, and the resulting drift is removed by using at least one of:
periodic intra pictures,
periodic use of intra base layer mode, and
full-resolution decoding of at least the lowest temporal layer of the immediately higher spatial layer.

The system of claim 1, wherein the scalable video coding format is configured with at least one of:
periodic intra pictures,
periodic intra macroblocks, and
threaded picture prediction,
in order to avoid drift when the coded information of a spatial layer or quality layer above the base that is changed or removed corresponds to the base temporal layer.

The system of claim 1, wherein the at least one receiving endpoint is further configured to operate, at least for the base temporal layer, at least one decoding loop for a spatial layer or quality layer above the target spatial layer or quality layer, so that when the at least one receiving endpoint switches target layers it can immediately display decoded pictures at the new target layer resolution.

A video communication system comprising:
a communication network;
one transmitting endpoint that transmits digital video coded using a scalable video coding format; and
at least one receiving endpoint capable of decoding a digital video signal coded in a scalable video coding format that supports temporal scalability and at least one of spatial scalability and quality scalability,
wherein the scalable video coding format includes a base spatial layer and at least one spatial enhancement layer for spatial scalability, a base quality layer and at least one quality enhancement layer for quality scalability, and a base temporal layer and at least one temporal enhancement layer for temporal scalability, the base temporal layer and the temporal enhancement layers being linked to each other by a threaded picture prediction structure for at least one of the spatial scalability layers or quality scalability layers, and
wherein the transmitting endpoint is configured to selectively remove or change, before creating the output video signal to be transferred to the at least one receiving endpoint, the portions of the coded video signal of the transmitting endpoint that correspond to layers higher than the base spatial layer or quality layer, such that the use of lower spatial layer data or quality layer data in decoding a picture at a resolution higher than the base spatial layer or base quality layer is signaled, or that data is explicitly coded, in the output video signal.

The scalable video coding format is based on hybrid coding, such as the H.264 standard, the VC-1 standard, or the AVS standard, and the lower spatial layer data or quality layer data that is signaled or explicitly coded for use within an output video signal transferred to the at least one receiving endpoint includes at least one of:
motion vector data,
coded prediction error differences,
intra data, and
reference picture indicators.
12. The lower spatial layer data or quality layer data is further appropriately scaled to a desired target resolution when explicitly coded in an output video signal transmitted to one or more receiving endpoints. The system described in.

The system of claim 11, wherein the transmitting endpoint is configured to encode the media to be transmitted as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames (“R” frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that the at least one receiving endpoint decoder can, after a packet loss or error, decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder at the transmitting endpoint, and wherein the transmitting endpoint, before creating the output video signal to be transmitted to the at least one receiving endpoint, selectively removes from the input video signal of the transmitting endpoint the portions of only non-R frames that correspond to layers higher than the base spatial layer or base quality layer.

The system of claim 11, further configured to control the transmission rate of the output video signal transmitted to the at least one receiving endpoint, so that the retained portions of the input video signal of the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The system of claim 11, wherein the selectively removing or changing by the transmitting endpoint is performed according to a desired output bit rate requirement.

The system of claim 11, wherein the at least one receiving endpoint is configured to display an output picture decoded at a desired spatial resolution that lies between the immediately lower spatial layer and the immediately higher spatial layer provided by the received coded video signal.

The system of claim 11, wherein the at least one receiving endpoint is further configured to operate the decoding loop of the immediately higher spatial layer at the desired spatial resolution by scaling all coded data of the immediately higher spatial layer to the desired spatial resolution, and the resulting drift is removed by using at least one of:
periodic intra pictures,
periodic use of intra base layer mode, and
full-resolution decoding of at least the lowest temporal layer of the immediately higher spatial layer.

The system of claim 11, wherein the scalable video coding format is configured with at least one of:
periodic intra pictures,
periodic intra macroblocks, and
threaded picture prediction,
in order to avoid drift when the coded information of a spatial layer or quality layer above the base that is changed or removed corresponds to the base temporal layer.

The system of claim 11, wherein the at least one receiving endpoint is further configured to operate, at least for the base temporal layer, at least one decoding loop for a spatial layer or quality layer above the target spatial layer or quality layer, so that when the at least one receiving endpoint switches target layers it can immediately display decoded pictures at the new target layer resolution.

A method of decoding a digital video signal, wherein the digital video signal is coded in a scalable video coding format that supports temporal scalability and at least one of spatial scalability and quality scalability,
wherein the scalable video coding format includes a base spatial layer and at least one spatial enhancement layer for spatial scalability, a base quality layer and at least one quality enhancement layer for quality scalability, and a base temporal layer and at least one temporal enhancement layer for temporal scalability, the base temporal layer and the temporal enhancement layers being linked together by a threaded picture prediction structure for at least one of the spatial scalability layers or quality scalability layers,
the method comprising:
receiving the digital video signal at a decoder; and
when part of the coded information of a target spatial layer or target quality layer is lost or unavailable for decoding a picture in the target spatial layer or target quality layer above the corresponding base layer, using coded information from a spatial layer or quality layer below the target spatial layer or target quality layer in the threaded prediction structure.
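
The fallback to lower-layer coded information can be illustrated with a toy sketch. Representing a block as a list of rows, assuming a fixed 2x resolution ratio, and upsampling by nearest-neighbor repetition are all simplifications introduced here; an actual codec would use the interpolation filters its standard defines.

```python
def upsample_2x(block):
    """Nearest-neighbor 2x upsampling of a block given as a list of rows.
    Stands in for the (unspecified) upsampling of lower-layer data."""
    out = []
    for row in block:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def data_for_target_layer(target_block, lower_block):
    """Use the target-layer block when it arrived; otherwise substitute
    lower-layer data scaled to the target resolution."""
    if target_block is not None:
        return target_block
    return upsample_2x(lower_block)


# The target-layer data was lost (None), so the lower layer is upsampled instead.
print(data_for_target_layer(None, [[1, 2], [3, 4]]))
```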

The method of claim 20, wherein:
the decoder is located in a receiving endpoint linked to a communication network;
a conference server is linked to the receiving endpoint and to at least one transmitting endpoint, each by at least one communication channel on the communication network; and
the at least one transmitting endpoint transmits digital video coded in the scalable video coding format,
the method further comprising, at the conference server, before creating the output video signal to be transferred to the receiving endpoint, selectively removing the portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or quality layer.

The method of claim 21, wherein the conference server linked to the receiving endpoint and the at least one transmitting endpoint is one of:
a transcoding multipoint control unit using cascaded decoding and encoding,
a switching multipoint control unit that selects which input to transmit as output,
a scalable video communication server using selective multiplexing, and
a combining scalable video communication server using selective multiplexing and bitstream-level combining.

The method of claim 21, further comprising encoding, at the at least one transmitting endpoint encoder, the transmitted media as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames (“R” frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that the decoder can, after a packet loss or error, decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder, and further comprising, at the conference server, before creating the output video signal to be transferred to the receiving endpoint, selectively removing from the input video signal received from the transmitting endpoint the portions of only non-R frames that correspond to layers higher than the base spatial layer or base quality layer.

The method of claim 21, further comprising controlling, at the conference server, the transmission rate of the output video signal transferred to at least one receiving endpoint, so that the retained portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The method of claim 21, wherein the selectively removing by the conference server is performed according to a desired output bit rate requirement.

The method of claim 21, wherein a transmitting endpoint transmits digital video coded using the scalable video coding format and the communication network links the transmitting endpoint to the receiving endpoint, the method further comprising, at the transmitting endpoint, before creating the output video signal to be transmitted to at least one receiving endpoint, selectively not transmitting the portions of the input video signal at the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer, in order to achieve a desired output bit rate.

The method, further comprising encoding, at the transmitting endpoint, the media to be transmitted as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames (“R” frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that the decoder can, after a packet loss or error, decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder, and wherein the encoder at the transmitting endpoint selectively does not transmit to at least one receiving endpoint the portions of only non-R frames of the input video signal at the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer.

The method of claim 26, further comprising controlling, at the transmitting endpoint, the transmission rate of the output video signal transferred to at least one receiving endpoint, so that the retained portions of the input video signal of the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The method of claim 26, wherein the selective transmission decision by the transmitting endpoint is made according to a desired output bit rate requirement.

The method of claim 20, further comprising displaying, at the decoder, an output picture decoded at a desired spatial resolution that lies between the immediately lower spatial layer and the immediately higher spatial layer provided by the coded video signal.

The method of claim 20, further comprising operating, at the decoder, the decoding loop of the immediately higher spatial layer at the desired spatial resolution by scaling all coded data of the immediately higher spatial layer to the desired spatial resolution, wherein the resulting drift is removed by using at least one of:
periodic intra pictures,
periodic use of intra base layer mode, and
full-resolution decoding of at least the lowest temporal layer of the immediately higher spatial layer.

The method of claim 20, wherein the scalable video coding format further comprises at least one of:
periodic intra pictures,
periodic intra macroblocks, and
threaded picture prediction,
in order to avoid drift when the lost or unavailable coded information of the target spatial layer or target quality layer corresponds to the base temporal layer.

The method of claim 20, wherein the scalable video coding format is based on hybrid coding, such as the H.264 standard, the VC-1 standard, or the AVS standard, and the coded information from a spatial layer or quality layer below the target spatial layer or target quality layer that is used by the decoder when some or all of the coded information of the target spatial layer or target quality layer is lost or unavailable includes at least one of:
motion vector data appropriately scaled to the resolution of the target spatial layer or target quality layer,
coded prediction error differences upsampled to the resolution of the target spatial layer or target quality layer, and
intra data upsampled to the resolution of the target spatial layer or target quality layer,
the method further comprising using, at the decoder, the decoded pictures of the target spatial layer or target quality layer, instead of the decoded reference pictures of the lower layer, as references in the decoding process to construct the decoded output picture.

The method of claim 20, further comprising operating, at the decoder, at least for the base temporal layer, at least one decoding loop for a spatial layer or quality layer above the target spatial layer or quality layer, so that when the decoder switches target spatial layers or target quality layers it can immediately display decoded pictures at the resolution of the new target spatial layer or target quality layer.

A method of video communication on a communication network in which a conference server arranged in the network is linked to at least one receiving endpoint and at least one transmitting endpoint, each by at least one communication channel on the communication network, wherein the at least one transmitting endpoint transmits digital video coded using a scalable video coding format, and the at least one receiving endpoint is capable of decoding a digital video signal coded in a scalable video coding format that supports temporal scalability and at least one of spatial scalability and quality scalability, wherein the scalable video coding format includes a base spatial layer and at least one spatial enhancement layer for spatial scalability, a base quality layer and at least one quality enhancement layer for quality scalability, and a base temporal layer and at least one temporal enhancement layer for temporal scalability, the base temporal layer and the temporal enhancement layers being linked together by a threaded picture prediction structure for at least one of the spatial scalability layers or quality scalability layers,
the method comprising:
at the conference server, before creating the output video signal to be transferred to the at least one receiving endpoint, selectively removing or changing the portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or quality layer, such that the use of lower spatial layer data or quality layer data in decoding a picture at a resolution higher than the base spatial layer or base quality layer is signaled, or that data is explicitly coded, in the output video signal.

The scalable video coding format is based on hybrid coding, such as the H.264 standard, the VC-1 standard, or the AVS standard, and the lower spatial layer data or quality layer data that is signaled or explicitly coded for use within an output video signal transferred to the at least one receiving endpoint includes at least one of:
motion vector data,
coded prediction error differences,
intra data, and
reference picture indicators.
36. The lower spatial layer data or quality layer data is further appropriately scaled to a desired target resolution when explicitly coded in an output video signal transmitted to one or more receiving endpoints. The method described in 1.

The method of claim 35, wherein the conference server is further configured to create the output video signal that is forwarded to the at least one receiving endpoint as one of:
a transcoding multipoint control unit using cascaded decoding and encoding,
a switching multipoint control unit that selects which input to transmit as output,
a scalable video communication server using selective multiplexing, and
a combining scalable video communication server using selective multiplexing and bitstream-level combining.

The method of claim 35, further comprising encoding, at the at least one transmitting endpoint encoder, the transmitted media as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames (“R” frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that the at least one receiving endpoint decoder can, after a packet loss or error, decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder, and further comprising, at the conference server, before creating the output video signal that is forwarded to the at least one receiving endpoint, selectively removing or changing the portions of only non-R frames of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer.

The method of claim 35, further comprising controlling, at the conference server, the transmission rate of the output video signal transferred to at least one receiving endpoint, so that the retained portions of the input video signal received from the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The method of claim 35, further comprising performing, at the conference server, the selective removing or changing according to a desired output bit rate requirement.

The method of claim 35, further comprising displaying, at the at least one receiving endpoint, an output picture decoded at a desired spatial resolution that lies between the immediately lower spatial layer and the immediately higher spatial layer provided by the received coded video signal.

The method of claim 41, further comprising operating, at the at least one receiving endpoint, the decoding loop of the immediately higher spatial layer at the desired spatial resolution by scaling all coded data of the immediately higher spatial layer to the desired spatial resolution, wherein the resulting drift is removed by using at least one of:
periodic intra pictures,
periodic use of intra base layer mode, and
full-resolution decoding of at least the lowest temporal layer of the immediately higher spatial layer.

The method of claim 35, wherein the scalable video coding format is configured using at least one of:
periodic intra pictures,
periodic intra macroblocks, and
threaded picture prediction,
in order to avoid drift when the coded information of a spatial layer or quality layer above the base that is changed or removed corresponds to the base temporal layer.

The method of claim 35, further comprising operating, at least for the base temporal layer, at least one decoding loop for a spatial layer or quality layer above the target spatial layer or quality layer, so that when the at least one receiving endpoint switches target layers it can immediately display decoded pictures at the new target layer resolution.

A video communication method for a system including:
a communication network;
one transmitting endpoint that transmits digital video coded using a scalable video coding format; and
at least one receiving endpoint capable of decoding a digital video signal coded in a scalable video coding format that supports temporal scalability and at least one of spatial scalability and quality scalability,
wherein the scalable video coding format includes a base spatial layer and at least one spatial enhancement layer for spatial scalability, a base quality layer and at least one quality enhancement layer for quality scalability, and a base temporal layer and at least one temporal enhancement layer for temporal scalability, the base temporal layer and the temporal enhancement layers being linked together by a threaded picture prediction structure for at least one of the spatial scalability layers or quality scalability layers,
the method comprising, at the transmitting endpoint, before creating the output video signal to be transferred to the at least one receiving endpoint, selectively removing or changing the portions of the coded video signal of the transmitting endpoint that correspond to layers higher than the base spatial layer or quality layer, such that the use of lower spatial layer data or quality layer data in decoding a picture at a resolution higher than the base spatial layer or base quality layer is signaled, or that data is explicitly coded, in the output video signal.

The scalable video coding format is based on hybrid coding, such as the H.264 standard, the VC-1 standard, or the AVS standard, and the lower spatial layer data or quality layer data that is signaled or explicitly coded for use within an output video signal transferred to the at least one receiving endpoint includes at least one of:
motion vector data,
coded prediction error differences,
intra data, and
reference picture indicators.
46. The lower spatial layer data or quality layer data is further appropriately scaled to a desired target resolution when explicitly coded in an output video signal transmitted to one or more receiving endpoints. The method described in 1.

The method of claim 45, further comprising encoding, at the transmitting endpoint, the media to be transmitted as frames in a threaded coding structure having a plurality of different temporal levels, wherein a subset of the frames (“R” frames) is specifically selected for reliable transport and includes at least the frames of the lowest temporal layer in the threaded coding structure, such that the at least one receiving endpoint decoder can, after a packet loss or error, decode at least a portion of the received media based on the reliably received R frames and thereafter be synchronized with the encoder at the transmitting endpoint, and further comprising, at the transmitting endpoint, before creating the output video signal to be transmitted to the at least one receiving endpoint, selectively removing or changing the portions of only non-R frames of the input video signal at the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer.

The method of claim 45, further comprising controlling, at the transmitting endpoint, the transmission rate of the output video signal transmitted to at least one receiving endpoint, so that the retained portions of the input video signal of the transmitting endpoint that correspond to layers higher than the base spatial layer or base quality layer do not adversely affect the smoothness of the output bit rate.

The method of claim 45, further comprising performing the selective removal or change at the transmitting endpoint according to a desired output bit rate requirement.

The method of claim 45, further comprising displaying, at the at least one receiving endpoint, an output picture decoded at a desired spatial resolution that lies between the immediately lower spatial layer and the immediately higher spatial layer provided by the received coded video signal.

The method of claim 50, further comprising operating, at the at least one receiving endpoint, the decoding loop of the immediately higher spatial layer at the desired spatial resolution by scaling all coded data of the immediately higher spatial layer to the desired spatial resolution, wherein the resulting drift is removed by using at least one of:
periodic intra pictures,
periodic use of intra base layer mode, and
full-resolution decoding of at least the lowest temporal layer of the immediately higher spatial layer.

The method of claim 45, wherein the scalable video coding format is configured using at least one of:
periodic intra pictures,
periodic intra macroblocks, and
threaded picture prediction,
in order to avoid drift when the coded information of a spatial layer or quality layer above the base that is changed or removed corresponds to the base temporal layer.

The method of claim 45, further comprising operating, at least for the base temporal layer, at least one decoding loop for a spatial layer or quality layer above the target spatial layer or quality layer, so that when the at least one receiving endpoint switches target layers it can immediately display decoded pictures at the new target layer resolution.

54. A computer readable medium comprising a set of instructions for performing the steps of at least one of the methods of claims 21-53.