JP2009508454A

JP2009508454A - Scalable low-latency video conferencing system and method using scalable video coding

Info

Publication number: JP2009508454A
Application number: JP2008544319A
Authority: JP
Inventors: シヴァンラール，レハ; エレフゼリアディス，アレクサンドロス; ホン，ダニー; シャピロ，オファー
Original assignee: ヴィドヨ，インコーポレーテッド
Priority date: 2005-09-07
Filing date: 2006-07-21
Publication date: 2009-02-26
Also published as: EP1952631A4; JP2013141284A; WO2008060262A1; EP1952631A1

Abstract

A scalable video codec is provided for use in videoconferencing systems and applications hosted on heterogeneous endpoints/receivers and network environments. The scalable video codec provides a coded representation of a source video signal at multiple temporal, quality, and spatial resolutions.

Description

The present invention relates to multimedia and telecommunications technology. In particular, the invention relates to systems and methods for videoconferencing between user endpoints that have diverse access equipment or terminals, and over non-homogeneous network links.

(Cross-reference to related applications)
This application claims the benefit of priority of U.S. Provisional Patent Application No. 60/701,108, filed July 20, 2005; No. 60/714,741, filed September 7, 2005; and No. 60/723,392, filed October 4, 2005. In addition, this application is related to co-filed U.S. patent application Nos. [SVCSystem], [Trunk], and [Jitter]. All of the aforementioned priority and related applications are incorporated herein by reference in their entirety.

Videoconferencing systems allow two or more remote participants/endpoints to communicate video and audio with each other in real time. When only two remote participants are involved, direct transmission of communications over a suitable electronic network between the two endpoints can be used. When more than two participants/endpoints are involved, a multipoint conferencing unit (MCU), or bridge, is commonly used to connect all of the participants/endpoints. The MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration.

For a videoconference, the participants/endpoints or terminals are equipped with suitable encoding and decoding devices. An encoder formats the local audio and video output at a transmitting endpoint into a coded form suitable for signal transmission over the electronic communication network. A decoder, in contrast, processes a received signal carrying encoded audio and video information into a decoded form suitable for audio playback and image display at the receiving endpoint.

Traditionally, an end user's own image is also displayed on his or her screen to provide feedback (e.g., to ensure proper positioning of the person within the video window).

In practical videoconferencing system implementations over communication networks, the quality of an interactive videoconference between remote participants is determined by end-to-end signal delays. End-to-end delays greater than 200 ms prevent realistic live or natural interaction between the conference participants. Such long end-to-end delays cause participants to unnaturally restrain themselves from actively participating or responding, so that in-transit video and audio data from other participants can arrive at their endpoints.

The end-to-end signal delays include acquisition delays (e.g., the time it takes to fill a buffer in an A/D converter), coding delays, transmission delays (the time it takes to submit a packet filled with data to the network interface controller of an endpoint), and transport delays (the time a packet takes to travel through the communication network from endpoint to endpoint). In addition, the signal processing time through a mediating MCU contributes to the total end-to-end delay in a given system.
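
As a rough illustration of how these components accumulate, the following sketch sums a hypothetical delay budget and compares it with the 200 ms figure cited above. The function name and every millisecond value are assumptions chosen for illustration; none of them comes from this specification.

```python
# Hypothetical delay budget; every value below is an illustrative assumption.
INTERACTIVE_LIMIT_MS = 200  # threshold cited in the text for natural interaction


def end_to_end_delay_ms(components):
    """Sum the per-stage delays (in milliseconds) along one media path."""
    return sum(components.values())


budget = {
    "acquisition": 33,     # e.g. filling an A/D converter buffer
    "encoding": 40,        # coding delay at the sender
    "transmission": 5,     # submitting a packet to the NIC
    "transport": 60,       # travel through the network
    "mcu_processing": 80,  # signal processing in a mediating MCU
}

total = end_to_end_delay_ms(budget)
within_limit = total <= INTERACTIVE_LIMIT_MS
print(total, within_limit)
```

With these assumed numbers the MCU stage alone pushes the path past the limit, which is the kind of penalty the text attributes to MCU signal processing.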

An MCU's primary tasks are to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix the video frames or pictures transmitted by the individual participants/endpoints into a common composite video frame stream that includes a picture of each participant. Note that the terms frame and picture are used interchangeably herein, and that coding of interlaced frames as individual fields or as combined frames (field-based or frame-based picture coding) can be incorporated, as is obvious to persons skilled in the art. The MCUs deployed in conventional communication network systems offer only a single common resolution (e.g., CIF or QCIF resolution) for all of the individual pictures mixed into the common composite video frame that is distributed to all participants in a videoconferencing session. Thus, conventional communication network systems do not readily provide customized videoconferencing functionality by which a participant can view other participants at different resolutions. Such desirable functionality would allow a participant, for example, to view one particular participant (e.g., the speaking participant) at CIF resolution and the other, silent participants at QCIF resolution. An MCU can be configured to provide this functionality by repeating the video mixing operation as many times as there are participants in the videoconference. In such configurations, however, the MCU operations introduce considerable end-to-end delay. Further, the MCU must have sufficient digital signal processing capability to decode multiple audio streams, mix them, and re-encode them, and also to decode multiple video streams, composite them into a single frame (with appropriate scaling as needed), and re-encode them into a single stream. Videoconferencing solutions (such as the systems marketed by Polycom Inc., 4750 Willow Road, Pleasanton, CA 94588, and Tandberg, 200 Park Avenue, New York, NY 10166) must use dedicated hardware components to provide acceptable quality and performance levels.

The performance level of a videoconferencing solution, and the quality it delivers, also depend strongly on the underlying communication network over which the videoconference operates. Videoconferencing solutions that use ITU H.261, H.263, and H.264 standard video codecs require a robust communication channel with little or no loss to deliver acceptable quality. The required channel transmission speeds or bit rates can range from 64 Kbps up to several Mbps. Early videoconferencing solutions used dedicated ISDN lines, and newer systems often use high-speed Internet connections (e.g., fractional T1, T2, T3, etc.) for high-speed transmission. Further, some videoconferencing solutions exploit Internet Protocol ("IP") communications, but these are implemented in a private network environment to ensure bandwidth availability. In any case, conventional videoconferencing solutions incur substantial costs associated with implementing and maintaining the dedicated high-speed networking infrastructure needed for quality transmission.

The costs of implementing and maintaining a dedicated videoconferencing network are avoided by recent "desktop videoconferencing" systems, which exploit high-bandwidth corporate data network connections (e.g., 100 Mbit Ethernet). In these desktop videoconferencing solutions, common personal computers (PCs) equipped with USB-based digital video cameras and with appropriate software applications for encoding/decoding and network transmission are used as the participant/endpoint terminals.

Recent advances in multimedia and telecommunications technology involve integrating video communication and conferencing capabilities with Internet Protocol ("IP") communication systems such as IP PBX, instant messaging, and web conferencing. To integrate videoconferencing effectively into such systems, both point-to-point and multipoint communications must be supported. However, the available network bandwidth in IP communication systems can fluctuate widely (e.g., depending on the time of day and the overall network load), which can make these systems unreliable for the high-bandwidth transmissions that video communications require. Further, videoconferencing solutions implemented on IP communication systems must accommodate both the network channel heterogeneity and the endpoint equipment diversity associated with Internet systems. For example, participants may access videoconferencing services over IP channels with very different bandwidths (e.g., DSL vs. Ethernet), using a wide variety of personal computing devices.

The communication networks on which videophone solutions are implemented can be categorized as providing two basic communication channel architectures. In one basic architecture, a guaranteed quality of service (QoS) channel is provided via a dedicated direct or switched connection between two points (e.g., ISDN connections, T1 lines, and the like). In contrast, in the second basic architecture, the communication channels do not guarantee QoS but are only "best-effort" packet delivery channels, such as those used in Internet Protocol (IP)-based networks (e.g., Ethernet LANs).

Implementing videoconferencing solutions on IP-based networks may be desirable, at least because of the low cost, the high total bandwidth, and the widespread availability of access to the Internet. As noted previously, IP-based networks typically operate on a best-effort basis; that is, there is no guarantee that packets will reach their destination or that they will arrive in the order in which they were transmitted. However, techniques have been developed to provide different levels of quality of service (QoS) over putatively best-effort channels. These techniques may include protocols such as DiffServ, for specifying and controlling network traffic by class so that certain types of traffic get precedence, and RSVP. These protocols can ensure a certain bandwidth and/or certain delays for portions of the available bandwidth. Techniques such as forward error correction (FEC) and automatic repeat request (ARQ) mechanisms may also be used to improve recovery from lost packet transmissions and to mitigate the effects of packet loss.

Implementing videoconferencing solutions on IP-based networks requires consideration of the video codecs used. Standard video codecs, such as the standard H.261 and H.263 codecs designated for videoconferencing, and the standard MPEG-1 and MPEG-2 Main Profile codecs designated for Video CDs and DVDs, respectively, are designed to provide a single bitstream ("single-layer") at a fixed bit rate. Some of these codecs may be deployed without rate control to provide a variable bit rate stream (e.g., MPEG-2, as used in DVDs). In practice, however, even without rate control, a target operating bit rate is established depending on the specific infrastructure. These video codec designs are based on the assumption that the network can provide a constant bit rate and a practically error-free channel between the sender and the receiver. The H-series standard codecs, which are designed specifically for person-to-person communication applications, offer some additional features that increase robustness in the presence of channel errors, but they tolerate only a very small percentage of packet losses (typically no more than 2-3%).

Further, the standard video codecs are based on "single-layer" coding techniques, which are inherently incapable of exploiting the differentiated QoS capabilities provided by modern communication networks. An additional limitation of single-layer coding techniques for video communications is that, even if a lower spatial resolution display is required or desired in an application, the full-resolution signal must be received and decoded, and downscaling must be performed, at the receiving endpoint or MCU. This wastes bandwidth and computational resources.

In contrast to the aforementioned single-layer video codecs, "scalable" video codecs based on "multi-layer" coding techniques generate two or more bitstreams for a given source video signal: a base layer and one or more enhancement layers. The base layer may be a basic representation of the source signal at a minimum quality level. The minimum quality representation may be reduced in the SNR (quality), spatial, or temporal resolution aspects of the given source video signal, or in a combination of these aspects. The one or more enhancement layers correspond to information for increasing the SNR (quality), spatial, or temporal resolution aspects of the base layer. Scalable video codecs have been developed with heterogeneous network environments and/or heterogeneous receivers in mind. The base layer can be transmitted over a reliable channel, i.e., a channel with guaranteed quality of service (QoS). Enhancement layers can be transmitted with reduced QoS or with no QoS. The effect is that recipients are guaranteed to receive a signal with at least a minimum level of quality (the base layer signal). Similarly, with heterogeneous receivers that may have different screen sizes, a small picture size signal can be transmitted to, for example, a portable device, and a full-size picture can be transmitted to a system equipped with a large display.
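
The channel assignment described above can be sketched as follows. This is a minimal illustration, not the codec of this specification: the `Layer` type, the channel labels, the bit rates, and the assumption that each enhancement layer depends on all layers before it are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    is_base: bool
    kbps: int  # hypothetical bit rate


def assign_channels(layers):
    """Send the base layer over a guaranteed-QoS channel, the rest best-effort."""
    return {l.name: ("guaranteed" if l.is_base else "best_effort") for l in layers}


def usable_layers(layers, received_names):
    """Return the layers a receiver can decode.

    Assumes (for illustration) a linear dependency chain: a layer is usable
    only if every layer before it in the list was also received.
    """
    usable = []
    for layer in layers:
        if layer.name not in received_names:
            break
        usable.append(layer.name)
    return usable


stream = [
    Layer("base", True, 128),
    Layer("enh1", False, 256),
    Layer("enh2", False, 512),
]
channels = assign_channels(stream)
# "enh1" was lost on the best-effort channel, so "enh2" cannot be used either.
decoded = usable_layers(stream, {"base", "enh2"})
print(channels, decoded)
```

Because the base layer travels on the guaranteed channel in this sketch, the receiver always retains at least the minimum-quality picture even when best-effort packets are lost.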

Standards such as MPEG-2 specify a number of techniques for performing scalable coding. However, practical use of "scalable" video codecs has been hampered by the increased cost and complexity associated with scalable coding, and by the lack of widespread availability of high-bandwidth IP-based communication channels suitable for video.

Consideration is now being given to developing improved scalable codec solutions for videoconferencing and other applications. Desirable scalable codec solutions will offer improved scalability in bandwidth, temporal resolution, spatial quality, spatial resolution, and computational power. Attention is directed, in particular, to developing scalable video codecs that are consistent with simplified MCU architectures for versatile videoconferencing applications. Desirable scalable codec solutions will enable zero-delay MCU architectures that allow MCUs to be cascaded in electronic networks with minimal or no end-to-end delay penalty.

The present invention provides scalable video coding (SVC) systems and methods (collectively, "solutions") for point-to-point and multipoint conferencing applications. The SVC solutions provide a coded, "layered" representation of a source video signal at multiple temporal, quality, and spatial resolutions. These representations are carried by distinct layer/bitstream components created by the endpoint/terminal encoders.

The SVC solutions are designed to accommodate diversity in endpoint/receiver devices and diversity in heterogeneous network characteristics, including, for example, the best-effort nature of networks such as those based on the Internet Protocol. The scalable aspects of the video coding techniques employed allow conferencing applications to adapt to different network conditions and to accommodate different end-user requirements (e.g., a user may elect to view another user at a high or a low spatial resolution).

The scalable video codec designs allow error-resilient transmission of video in point-to-point and multipoint scenarios, and allow a conferencing bridge to provide continuous presence, rate matching, error localization, random entry, and personal layout conferencing features without decoding or re-encoding in-transit video streams and without any decrease in the error resilience of the streams.

An endpoint terminal designed for video communication with other endpoints includes video encoders/decoders that can encode a video signal into one or more layers of a multi-layer scalable video format for transmission. The video encoders/decoders can correspondingly decode, simultaneously or sequentially, video signal layers received in as many video streams as there are participants in the videoconference. The terminal may be implemented in hardware, software, or a combination thereof, in a general-purpose PC or other network access device. The scalable video codecs incorporated in the terminal may be based on coding methods and techniques that are consistent with, or based on, industry-standard encoding methods such as H.264.

In an H.264-based SVC solution, a scalable video codec creates a base layer that is based on standard H.264 AVC encoding. The scalable video codec further creates a series of SNR enhancement layers by successively encoding, again using H.264 AVC, the difference between the original signal and the signal coded at the previous layer, with an appropriate offset. In a version of this scalable video codec, the DC values of the discrete cosine transform (DCT) coefficients are not coded in the enhancement layers and, further, a conventional deblocking filter is not used.
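
The successive-difference idea can be sketched on a short list of sample values. This is only an illustration under stated assumptions: a uniform quantizer (`codec_round_trip`, a hypothetical name) stands in for a real H.264 AVC encode/decode round trip, the offset of 128 and the step sizes are assumed, and the DC-coefficient and deblocking details mentioned above are not modeled.

```python
OFFSET = 128  # assumed offset that shifts signed differences into a non-negative range


def codec_round_trip(samples, step):
    """Stand-in for an encode/decode round trip: uniform quantization."""
    return [step * round(s / step) for s in samples]


def encode_snr_layers(original, steps):
    """Code a base layer, then successively code the remaining difference.

    Returns the running reconstruction available after each layer.
    """
    current = codec_round_trip(original, steps[0])  # base layer
    reconstructions = [current]
    for step in steps[1:]:
        # Difference between the original and what earlier layers convey.
        diff = [o - c + OFFSET for o, c in zip(original, current)]
        coded_diff = codec_round_trip(diff, step)
        # The decoder removes the offset and adds the refinement.
        current = [c + d - OFFSET for c, d in zip(current, coded_diff)]
        reconstructions.append(current)
    return reconstructions


def max_error(original, reconstruction):
    return max(abs(o - r) for o, r in zip(original, reconstruction))


original = [12, 57, 101, 140, 200, 33]
layers = encode_snr_layers(original, steps=[32, 8, 2])
errors = [max_error(original, r) for r in layers]
print(errors)
```

Each added layer in this sketch is coded with a finer step, so the worst-case reconstruction error shrinks as more enhancement layers are received.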

In an SVC solution designed to use SNR scalability as a means of implementing spatial scalability, different quantization parameters (QP) are selected for the base and enhancement layers. The base layer, which is encoded at a higher QP, is optionally low-pass filtered and downsampled for display at the receiving endpoints/terminals.

In another SVC solution, the scalable video codec is designed as a spatially scalable encoder in which a reconstructed base layer H.264 low-resolution signal is upsampled at the encoder and subtracted from the original signal. The difference is offset by a set value and then fed to a standard encoder operating at high resolution. In another version, the upsampled H.264 low-resolution signal is used as an additional possible reference frame in the motion estimation process of a standards-based high-resolution encoder.
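
The first variant (upsample, subtract, offset, encode) can be sketched on a single row of pixel values. The sketch makes several assumptions that are not in the text: 2:1 downsampling by pair averaging, upsampling by sample repetition, an offset of 128, and caller-supplied stand-in functions in place of the low- and high-resolution H.264 encoders.

```python
OFFSET = 128  # hypothetical "set value" added to the difference signal


def downsample2(row):
    """Halve the resolution by averaging neighboring pairs (even length assumed)."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row), 2)]


def upsample2(row):
    """Double the resolution by repeating each sample."""
    return [v for v in row for _ in range(2)]


def encode_spatial(row, base_codec, enh_codec):
    """Return (reconstructed base layer, coded high-resolution difference)."""
    base_recon = base_codec(downsample2(row))
    predicted = upsample2(base_recon)
    diff = [o - p + OFFSET for o, p in zip(row, predicted)]
    return base_recon, enh_codec(diff)


def decode_spatial(base_recon, enh_recon):
    """Rebuild the high-resolution row from both layers."""
    predicted = upsample2(base_recon)
    return [p + d - OFFSET for p, d in zip(predicted, enh_recon)]


def coarse(xs):
    """Lossy stand-in for the low-resolution encoder."""
    return [4 * round(x / 4) for x in xs]


def identity(xs):
    """Lossless stand-in for the high-resolution encoder."""
    return list(xs)


row = [10, 20, 30, 50, 90, 70, 60, 40]
base, enh = encode_spatial(row, coarse, identity)
restored = decode_spatial(base, enh)
print(base, restored == row)
```

A receiver that needs only the small picture can display `base` directly, while a receiver that also gets the enhancement layer recovers the full-resolution row.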

The SVC solutions may involve adjusting or changing threading modes or spatial scalability modes to respond dynamically to network conditions and to participants' display preferences.

Further features of the invention, its nature, and its various advantages will be more apparent from the following detailed description of the preferred embodiments and from the accompanying drawings.

Throughout the figures, unless otherwise stated, the same reference numerals and characters are used to denote like features, elements, components, or portions of the illustrated embodiments. Moreover, while the present invention will now be described in detail with reference to the figures, it is described in connection with the illustrative embodiments.

The present invention provides systems and techniques for scalable video coding (SVC) of video data signals for multipoint and point-to-point videoconferencing applications. The SVC systems and techniques (collectively, "solutions") are designed to allow the delivered video data to be tailored or customized in response to the different user participants/endpoints, network transmission capabilities, environments, or other requirements in a videoconference. The inventive SVC solutions provide compressed video data in a multi-layer format that can be readily switched, layer by layer, between conference participants using convenient zero- or low-algorithmic-delay switching mechanisms. Exemplary zero- or low-algorithmic-delay switching mechanisms, namely Scalable Video Coding Servers (SVCS), are described in co-filed U.S. patent application No. [SVCS].

FIGS. 1A and 1B show an exemplary videoconferencing system 100 arrangement based on the inventive SVC solutions. Videoconferencing system 100 may be implemented in a heterogeneous electronic or computer network environment for multipoint and point-to-point client conferencing applications. System 100 uses one or more networked servers (e.g., an SVCS or MCU 110) to coordinate the delivery of customized data to conference participants or clients 120, 130, and 140. As described in the co-pending U.S. patent application, MCU 110 can coordinate the delivery of a video stream 150 generated by endpoint 140 for transmission to other conference participants. In system 100, video streams are first suitably coded or scaled down, using the inventive SVC techniques, into multiple data components or layers. The multiple data layers may have differing characteristics or features (e.g., spatial resolution, frame rate, picture quality, signal-to-noise ratio (SNR) quality, etc.). The differing characteristics or features of the data layers may be suitably selected in consideration of, for example, varying individual user requirements and infrastructure specifications in the electronic network environment (e.g., CPU capabilities, display size, user preferences, and bandwidth). MCU 110 is suitably configured to select, from a received data stream (e.g., SVC video stream 150), an appropriate amount of information (i.e., SVC layers) for each particular participant/recipient in the conference, and to forward only the selected or requested amount of information/layers to each participant/recipient 120-130. MCU 110 may be configured to make suitable selections in response to receiving-endpoint requests (e.g., the picture quality requested by individual conference participants) and in consideration of network conditions and policies.
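
The per-recipient selection described above can be sketched as a simple forwarding rule. This is an assumption-laden illustration, not the SVCS mechanism of the referenced application: the `SvcLayer` type, the layer names, the bit rates, and the greedy bandwidth rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvcLayer:
    layer_id: str
    kbps: int  # hypothetical bit rate of this layer


def select_layers(stream, max_kbps):
    """Pick the layers to forward to one recipient.

    The base layer (first entry) is always forwarded; enhancement layers are
    added in order while they fit within the recipient's assumed budget.
    """
    chosen, used = [], 0
    for layer in stream:
        if chosen and used + layer.kbps > max_kbps:
            break
        chosen.append(layer.layer_id)
        used += layer.kbps
    return chosen


# A stream like video stream 150: one base layer plus enhancement layers.
stream_150 = [
    SvcLayer("base", 128),
    SvcLayer("enh1", 256),
    SvcLayer("enh2", 512),
]

# Hypothetical downlink budgets for recipients 120 and 130.
forwarded = {
    120: select_layers(stream_150, max_kbps=1000),
    130: select_layers(stream_150, max_kbps=400),
}
print(forwarded)
```

The server in this sketch only chooses which already-coded layers to pass on, so it never decodes or re-encodes the video, which is consistent with the low-delay goal stated earlier.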

This customized data selection and forwarding scheme exploits the internal structure of the SVC video stream, which allows the video stream to be clearly divided into multiple layers having different resolutions, frame rates, and/or bandwidths, etc. FIG. 1B, which is reproduced from the referenced patent application No. [SVCS], shows an exemplary internal structure of SVC video stream 150, which represents the media input of endpoint 140 to the conference. The exemplary internal structure of SVC video stream 150 includes a "base" layer 150b and one or more distinct "enhancement" layers 150a.

FIG. 2 shows an exemplary participant/endpoint terminal 140, which is designed for use with SVC-based videoconferencing systems (e.g., system 100). Terminal 140 includes human interface input/output devices (e.g., a camera 210A, a microphone 210B, a video display 250C, and a speaker 250D) and a network interface controller card (NIC) 230 coupled to input and output signal multiplexer and demultiplexer units (e.g., packet MUX 220A and packet DMUX 220B). NIC 230 may be a standard hardware component, such as an Ethernet LAN adapter or any other suitable network interface device.

Camera 210A and microphone 210B are designed to capture the participant's video and audio signals, respectively, for transmission to other conference participants. Conversely, video display 250C and speaker 250D are designed to display and play back, respectively, the video and audio signals received from other participants. Video display 250C may also optionally be configured to display the participant/terminal 140's own video. The outputs of camera 210A and microphone 210B are coupled to video and audio encoders 210G and 210H via analog-to-digital converters 210c and 210D, respectively. Video and audio encoders 210G and 210H are designed to compress the input video and audio digital signals in order to reduce the bandwidth required to transmit the signals over the electronic communication network. The input video signal may be a live signal or a pre-recorded and stored video signal.

Video encoder 210G has multiple outputs that are connected directly to packet MUX 220A. The output of audio encoder 210H is also connected directly to packet MUX 220A. The compressed, layered video and audio digital signals from encoders 210G and 210H are multiplexed by packet MUX 220A for transmission over the communication network via NIC 230. Conversely, the compressed video and audio digital signals received over the communication network by NIC 230 are forwarded to packet DMUX 220B for demultiplexing, and are further processed in terminal 140 for display and playback via video display 250C and speaker 250D.

The captured audio signal may be encoded by audio encoder 210H using any suitable encoding technique, including known techniques such as G.711 and MPEG-1. In implementations of videoconferencing system 100 and terminal 140, G.711 encoding is preferred for audio. The captured video signal is encoded in a layered coding format by video encoder 210G using the SVC techniques described herein. Packet MUX 220A may be configured to multiplex the input video and audio signals using, for example, the RTP protocol or another suitable protocol. Packet MUX 220A may also be configured to perform any required QoS-related protocol processing.

In system 100, each stream of data from terminal 140 is transmitted over the electronic communication network on its own virtual channel (or port number, in IP terminology). In an exemplary network configuration, QoS may be provided via Differentiated Services (DiffServ) for particular virtual channels, or by any other similar QoS-enabling technique. The required QoS setup is performed before the system described herein is used. DiffServ (or the similar QoS-enabling technique used) creates two different categories of channels, which are implemented at or through network routers (not shown). For convenience of description, the two categories of channels are referred to herein as “high reliability” (HRC) and “low reliability” (LRC) channels, respectively. If there is no explicit method for establishing an HRC, or if the HRC itself is not sufficiently reliable, the endpoint (or MCU 110 on behalf of the endpoint) may (i) proactively transmit the information repeatedly over the HRC (the actual number of repeated transmissions may depend on channel error conditions), or (ii) cache the information and retransmit it upon request of a receiving endpoint or SVCS, for example, when information loss in transmission is detected and immediately reported. These methods of establishing an HRC may be applied individually or in any combination on client-to-MCU, MCU-to-client, or MCU-to-MCU connections, depending on the available channel types and conditions.
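As a rough illustration of option (i), the sketch below chooses how many copies of a packet to send so that the chance of losing every copy falls to a target value. The function name `repeat_count`, the independence of losses between copies, and the numeric values are assumptions made for this example; the text above states only that the number of repetitions may depend on channel error conditions.

```python
def repeat_count(loss_rate, target_residual_loss):
    """Number of copies to send so that the probability of losing all of
    them is at most target_residual_loss.

    Assumes each copy is lost independently with probability loss_rate
    (a simplification adopted only for this sketch).
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    if target_residual_loss <= 0.0:
        raise ValueError("target_residual_loss must be positive")
    copies, residual = 1, loss_rate
    while residual > target_residual_loss:
        copies += 1
        residual *= loss_rate  # probability that every copy so far was lost
    return copies
```

Under these assumptions, a channel that loses 5% of packets would need three copies to bring the residual loss to 0.1% or less. Option (ii), caching and retransmitting on request, trades this fixed overhead for an extra round trip when a loss occurs.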

For use in multi-participant videoconferencing systems, terminal 140 is configured with one or more pairs of video and audio decoders (e.g., decoders 230A and 230B) designed to decode the signals received from the conference participants who are to be seen or heard at terminal 140. The pairs of decoders 230A and 230B may be designed to process the signals of one participant at a time, or to process the signals of several participants sequentially. The configuration or combination of pairs of video and audio decoders 230A and 230B included in terminal 140 may be suitably selected to process all of the participant signals received at terminal 140, taking into account the parallel and/or sequential processing design features of the decoders. Further, packet DMUX 220B may be configured to receive packetized signals from conference participants via NIC 230 and to forward the signals to the appropriate pairs of video and audio decoders 230A and 230B for parallel and/or sequential processing.

Further, in terminal 140, the outputs of audio decoders 230B are connected to an audio mixer 240 and a digital-to-analog converter (DA/C) 250B, which drives speaker 250D to play back the received audio signals. Audio mixer 240 is designed to combine the individual audio signals into a single signal for playback. Similarly, the outputs of video decoders 230A are combined in frame buffer 250A by a compositor 260. The combined or composite video picture from frame buffer 250A is displayed on monitor 250C.

Compositor 260 may be suitably designed to position each decoded video picture at a corresponding designated location in the composite frame or displayed picture. For example, the display of monitor 250C may be divided into four smaller areas. Compositor 260 may obtain pixel data from each of the video decoders 230A in terminal 140 and place the pixel data at the appropriate position in frame buffer 250A (e.g., by filling the lower-right picture). To avoid double buffering (e.g., once at the output of decoder 230A and once at frame buffer 250A), compositor 260 may be configured, for example, as an address generator that drives the placement of the output pixels of decoder 230A. Alternative techniques for optimizing the placement of the individual video decoder 230A outputs on display 250C may also be used to similar effect.
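The four-area layout mentioned above can be sketched as follows. This is a minimal illustration, assuming each decoded picture is a list of pixel rows of equal size; the function name `composite` and the 2x2 tiling order are choices made for this example and are not taken from the specification.

```python
def composite(decoded_pictures, tile_width, tile_height):
    """Copy up to four decoded pictures into a 2x2 composite frame.

    Each picture is a list of tile_height rows of tile_width pixel values.
    Pictures are placed left to right, top to bottom, so the fourth picture
    fills the lower-right area.
    """
    frame = [[0] * (2 * tile_width) for _ in range(2 * tile_height)]
    for index, picture in enumerate(decoded_pictures[:4]):
        x0 = (index % 2) * tile_width    # column offset of this tile
        y0 = (index // 2) * tile_height  # row offset of this tile
        for y in range(tile_height):
            # Write each decoded row directly at its final frame position.
            frame[y0 + y][x0:x0 + tile_width] = picture[y]
    return frame
```

Because each row is written directly at its destination, no intermediate per-decoder picture copy is held, which loosely mirrors the address-generator idea described in the text.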

It will be understood that the various components of terminal 140 shown in FIG. 2 may be implemented in any suitable combination of hardware and/or software components that are suitably interfaced with each other. The components may be distinct stand-alone units, or they may be integrated with a personal computer or other device having network access capabilities.

With reference to the video encoders used in terminal 140 for scalable video coding, FIGS. 3-9 show various scalable video encoders or codecs 300-900, respectively, that may be deployed in terminal 140.

FIG. 3 shows an exemplary encoder architecture 300 for compressing input video signals into a layered coding format (e.g., layers L0, L1, and L2 in SVC terminology, where L0 has the lowest frame rate). Encoder architecture 300 represents a motion-compensated, block-based transform codec based, for example, on the standard H.264/MPEG-4 AVC design or another suitable codec design. In addition to the conventional “text-book” variety of video coding process blocks 330 for motion estimation (ME), motion compensation (MC), and other encoding functions, encoder architecture 300 includes a frame buffer block 310, an ENC REF control block 320, and a deblocking filter block 360. The motion-compensated, block-based codec used in system 100/terminal 140 may be a single-layer temporally predictive codec, which has a regular structure of I, P, and B pictures. A picture sequence (in display order) may, for example, be “IBBPBBP”. In the picture sequence, a P picture is predicted from the previous P or I picture, whereas a B picture is predicted using both the previous and the next P or I picture. The number of B pictures between successive I or P pictures can vary, as can the rate at which I pictures appear, but it is not possible, for example, for a P picture to use as a prediction reference another P picture that is earlier in time than the most recent one. Standard H.264 coding is advantageous in that it provides an exception: two reference picture lists are maintained at the encoder and the decoder, respectively. This exception is exploited by the present invention to select which pictures are used as references and which references are used for a particular picture that is to be coded. In FIG. 3, frame buffer block 310 represents the memory for storing the reference picture lists. ENC REF control block 320 is designed to determine, at the encoder side, which reference picture is to be used for the current picture.

The operation of ENC REF control block 320 is described with further reference to the exemplary layered picture coding “threading” or “prediction chain” structure shown in FIG. 4. (FIGS. 8-9 show alternative threading structures.) Codec 300 as used in implementations of the present invention may be configured to generate a set of separate picture “threads” (e.g., a set of three threads 410-430) in order to enable multiple levels of temporal scalability resolution (e.g., L0-L2) and other enhancement resolutions (e.g., S0-S2). A thread or prediction chain is defined as a sequence of pictures that are motion-compensated using pictures either from the same thread or from a lower-level thread. The arrows in FIG. 4 indicate the direction, source, and target of prediction for the three threads 410-430. Threads 410-430 have a common source L0 but different targets and paths (e.g., targets L2, L2, and L0, respectively). The use of threads makes it possible to implement temporal scalability, since any number of top-level threads can be eliminated without affecting the decoding process of the remaining threads.

It is noted that in encoder 300, in accordance with H.264, the ENC REF control block may use only P pictures as reference pictures. However, B pictures may also be used, with an accompanying gain in overall compression efficiency. Using even a single B picture in the set of threads (e.g., by coding L2 as a B picture) can improve compression efficiency. In conventional interactive communications, the use of B pictures with prediction from future pictures is avoided because it increases the coding delay. However, the present invention allows the design of MCUs with practically zero processing delay. (See co-filed U.S. patent application [SVCS].) With such MCUs, it is possible to use B pictures and still operate with an end-to-end delay that is lower than that of state-of-the-art conventional systems.

In operation, output L0 of encoder 300 is simply a set of P pictures spaced four pictures apart. Output L1 has the same frame rate as L0, but only prediction based on the previous L0 picture is allowed. Output L2 pictures are predicted from the most recent L0 or L1 picture. Output L0 provides one-fourth (1:4) of the full temporal resolution, L1 doubles the L0 frame rate (1:2), and L2 doubles the L0+L1 frame rate (1:1). A smaller number of layers (e.g., fewer than the three layers L0-L2) or additional layers may similarly be constructed by encoder 300 to accommodate different bandwidth/scalability requirements or different specifications of implementations of the present invention.
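The three-layer structure described above can be sketched in a few lines. The mapping below is one reading of the text, assuming pictures are numbered from 0 in display order with L0 at every fourth picture and L1 midway between L0 pictures; the function names are hypothetical.

```python
def temporal_layer(n):
    """Temporal layer index (0, 1, or 2) of picture n."""
    if n % 4 == 0:
        return 0  # L0: every fourth picture
    if n % 4 == 2:
        return 1  # L1: midway between successive L0 pictures
    return 2      # L2: the remaining pictures


def reference_picture(n):
    """Index of the picture that picture n is predicted from (None for n=0)."""
    if n == 0:
        return None
    layer = temporal_layer(n)
    if layer == 0:
        return n - 4  # previous L0 picture
    if layer == 1:
        return n - 2  # previous L0 picture
    return n - 1      # most recent L0 or L1 picture


def kept_pictures(count, highest_layer):
    """Pictures that remain when all layers above highest_layer are dropped."""
    return [n for n in range(count) if temporal_layer(n) <= highest_layer]
```

In this sketch, dropping L2 leaves pictures 0, 2, 4, 6, and so on (half the frame rate), and every remaining picture still has its reference available, which is the property that makes the threads independently removable.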

In accordance with the present invention, for additional scalability, each compressed temporal video layer (e.g., L0-L1) may include or be associated with one or more additional components related to SNR quality scalability and/or spatial scalability. FIG. 4 shows one additional enhancement layer (SNR or spatial). Note that this additional enhancement layer has three different components (S0-S2), each corresponding to one of the three different temporal layers (L0-L2).

FIGS. 5 and 6 show SNR scalability encoders 500 and 600, respectively. FIGS. 7-9 show spatial scalability encoders 700-900, respectively. It will be understood that SNR scalability encoders 500 and 600 and spatial scalability encoders 700-900 are based on, and may use, the same processing blocks (e.g., blocks 330, 310, and 320) as encoder 300 (FIG. 3).

It will be understood that for the base layer of an SNR scalable codec, the input to the base layer codec is the full-resolution signal (FIGS. 5-6). In contrast, for the base layer of a spatial scalability codec, the input to the base layer codec is a downsampled version of the input signal (FIGS. 7-9). It is also noted that the SNR/spatial quality enhancement layers S0-S2 may be coded according to the forthcoming ITU-T H.264 Annex F standard or another suitable technique.

FIG. 5 shows the structure of an exemplary SNR enhancement encoder 500, which is similar to the structure of the H.264-based layered encoder 300 shown in FIG. 3. Note, however, that the input to SNR enhancement layer coder 500 is the difference between the original picture (INPUT in FIG. 3) and the reconstructed coded picture (REF in FIG. 3) as recreated at the encoder.

FIG. 5 also shows the use of H.264-based encoder 500 for encoding the coding error of the previous layer. Such coding requires non-negative inputs. To ensure this, the input to encoder 500 (INPUT minus REF) is offset by a positive bias (e.g., by offset 340). The positive bias is removed after decoding and before the enhancement layer is added to the base layer. The deblocking filter that is typically used in H.264 codec implementations (e.g., deblocking filter 360 in FIG. 3) is not used in encoder 500. Further, to improve subjective coding efficiency, the DC discrete cosine transform (DCT) coefficients in the enhancement layer may optionally be ignored or eliminated in encoder 500. Experimental results indicate that eliminating the DC values in the SNR enhancement layers (S0-S2) does not adversely affect picture quality, possibly because of the already fine quantization performed at the base layer. A benefit of this design is that exactly the same encoding/decoding hardware or software can be used for both the base layer and the SNR enhancement layers. In a similar manner, spatial scalability (at any ratio) can be introduced by applying H.264 layer coding to a downsampled image and upsampling the reconstructed image before computing the residual. Further, standards other than H.264 may be used to compress both layers.
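The biasing step can be illustrated with a small sketch that ignores the actual transform coding of the enhancement layer. The bias value of 128, the 8-bit sample range, and the function names are assumptions made for this example; the text states only that a positive bias is added before coding and removed after decoding.

```python
BIAS = 128       # assumed positive bias for 8-bit samples (hypothetical value)
MAX_SAMPLE = 255  # assumed 8-bit sample range


def clip(value):
    """Limit a sample to the assumed 8-bit range."""
    return min(max(value, 0), MAX_SAMPLE)


def enhancement_input(original, reconstructed):
    """Biased difference (INPUT minus REF) fed to the enhancement encoder."""
    return [clip(o - r + BIAS) for o, r in zip(original, reconstructed)]


def add_enhancement(base, decoded_enhancement):
    """Remove the bias, then add the enhancement layer to the base layer."""
    return [clip(b + (e - BIAS)) for b, e in zip(base, decoded_enhancement)]
```

In this sketch the enhancement layer is passed through unchanged, so adding it back recovers the original samples exactly; in a real codec the enhancement layer is itself quantized, and the result would only approximate the original.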

In the codec of the present invention, in order to decouple SNR and temporal scalability, all motion prediction within a temporal layer and across temporal layers may be performed using the base layer streams only. This feature is shown in FIG. 4 by arrow 415, which indicates temporal prediction in the base layer block (L) rather than in the combination of the L and S blocks. For this feature, all layers may be coded at CIF resolution. QCIF-resolution pictures may then be derived by decoding the base layer stream at a given temporal resolution and downsampling in each spatial dimension by a dyadic factor (2), using appropriate low-pass filtering. In this way, SNR scalability can also be used to provide spatial scalability. It will be understood that the CIF/QCIF resolutions are referred to for purposes of illustration only. Other resolutions (e.g., VGA/QVGA) may be supported by the inventive codec without any change to the codec design. The codec may also include conventional spatial scalability features, in the same or a similar manner as described above for including the SNR scalability feature. Techniques provided by MPEG-2 or H.264 Annex F may be used to include the conventional spatial scalability features.

The codec architecture described above, which is designed to decouple SNR and temporal scalability, allows frame rates in the ratios 1:4 (L0 only), 1:2 (L0 and L1), or 1:1 (all three layers). A 100% bit rate increase is assumed for doubling the frame rate (the base is 50% of the total), and a 150% increase is assumed for adding the S layer at the same scalability point (the base is 40% of the total). In a preferred implementation, the total stream may operate, for example, at 500 Kbps, with the base layer operating at 200 Kbps. A rate load of 200/4 = 50 Kbps per frame may be assumed for the base layer, and (500-200)/4 = 75 Kbps per frame for the remaining (enhancement) bit rate. It will be understood that the foregoing target bit rates and layer bit rate ratios are exemplary and are specified only to illustrate the features of the present invention, and that the inventive codec can easily be adapted to other target bit rates or layer bit rate ratios.
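The arithmetic in this paragraph can be checked with a short sketch. It assumes, as one reading of the layer structure, that in every group of four pictures one belongs to L0, one to L1, and two to L2, and that the bit rate is spread evenly over the four pictures; the function name and the per-layer breakdown are illustrative and are not stated in this form in the text.

```python
def layer_rates(total_kbps=500, base_kbps=200):
    """Per-layer bit rates, assuming an even split over four pictures."""
    base_per_picture = base_kbps / 4                # 200/4 = 50 Kbps
    enh_per_picture = (total_kbps - base_kbps) / 4  # (500-200)/4 = 75 Kbps
    pictures_per_group = {"0": 1, "1": 1, "2": 2}   # assumed layer structure
    rates = {}
    for index, count in pictures_per_group.items():
        rates["L" + index] = base_per_picture * count
        rates["S" + index] = enh_per_picture * count
    return rates


rates = layer_rates()
doubling_increase = (rates["L0"] + rates["L1"]) / rates["L0"] - 1  # L0 -> L0+L1
s_layer_increase = (rates["L0"] + rates["S0"]) / rates["L0"] - 1   # L0 -> L0+S0
base_share = rates["L0"] / (rates["L0"] + rates["S0"])
```

Under these assumptions the figures quoted in the text are reproduced: adding L1 to L0 raises the rate by 100%, adding S0 to L0 raises it by 150%, and the base then accounts for 40% of the total at that scalability point.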

Theoretically, a scalability of up to 1:10 (total versus base) is available when the total stream and the base layer operate at 500 Kbps and 200 Kbps, respectively. Table I shows examples of the different scalability options that are available when SNR scalability is used to provide spatial scalability.

FIG. 6 shows an alternative SNR scalable encoder 600, which is based on a single-coding-loop scheme. The structure and operation of SNR scalable encoder 600 are based on those of encoder 300 (FIG. 3). In addition, in encoder 600, the DCT coefficients quantized by Q0 are inverse-quantized and subtracted from the original unquantized coefficients to obtain the residual quantization error (QDIFF 610) of the DCT coefficients. The residual quantization error information (QDIFF 610) is further quantized with a finer quantizer Q1 (block 620), entropy coded (VLC/BAC), and output as the SNR enhancement layer S. Note that only a single coding loop is in operation, namely the one operating at the base layer.
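The two-stage quantization can be sketched for a single DCT coefficient. The uniform quantizers, the step sizes, and the function names below are assumptions chosen for illustration, and entropy coding is omitted; only the Q0, inverse-quantize, subtract, Q1 sequence follows the description above.

```python
def quantize(value, step):
    """Uniform quantizer: map a value to an integer level (assumed form)."""
    return round(value / step)


def dequantize(level, step):
    """Inverse of the uniform quantizer."""
    return level * step


def encode_coefficient(coeff, q0_step, q1_step):
    """Return (base level, enhancement level) for one DCT coefficient."""
    base_level = quantize(coeff, q0_step)              # coarse quantizer Q0
    qdiff = coeff - dequantize(base_level, q0_step)    # residual error (QDIFF)
    enh_level = quantize(qdiff, q1_step)               # finer quantizer Q1
    return base_level, enh_level


def decode_coefficient(base_level, enh_level, q0_step, q1_step, use_enhancement=True):
    """Reconstruct the coefficient from the base layer, optionally refined."""
    value = dequantize(base_level, q0_step)
    if use_enhancement:
        value += dequantize(enh_level, q1_step)
    return value
```

With assumed step sizes of 16 for Q0 and 4 for Q1, a coefficient of 37 is reconstructed as 32 from the base layer alone and as 36 when the enhancement layer is added, which shows how the S layer refines the base-layer approximation without a second prediction loop.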

The terminal 140/video 230 encoder may be configured to provide spatial scalability enhancement layers in addition to, or instead of, the SNR quality enhancement layers. For coding a spatial scalability enhancement layer, the input to the encoder is the difference between the original high-resolution picture and the upsampled, reconstructed coded picture created at the encoder. The encoder operates on a downsampled version of the input signal. FIG. 7 shows an exemplary encoder 700 for encoding the base layer for spatial scalability. Encoder 700 includes a downsampler 710 at the input of a low-resolution base layer encoder 720. For a full-resolution input signal at CIF resolution, base layer encoder 720 may operate, with suitable downsampling, at QCIF, HCIF (half CIF), or any other resolution lower than CIF. In an exemplary mode, base layer encoder 720 may operate at HCIF. The HCIF mode of operation requires downsampling each dimension of the CIF-resolution input signal by a factor of about √2, which reduces the total number of pixels in a picture to about one-half of the original input. Note that in a videoconferencing application, if QCIF resolution is desired for display purposes, the decoded base layer must be further downsampled from HCIF to QCIF.
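The HCIF dimensions implied by this description can be worked out directly. The sketch assumes the usual CIF picture size of 352x288 and simple rounding to the nearest integer; the function name is hypothetical, and the text itself does not give exact HCIF dimensions.

```python
import math

CIF = (352, 288)  # standard CIF width and height in pixels


def downsample_size(size, pixel_ratio):
    """Picture size whose pixel count is about pixel_ratio times the input.

    Each dimension is scaled by sqrt(pixel_ratio), so pixel_ratio = 1/2
    corresponds to dividing each dimension by about sqrt(2).
    """
    scale = math.sqrt(pixel_ratio)
    return tuple(round(dimension * scale) for dimension in size)


hcif = downsample_size(CIF, 0.5)   # half the pixels of CIF
qcif = downsample_size(CIF, 0.25)  # a quarter of the pixels of CIF
hcif_pixel_share = (hcif[0] * hcif[1]) / (CIF[0] * CIF[1])
```

Under these assumptions HCIF comes out at roughly 249x204 pixels, about half the pixel count of CIF, while the quarter-pixel case gives the familiar 176x144 QCIF size. A practical encoder would probably also align the dimensions to its block size, which is not shown here.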

It is recognized that an inherent difficulty in optimizing the scalable video coding process for videoconferencing applications is that the transmitted video signal has two or more resolutions. Improving the quality of one resolution may result in a corresponding degradation of the quality of the other. This difficulty is particularly pronounced for spatially scalable coding, and in state-of-the-art videoconferencing systems in which the coded resolution and the display resolution are identical. The inventive technique of decoupling the coded signal resolution from the intended display resolution gives codec designers yet another tool for achieving a better balance between the quality and the bit rate associated with each resolution. In accordance with the present invention, the coded resolution for a particular codec may be selected by considering the rate-distortion (R-D) performance of the codec across different spatial resolutions, taking into account the total available bandwidth, the desired bandwidth partition across the different resolutions, and the desired quality differential that each additional layer should provide.

Under such a scheme, a signal may be coded at CIF and at one-third CIF (1/3 CIF) resolutions. Both CIF and HCIF resolution signals may be derived for display from the CIF-coded signal. Similarly, both 1/3 CIF and QCIF resolution signals may be derived for display from the 1/3 CIF-coded signal. The CIF and 1/3 CIF resolution signals are available directly from the decoded signals, whereas the HCIF and QCIF resolution signals are obtained by suitably downsampling the decoded signals. Similar schemes may also be applied to other target resolutions (e.g., VGA and one-third VGA, from which one-half VGA and one-fourth VGA can be derived).

In accordance with the present invention, the scheme of decoupling the coded signal resolution from the intended display resolution, together with the schemes for threading the video signal layers (FIG. 4 and FIGS. 15-16), provides additional possibilities for obtaining target spatial resolutions at different bit rates. For example, in a video signal coding scheme, spatial scalability may be used to encode the source signal at CIF and 1/3 CIF resolutions. SNR and temporal scalability may be applied to the video signal as shown in FIG. 4. Further, the SNR coding used may be a single-loop or a double-loop encoder (e.g., encoder 600 of FIG. 6 or encoder 500 of FIG. 5), or may be obtained by data partitioning (DP). The double-loop or DP coding schemes are likely to introduce drift when data is lost or removed. However, the use of the layering structure limits the propagation of such drift errors to the next L0 picture, as long as the lost or removed data belongs to the L1, L2, S1, or S2 layers. Further, considering that the perception of errors is reduced when the spatial resolution of the displayed video signal is reduced, a low-bandwidth signal can be obtained by eliminating or removing data from the L1, L2, S1, and S2 layers, decoding the 1/3 CIF resolution, and downsampling it for display at QCIF resolution. The resulting data loss will cause errors in the corresponding L1/S1 and L2/S2 pictures and will also propagate errors to future pictures (until the next L0 picture), but the reduced display resolution makes the quality degradation less visible to a human observer. A similar scheme may be applied to the CIF signal for display at HCIF, 2/3 CIF, or any other desired resolution. These schemes advantageously allow quality scalability to be used to achieve spatial scalability at various resolutions and at various bit rates.

FIG. 8 shows the structure of an exemplary spatially scalable enhancement layer encoder 800, which, like encoder 500, uses the same H.264 encoder structure to encode the coding error of the previous layer, but includes an upsampler block 810 on the reference (REF) signal. Because such an encoder assumes non-negative input, the input values are offset (e.g., by offset 340) prior to coding. Values that still remain negative are clipped to zero. The offset is removed after decoding and before the enhancement layer is added to the upsampled base layer.
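The offset-and-clip handling described for encoder 800 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the offset value of 128, the 8-bit upper bound of 255, and the function names are all hypothetical, since the text specifies only that the input is offset, that negative values are clipped to zero, and that the offset is removed after decoding.

```python
OFFSET = 128   # assumed value; the text does not state the offset
MAX_VAL = 255  # assumed 8-bit sample range; the text only mentions clipping at zero


def encoder_input(samples, upsampled_ref, offset=OFFSET):
    """Difference signal fed to the enhancement layer encoder, made non-negative."""
    out = []
    for x, r in zip(samples, upsampled_ref):
        v = x - r + offset                  # offset the coding error before coding
        out.append(min(max(v, 0), MAX_VAL))  # values still negative are clipped to zero
    return out


def reconstruct(decoded, upsampled_base, offset=OFFSET):
    """Remove the offset after decoding, then add the upsampled base layer."""
    out = []
    for d, b in zip(decoded, upsampled_base):
        v = b + (d - offset)
        out.append(min(max(v, 0), MAX_VAL))
    return out
```

With a lossless enhancement layer, samples whose difference stays inside the clipping range are recovered exactly; samples whose difference was clipped are not, which is the price of forcing the difference signal into a non-negative range.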

For spatial enhancement layer coding, as for SNR layer coding (FIG. 6), it can be advantageous to use frequency weighting in the quantizers (Q) of the DCT coefficients. Specifically, coarser quantization can be used for the DC coefficient and its surrounding AC coefficients. For example, doubling the quantizer step size for the DC coefficient can be very effective.

FIG. 9 shows an exemplary structure of another spatially scalable video encoder 900. In encoder 900, unlike encoder 800, the upsampled reconstructed base layer picture (REF) is not subtracted from the input; instead, it serves as an additional possible reference picture in the motion estimation and mode selection block 330 of the enhancement layer encoder. Encoder 900 can accordingly be configured to predict the current full-resolution picture either from a previously coded full-resolution picture (or a future picture, for B pictures) or from an upsampled version of the same picture coded at the lower spatial resolution (inter-layer prediction). Note that whereas encoder 800 can be implemented using the same codec for the base and enhancement layers, with only the addition of the downsampler 710, upsampler 810, and offset 340 blocks, encoder 900 requires that the motion estimation (ME) block 330* of the enhancement layer encoder be modified. Note further that enhancement layer encoder 900 operates in the regular pixel domain rather than in a differential domain.

It is also possible to combine the predictions from a previous high-resolution picture and the upsampled base layer picture by using the B-picture prediction logic of a standard single-layer encoder, such as an H.264 encoder. This can be accomplished by modifying the B-picture prediction reference for the high-resolution signal so that the first picture is the regular or standard previous high-resolution picture and the second picture is the upsampled version of the base layer picture. The encoder then performs prediction as if the second picture were a regular B picture, thereby making use of all the high-efficiency motion vector prediction and coding modes of the encoder (e.g., the spatial and temporal direct modes). Note that in H.264, "B picture" coding stands for "bi-predictive" rather than "bi-directional", in the sense that the two reference pictures can both be past or both be future pictures of the picture being coded, whereas in conventional "bi-directional" B-picture coding (e.g., MPEG-2) one of the two reference pictures is a past picture and the other is a future picture. This embodiment allows a standard encoder design to be used with minimal changes, limited to the picture reference control logic and the upsampling module.

In implementations of the present invention, the SNR and spatial scalability coding modes can be combined in one encoder. In such an implementation, the video threading structure (e.g., as shown in two dimensions in FIG. 4) can be extended into a third dimension, corresponding to the additional third scalability layer (SNR or spatial). An implementation in which SNR scalability is added to the full-resolution signal of a spatially scalable codec can be attractive in terms of the range of available qualities and bit rates.

FIGS. 10-14 show exemplary architectures for a base layer decoder 1000, an SNR enhancement layer decoder 1100, a single-loop SNR enhancement layer decoder 1200, a spatially scalable enhancement layer decoder 1300, and a spatially scalable enhancement layer decoder 1400 with inter-layer motion prediction, respectively. These decoders complement encoders 300, 500, 600, 700, 800, and 900. Decoders 1000, 1100, 1200, 1300, and 1400 can be included in decoder 230A of terminal 140 as appropriate or needed.

The scalable video coding/decoding configuration of terminal 140 presents several options for transmitting the resulting layers over the HRC and LRC of system 100. For example, the (L0 and S0) layers or the (L0, S0, and L1) layers can be transmitted over the HRC. Alternative combinations can also be used as desired, after due consideration of network conditions and of the bandwidths of the high-reliability and low-reliability channels. For example, depending on network conditions, it may be desirable to code S0 in intra mode but not to transmit S0 over the protected HRC. In such a case, the frequency of intra-mode coding, which involves no prediction, can depend on network conditions or can be determined in response to the losses reported by a receiving endpoint. The S0 prediction chain can be refreshed in this way (i.e., if there was an error at the S0 level, any drift is eliminated).

FIGS. 15 and 16 show alternative threading or prediction chain architectures 1500 and 1600, which can be used in video communication or conferencing applications in accordance with the present invention. Implementing threading structures or prediction chains 1500 and 1600 does not require any substantial change to the codec designs described above with reference to FIGS. 2-14.

In architecture 1500, an exemplary combination of layers (S0, L0, and L1) is transmitted over the high-reliability channel 170. Note that, as shown, L1 is part of the L0 prediction chain 430, but not of that for S1. Architecture 1600 shows a further example of a threading configuration, one that can also achieve non-dyadic frame rate resolutions.

The codec designs of system 100 and terminal 140 described above are flexible and can readily be extended to incorporate alternative SVC schemes. For example, coding of the S layer can be accomplished according to the forthcoming ITU-T H.264 SVC FGS specification. When FGS is used, the S layer coding can make use of an arbitrary portion of an "S" packet, owing to the embedded property of the generated bitstream. It may be possible to use portions of the FGS component to create the reference picture for the higher layers. Loss of FGS component information in transmission over the communication network can introduce drift in the decoder. However, the threading architecture used in the present invention advantageously minimizes the effects of such losses. Error propagation can be limited to a small number of frames, so that it goes unnoticed by viewers. The amount of FGS included in the creation of the reference picture can change dynamically.

A proposed feature of the H.264 SVC FGS specification is a leaky prediction technique in the FGS layer. See Y. Bao et al., Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 15th meeting, Busan, Korea, April 18-22, 2005. The leaky prediction technique consists of using a normalized weighted average of the previous FGS enhancement layer picture and the current base layer picture. The weighted average is controlled by a weight parameter alpha: if alpha is 1, only the current base layer picture is used, whereas if alpha is 0, only the previous FGS enhancement layer picture is used. The case where alpha is 0 is identical to the use of motion estimation (ME 330 in FIG. 5) for the SNR enhancement layer of the present invention, in the limiting case of using only zero motion vectors. The leaky prediction technique can be used in conjunction with the regular ME described in this invention. Further, the alpha value can be switched periodically to 0 in order to break the prediction loop in the FGS layer and eliminate error drift.
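The normalized weighted average that the text attributes to leaky prediction can be sketched as below. This is a per-sample illustration only, following the alpha convention stated in the text (alpha = 1 selects the current base layer picture, alpha = 0 the previous FGS enhancement layer picture). The function name and the flat-list picture representation are assumptions, and motion compensation is deliberately left out.

```python
def leaky_prediction(base_current, enh_previous, alpha):
    """Per-sample normalized weighted average of two reference pictures.

    alpha = 1 -> only the current base layer picture is used.
    alpha = 0 -> only the previous FGS enhancement layer picture is used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        alpha * b + (1.0 - alpha) * e
        for b, e in zip(base_current, enh_previous)
    ]
```

For example, `leaky_prediction([100, 200], [110, 180], 0.5)` blends the two pictures equally, while the two endpoint values of alpha reproduce one input or the other unchanged.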

FIG. 17 shows the switch structure of an exemplary MCU/SVCS 110 for use in videoconferencing system 100 (FIG. 1). MCU/SVCS 110 determines which packets from each of the possible sources (e.g., endpoints 120-140) are transmitted to which destination and over which channel (high reliability versus low reliability), and switches the signals accordingly. The design and switching functions of MCU/SVCS 110 are described in co-filed U.S. patent application No. [SVCS], which is incorporated herein by reference. For brevity, only limited details of the switch structure and switching functions of MCU/SVCS 110 are described further here.

FIG. 18 shows the operation of an exemplary embodiment of MCU/SVCS switch 110. MCU/SVCS switch 110 maintains two data structures in its memory: an SVCS switch layer configuration matrix 110A and an SVCS network configuration matrix 110B, examples of which are shown in FIGS. 19 and 20, respectively. SVCS switch layer configuration matrix 110A (FIG. 19) provides information on how a particular data packet should be handled for each layer and for each pair of source and destination endpoints 120-140. For example, a matrix 110A element value of zero indicates that the packet should not be transmitted, a negative element value indicates that the entire packet should be transmitted, and a positive element value indicates that only the specified percentage of the packet's data should be transmitted. Transmission of a specified percentage of a packet's data may be relevant only when an FGS-type technique is used to scalably code the signal.

FIG. 18 also shows an algorithm 1800 in MCU/SVCS 110 for directing data packets using the information in switch layer configuration matrix 110A. At step 1802, MCU/SVCS 110 can examine the received packet headers (e.g., NAL headers, assuming use of H.264). At step 1804, MCU/SVCS 110 evaluates the value of the relevant matrix 110A element for the combination of source, destination, and layer in order to establish the processing instructions and designated destinations for the received packet. In applications that use FGS coding, a positive matrix element value indicates that the packet's payload size must be reduced. Accordingly, at step 1806, the relevant length entry of the packet is changed, and no data is copied. At step 1808, the relevant layer or combination of layers is switched to its designated destination.
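The per-packet decision of algorithm 1800 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Packet` type, the dictionary keyed by (source, destination, layer), the treatment of a positive element as an integer percentage, and the default of "do not send" for missing entries are all hypothetical choices, not details given in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    source: str
    layer: str
    length: int           # payload length field (bytes)
    payload: bytes = b""   # shared by reference, never copied


def route(packet, layer_matrix, destinations):
    """Return a list of (destination, packet) pairs to forward.

    layer_matrix maps (source, destination, layer) to a matrix 110A-style value:
    0 = do not send, negative = send whole packet, positive = percentage to send.
    """
    forwarded = []
    for dst in destinations:
        if dst == packet.source:
            continue  # assumed: a source does not receive its own packets
        value = layer_matrix.get((packet.source, dst, packet.layer), 0)
        if value == 0:
            continue                          # packet is not transmitted
        if value < 0:
            forwarded.append((dst, packet))   # entire packet is transmitted
        else:
            # Only the length entry changes; the payload object is reused as-is.
            shortened = replace(packet, length=packet.length * value // 100)
            forwarded.append((dst, shortened))
    return forwarded
```

Because the decision is a table lookup plus, at most, a change to a length field, nothing in this path decodes or re-encodes video, which is consistent with the low switching delay the description claims for MCU/SVCS 110.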

With reference to FIGS. 18 and 20, SVCS network configuration matrix 110B tracks the port numbers for each participating endpoint. MCU/SVCS 110 can use the matrix 110B information to transmit and receive data for each of the layers.

Operating MCU/SVCS 110 on the basis of processing matrices 110A and 110B allows signal switching to occur with zero or minimal internal algorithmic delay, in contrast to conventional MCU operation. A conventional MCU has to compose incoming video into a new frame for transmission to the various participants. This composition requires full decoding of the incoming streams and re-encoding of the output stream. The decoding/re-encoding processing delay in such MCUs is significant, as is the computational power required. By using a scalable bitstream architecture, and by providing multiple instances of decoder 230A in each endpoint terminal 140 receiver, MCU/SVCS 110 need only filter incoming packets to select the appropriate layers for each recipient destination. Because no DSP processing, or only minimal DSP processing, is required, MCU/SVCS 110 can advantageously be implemented at very low cost, offer excellent scalability (in terms of the number of sessions that can be hosted simultaneously on a given device), and operate with an end-to-end delay that may be only slightly larger than the delay of a direct endpoint-to-endpoint connection.

Terminal 140 and MCU/SVCS 110 can be deployed in different network scenarios using different bit rates and stream combinations. Table II shows possible bit rate and stream combinations in various exemplary network scenarios. Note that a ratio of base bandwidth to total bandwidth >= 50% is the limit of DiffServ layering effectiveness, and further that a temporal resolution below 15 fps is not useful.

Terminal 140 and similar configurations of the present invention allow scalable coding techniques to be exploited in point-to-point and multipoint videoconferencing systems deployed over channels that can provide different QoS guarantees. The selection of the scalable codecs described herein, the selection of a threading model, the choice of which layers to transmit over the high-reliability or low-reliability channel, and the selection of appropriate bit rates (or quantizer step sizes) for the various layers are important design parameters that can vary with particular implementations of the present invention. Typically, such design choices are made once, and the parameters remain constant during the deployment of a videoconferencing system, or at least during a particular videoconferencing session. However, it will be understood that the SVC configurations of the present invention offer the flexibility to adjust these parameters dynamically within a single videoconferencing session. Dynamic adjustment of the parameters may be desirable, taking into account a participant's/endpoint's requirements (e.g., which other participants should be received, at what resolutions, etc.) and network conditions (e.g., loss rates, jitter, availability of bandwidth for each participant, bandwidth partitioning between high-reliability and low-reliability channels, etc.). Under suitable dynamic adjustment schemes, individual participants/endpoints can interactively switch between different threading patterns (e.g., between the threading patterns shown in FIGS. 4, 8, and 9), elect to change how layers are assigned to the high-reliability and low-reliability channels, elect to eliminate one or more layers, or change the bit rate of individual layers. Similarly, MCU/SVCS 110 can be configured to change how layers are assigned to the high-reliability and low-reliability channels linking various participants, eliminate one or more layers, or scale the FGS/SNR enhancement layer for some participants.

In an exemplary scenario, a videoconference can have three participants, A, B, and C. Participants A and B can have access to a high-speed 500 Kbps channel that can guarantee a continuous rate of 200 Kbps. Participant C can have access to a 200 Kbps channel that can guarantee 100 Kbps. Participant A can use a coding scheme that has the following layers: a base layer ("Base"), a temporal scalability layer ("Temporal") that provides 7.5 fps, 15 fps, and 30 fps video at CIF resolution, and an SNR enhancement layer ("FGS") that can increase the spatial resolution at any of the three temporal frame rates. The Base and Temporal components each require 100 Kbps, and FGS requires 300 Kbps, for a total bandwidth of 500 Kbps. Participant A can transmit all three components (Base, Temporal, and FGS) to MCU 110. Similarly, participant B can receive all three components. However, since in this scenario only 200 Kbps is guaranteed to participant B, FGS is transmitted through the non-guaranteed 300 Kbps channel segment. Participant C can receive only the Base and Temporal components, with the Base component guaranteed at 100 Kbps. If the available bandwidth (guaranteed or total) changes, participant A's encoder (e.g., terminal 140) can dynamically change the target bit rate for any of the components in response. For example, if the guaranteed bandwidth exceeds 200 Kbps, more bits can be allocated to the Base and Temporal components. Because encoding takes place in real time (i.e., the video is not pre-coded), such changes can be made dynamically, in real-time response.
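The arithmetic of this scenario can be checked with a small allocation sketch. The greedy policy below (fill the guaranteed segment first, then the non-guaranteed segment, and drop a layer together with everything above it once it no longer fits) is an assumption made for illustration; the text gives only the resulting assignments, not a procedure. The function name `assign_layers` is likewise hypothetical.

```python
def assign_layers(layers, guaranteed_kbps, total_kbps):
    """Split layers between guaranteed and non-guaranteed channel segments.

    layers: list of (name, kbps) pairs, most important layer first.
    Returns (guaranteed, non_guaranteed, dropped) lists of layer names.
    """
    guaranteed, non_guaranteed, dropped = [], [], []
    g_left = guaranteed_kbps
    n_left = total_kbps - guaranteed_kbps
    for i, (name, rate) in enumerate(layers):
        if rate <= g_left and not non_guaranteed:
            guaranteed.append(name)
            g_left -= rate
        elif rate <= n_left:
            non_guaranteed.append(name)
            n_left -= rate
        else:
            # A layer that does not fit is dropped along with all higher layers.
            dropped = [n for n, _ in layers[i:]]
            break
    return guaranteed, non_guaranteed, dropped


LAYERS = [("Base", 100), ("Temporal", 100), ("FGS", 300)]
participant_b = assign_layers(LAYERS, guaranteed_kbps=200, total_kbps=500)
participant_c = assign_layers(LAYERS, guaranteed_kbps=100, total_kbps=200)
```

Under these assumptions, participant B ends up with Base and Temporal on the guaranteed segment and FGS on the non-guaranteed one, while participant C receives Base on the guaranteed segment, Temporal on the non-guaranteed one, and no FGS, matching the scenario in the text.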

If participants B and C are both linked by channels of restricted capacity, e.g., 100 Kbps, participant A can elect to transmit only the Base component. Similarly, if participants B and C elect to view the received video only at QCIF resolution, participant A can respond by not transmitting the FGS component, since the additional quality enhancement provided by the FGS component would be lost when the received CIF video is downsampled to QCIF resolution.

Note that in some scenarios it may be appropriate to transmit a single-layer video stream (base layer or total video) and to avoid the use of scalability layers altogether.

In transmitting scalable video layers over the HRC and LRC, whenever information on the LRC is lost, only the information transmitted on the HRC can be used for video reconstruction and display. In practice, some portions of the displayed video picture will contain data produced by decoding the base layer and the designated enhancement layers, while other portions will contain data produced by decoding only the base layer. If the quality levels associated with the different base layer and enhancement layer combinations differ considerably, the quality difference between displayed video pictures that do and do not include the lost LRC data can become noticeable. The visual effect can be even more pronounced in the temporal dimension, where repeated changes of the displayed picture from base layer to "base layer plus enhancement layer" can be perceived as flicker. To mitigate this effect, it may be desirable to ensure that the quality difference (e.g., in terms of PSNR) between the base layer picture and the "base layer plus enhancement layer" picture is kept low, especially in the static parts of the picture, where flicker is visually more apparent. The quality difference between the base layer picture and the "base layer plus enhancement layer" picture can be deliberately kept low by using suitable rate control techniques that increase the quality of the base layer itself. One such rate control technique is to encode all or some of the L0 pictures with a lower QP value (i.e., a finer quantization value). For example, every L0 picture can be encoded with a QP lowered by a factor of 3. Such finer quantization can increase the quality of the base layer, thus minimizing any flicker effect or equivalent spatial artifacts caused by the loss of enhancement layer information. The lower QP value can also be applied to every other L0 picture, or to every fourth L0 picture, with similar effectiveness in mitigating flicker and similar artifacts. The specific use of a combination of SNR and spatial scalability (e.g., using HCIF coding to represent the base layer carrying QCIF quality) allows, with proper rate control applied to the base layer, static objects to be brought close to HCIF resolution, thus reducing the flicker artifacts that occur when an enhancement layer is lost.

While there have been described what are believed to be the preferred embodiments of the present invention, those skilled in the art will recognize that other and further changes and modifications can be made to the embodiments without departing from the spirit of the invention, and that all such changes and modifications are intended to be claimed as falling within the true scope of the invention.

It will also be understood that, in accordance with the present invention, the scalable codecs described herein can be implemented using any suitable combination of hardware and software. The software (i.e., instructions) for implementing and operating the aforementioned scalable codecs can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, online downloadable media, and other available media.

FIG. 1 (two drawings) is a schematic diagram illustrating exemplary architectures of a videoconferencing system in accordance with the principles of the present invention.

FIG. 2 is a block diagram illustrating an exemplary end-user terminal in accordance with the principles of the present invention.

FIG. 3 is a block diagram illustrating an exemplary architecture of an encoder for the base and temporal enhancement layers (i.e., layers 0 through 2) in accordance with the principles of the present invention.

FIG. 4 is a block diagram illustrating an exemplary layered picture coding structure for the base, temporal enhancement, and SNR or spatial enhancement layers in accordance with the principles of the present invention.

FIG. 5 is a block diagram illustrating the structure of an exemplary SNR enhancement layer encoder in accordance with the principles of the present invention.

FIG. 6 is a block diagram illustrating the structure of an exemplary single-loop SNR video encoder in accordance with the principles of the present invention.

FIG. 7 is a block diagram illustrating an exemplary structure of a base layer for a spatial scalability video encoder in accordance with the principles of the present invention.

FIG. 8 is a block diagram illustrating an exemplary structure of a spatial scalability enhancement layer video encoder in accordance with the principles of the present invention.

FIG. 9 is a block diagram illustrating an exemplary structure of a spatial scalability enhancement layer video encoder with inter-layer motion prediction in accordance with the principles of the present invention.

FIG. 10 is a block diagram illustrating an exemplary base layer video decoder in accordance with the principles of the present invention.

FIG. 11 is a block diagram illustrating an exemplary SNR enhancement layer video decoder in accordance with the principles of the present invention.

FIG. 12 is a block diagram illustrating an exemplary single-loop SNR enhancement layer video decoder in accordance with the principles of the present invention.

FIG. 13 is a block diagram illustrating an exemplary spatial scalability enhancement layer video decoder in accordance with the principles of the present invention.

FIG. 14 is a block diagram illustrating an exemplary video decoder for a spatial scalability enhancement layer with inter-layer motion prediction in accordance with the principles of the present invention.

FIG. 15 is a block diagram illustrating an exemplary alternative layered picture coding structure in accordance with the principles of the present invention.

FIG. 16 is a block diagram illustrating an exemplary threading architecture in accordance with the principles of the present invention.

FIG. 17 is a block diagram illustrating an exemplary scalable video coding server (SVCS) in accordance with the principles of the present invention.

FIG. 18 is a schematic diagram illustrating the operation of an SVCS switch in accordance with the principles of the present invention.

FIG. 19 is an illustration of an exemplary SVCS switch layer configuration matrix in accordance with the principles of the present invention.

FIG. 20 is an illustration of an exemplary SVCS network layer configuration matrix in accordance with the principles of the present invention.

Claims

A system for video communication between a plurality of endpoints via an electronic communication network and one or more servers, wherein the network provides channels of different quality of service and bandwidth linking the plurality of endpoints and the servers, the channels including a designated high reliability channel (HRC) and a low reliability channel (LRC), the system comprising:
transmitting and receiving terminals located at the endpoints, wherein
at least one transmitting terminal is configured to create at least one scalably encoded video signal in a base layer and enhancement layer format for transmission to other terminals, and to transmit at least the base layer via the designated HRC;
at least one receiving terminal is configured to decode the scalably encoded video signal layers received via network channels including the designated HRC, and to reconstruct the video for local use by combining the decoded video signal layers; and
the servers are configured to arbitrate the transfer of the scalably encoded video signal layers transmitted by the transmitting terminal toward the receiving terminal via the electronic communication network channels leading to the receiving terminal.

The system of claim 1, wherein at least one terminal is configured to access at least one of: a raw video signal for encoding and transmission, a stored video signal for encoding and transmission, a synthesized video signal for encoding and transmission, and a pre-encoded video signal for transmission.

An endpoint terminal for video communication with other endpoints via one or more servers located in an electronic communication network, wherein the network provides channels of different quality of service and bandwidth linking the multiple endpoints, the channels including a designated HRC, the terminal comprising:
at least one scalable video encoder configured to scalably encode at least one video signal in a base layer and enhancement layer format; and
a packet multiplexer configured to multiplex the layers of the video signal encoded in the base layer and enhancement layer format for transmission over the electronic communication network,
wherein the endpoint terminal is configured to designate at least the base layer, from among the base layer and enhancement layers of the video signal, for transmission via a network interface controller and the designated HRC.
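
As a rough illustration of the designation step recited above, the following Python sketch sorts multiplexed layer packets onto channels so that base layer packets always travel on the HRC. The names `LayerPacket`, `designate_channel`, `multiplex`, and the `enhancement_on_hrc` option are hypothetical, as is the layer numbering; the claim itself only requires that at least the base layer be designated for the HRC.

```python
from dataclasses import dataclass

HRC = "HRC"  # designated high reliability channel
LRC = "LRC"  # low reliability channel


@dataclass(frozen=True)
class LayerPacket:
    layer: int       # 0 = base layer, 1 and up = enhancement layers (assumed numbering)
    payload: bytes


def designate_channel(packet, enhancement_on_hrc=frozenset()):
    """Return the channel a packet is designated for.

    The base layer always goes to the HRC. Enhancement layers go to the
    LRC unless they are explicitly listed in `enhancement_on_hrc`.
    """
    if packet.layer == 0 or packet.layer in enhancement_on_hrc:
        return HRC
    return LRC


def multiplex(packets):
    """Group packets by designated channel, preserving their order."""
    out = {HRC: [], LRC: []}
    for p in packets:
        out[designate_channel(p)].append(p)
    return out
```

Keeping the designation in one small function makes it easy to promote further layers to the HRC when more reliable capacity is available.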

4. The endpoint terminal according to claim 3, further comprising an audio signal encoder whose output is connected to the packet multiplexer.

The terminal according to claim 3, wherein the scalable video encoder is a motion-compensated, block-based codec comprising:
a frame memory in which one or more decoded frames are stored for future reference; and
a reference controller configured to select a picture type (I, P, or B) as well as the pictures in the frame memory to be used as prediction references for the picture currently being encoded,
the codec being further configured to perform picture prediction using threads as a means of implementing temporal scalability layers.

6. The terminal of claim 5, wherein the scalable video encoder is configured to create one continuous prediction chain path for the base layer.

The terminal according to claim 5, wherein the threads are picture threads comprising:
a base layer thread composed of pictures that are a number of pictures apart, in which temporal prediction is performed using one or more previous pictures of the same thread; and
a temporal enhancement layer thread composed of the remaining pictures, in which prediction is performed from one or more preceding base layer pictures and/or from one or more preceding temporal enhancement layer pictures.

The terminal according to claim 5, wherein the threads are picture threads comprising:
a base layer thread composed of pictures that are a fixed number of pictures apart, in which temporal prediction is performed using the previous frame of the same thread;
a first temporal enhancement layer thread composed of the frames midway between the frames of the base layer thread, in which prediction is performed from the previous base layer picture or from the previous first temporal enhancement layer picture; and
a second temporal enhancement layer thread composed of the remaining pictures, in which prediction is performed from the immediately preceding second temporal enhancement layer picture, the immediately preceding first temporal enhancement layer picture, or the immediately preceding base layer thread picture.
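
The three-thread structure recited above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the base layer spacing of four pictures (`BASE_PERIOD`), the function names, and the choice of a single reference per picture are all hypothetical, since the claim permits several alternative references.

```python
BASE_PERIOD = 4  # assumed spacing between base layer pictures


def thread_of(n):
    """Return the thread of picture n: 0 = base, 1 = first, 2 = second enhancement."""
    if n % BASE_PERIOD == 0:
        return 0
    if n % BASE_PERIOD == BASE_PERIOD // 2:
        return 1
    return 2


def reference_for(n):
    """Pick one permitted reference picture for picture n (None for picture 0)."""
    layer = thread_of(n)
    if layer == 0:
        # Base thread: previous picture of the same thread.
        return n - BASE_PERIOD if n >= BASE_PERIOD else None
    if layer == 1:
        # First enhancement thread: the previous base layer picture.
        return n - BASE_PERIOD // 2
    # Second enhancement thread: the immediately preceding picture,
    # which with a spacing of four always belongs to thread 0 or 1.
    return n - 1


def droppable(max_layer, num_pictures):
    """True if keeping only threads <= max_layer leaves every reference available."""
    for n in range(num_pictures):
        if thread_of(n) > max_layer:
            continue
        ref = reference_for(n)
        if ref is not None and thread_of(ref) > max_layer:
            return False
    return True
```

Because no picture references a higher thread, discarding thread 2 (or threads 1 and 2) leaves a decodable stream at a lower frame rate, which is what makes the threads usable as temporal scalability layers.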

The terminal according to claim 5, wherein the scalable video encoder is configured to encode the base temporal layer frames with quantization finer than the quantization used for the other temporal layers, whereby the base layer is encoded more accurately than the other layers.

The terminal of claim 5, wherein the scalable video encoder is configured to create at least one prediction chain that terminates at an enhancement layer.

The terminal of claim 5, wherein the temporal prediction codec further comprises an SNR quality scalability layer encoder.

The terminal according to claim 11, wherein the SNR quality scalability enhancement layer encoder takes as its input the residual coding error of the base layer, obtained by subtracting the decoded base layer frame from the original frame and applying a positive offset, and encodes the difference in the same manner as the base layer encoder.
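
The offset residual described above can be sketched as follows. This is only an illustration: the offset value of 128, the 8-bit sample range, the clipping, and the function names are assumptions, and the actual coding of the residual is omitted. The reconstruction helper mirrors the decoder-side step recited in a later claim (subtract the offset, then add the residual to the decoded base layer frame).

```python
OFFSET = 128  # assumed positive offset for 8-bit samples


def clip8(v):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, v))


def snr_residual(original, decoded_base, offset=OFFSET):
    """Enhancement layer input: (original - decoded base) + offset."""
    return [clip8(o - b + offset) for o, b in zip(original, decoded_base)]


def snr_reconstruct(decoded_base, decoded_residual, offset=OFFSET):
    """Decoder side: subtract the offset and add the residual to the base."""
    return [clip8(b + r - offset) for b, r in zip(decoded_base, decoded_residual)]
```

The positive offset moves the signed residual into the non-negative sample range, so it can be fed to an encoder of the same kind as the base layer encoder.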

The terminal of claim 11, wherein the SNR enhancement layer encoder is further configured to use a prediction path that is different from the prediction path for the base layer or a lower enhancement layer.

The terminal of claim 12, wherein the SNR enhancement layer encoder is further configured to omit the DC component of the discrete cosine transform (DCT) coefficients when encoding a video frame for the SNR quality enhancement layer.

The terminal according to claim 12, wherein the SNR enhancement layer encoder is further configured to quantize the DC and surrounding AC DCT coefficients at a coarser level than the remaining DCT coefficients when encoding a video frame for the SNR quality enhancement layer.
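
One way to picture the position-dependent quantization recited above is a per-coefficient step table that is coarse near DC and fine elsewhere. In the sketch below, the block size, the step sizes, the rule that a coefficient at position (u, v) counts as "surrounding" DC when u + v <= `radius`, and the function names are all assumptions made for illustration.

```python
def step_matrix(size=8, fine_step=4, coarse_step=16, radius=1):
    """Quantizer step per coefficient position: coarse near DC, fine elsewhere."""
    return [
        [coarse_step if u + v <= radius else fine_step for v in range(size)]
        for u in range(size)
    ]


def quantize_block(coeffs, steps):
    """Quantize a block of DCT coefficients with position-dependent steps."""
    return [
        [round(c / s) for c, s in zip(row_c, row_s)]
        for row_c, row_s in zip(coeffs, steps)
    ]
```

Under this assumption the low-frequency content, which the base layer already represents well, consumes few enhancement bits, leaving more of the budget for the remaining coefficients.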

The terminal according to claim 11, wherein an SNR quality scalability layer decoder at the receiving endpoint is configured to display the decoded base layer frame at a desired reduced resolution by applying low-pass filtering after decoding and by downsampling.

The terminal according to claim 11, wherein the SNR quality scalability enhancement layer codec comprises an H.264 SVC FGS codec with threading.

The terminal of claim 17, wherein the spatial scalability layer codec includes an H.264 SVC FGS codec configured to use a weighted average of a previous enhancement layer picture and a current base layer picture in motion-compensated prediction, and wherein the weighting changes dynamically to include a value of zero, which terminates the prediction chain, thereby eliminating drift.
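
The weighted-average prediction recited above can be illustrated with a short sketch. The function name, the convention that `w` is the weight on the previous enhancement layer picture, and the per-sample list representation are assumptions; motion compensation itself is omitted.

```python
def weighted_prediction(prev_enh, cur_base, w):
    """Blend the previous enhancement picture with the current base picture.

    w is the weight on the previous enhancement layer picture. With w = 0
    the prediction depends on the base layer only, so any error accumulated
    in the enhancement layer prediction chain is discarded.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [w * e + (1.0 - w) * b for e, b in zip(prev_enh, cur_base)]
```

Setting the weight to zero from time to time acts as a resynchronization point: a receiver whose enhancement layer reference has drifted recovers as soon as such a picture is decoded.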

The terminal according to claim 11, wherein the SNR quality scalability layer encoder is configured to encode the difference between the DCT coefficients before and after quantization by requantizing the difference and applying entropy coding to the requantized difference.
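
The requantization step recited above can be sketched for a single DCT coefficient as follows. The step sizes and function names are hypothetical, and the entropy coding stage is left out; the sketch only shows how the difference between the coefficient before and after base quantization is quantized again with a finer step.

```python
def quantize(value, step):
    """Uniform scalar quantizer returning an integer level."""
    return int(round(value / step))


def encode_coefficient(c, base_step=16.0, enh_step=4.0):
    """Return (base level, enhancement level) for one DCT coefficient."""
    base_level = quantize(c, base_step)
    # Difference between the coefficient before and after base quantization.
    diff = c - base_level * base_step
    # Requantize the difference with a finer step (entropy coding omitted).
    enh_level = quantize(diff, enh_step)
    return base_level, enh_level


def decode_coefficient(base_level, enh_level, base_step=16.0, enh_step=4.0):
    """Reconstruct the coefficient; pass enh_level = 0 if the layer was lost."""
    return base_level * base_step + enh_level * enh_step
```

With these assumed steps, a receiver that gets only the base level reconstructs the coefficient to within half the base step, and one that also gets the enhancement level reconstructs it to within half the finer step.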

The terminal of claim 5, wherein the temporal prediction codec further comprises a spatial scalability layer encoder configured to low-pass filter and downsample the original input signal, wherein the resulting lower resolution may be different from the intended display resolution, and the downsampled signal can be used as an input to the base layer encoder.

21. The terminal of claim 20, wherein the spatial scalability layer encoder is configured such that the prediction path for an enhancement layer is different from the prediction path for the base layer or a lower enhancement layer.

The terminal of claim 20, wherein the spatial scalability layer encoder is configured to:
upsample the decoded low resolution signal to the original input signal resolution;
subtract the original input signal from the upsampled and decoded low resolution signal to obtain a difference signal; and
apply an offset to the difference signal and encode the offset difference signal.

23. The terminal of claim 22, wherein the spatial scalability layer encoder is configured to quantize DC and surrounding DCT AC coefficients more coarsely than the remaining DCT AC coefficients.

The terminal of claim 20, wherein the spatial scalability layer encoder is configured to use bi-predictive coding when predicting a high-resolution video frame, wherein the first reference picture is a decoded past highest-resolution picture, and the second reference picture is obtained by first encoding and decoding the downsampled base layer signal and then upsampling it to the original resolution.

The terminal according to claim 24, wherein the spatial scalability layer encoder comprises an H.264 AVC encoder using bi-predictive coding, the upsampled and decoded base layer frame is inserted as an additional reference frame, and the temporal and spatial direct modes of motion vector prediction are used to increase compression efficiency.

The terminal of claim 5, comprising a base layer encoder and further comprising at least one of an SNR quality layer encoder, a spatial scalability layer encoder, and a temporal enhancement layer encoder.

An endpoint terminal for video communication with other endpoints via one or more servers located in an electronic communication network, wherein the network provides channels of different quality of service and bandwidth linking the endpoints, the channels including a designated HRC, the terminal comprising:
a scalable video decoder configured to scalably decode one or more video signals in a base layer and enhancement layer format; and
a packet demultiplexer configured to demultiplex the layers of the video signal encoded in the base layer and enhancement layer format after they are received over the electronic communication network via a network interface controller.

28. The terminal of claim 27, wherein the decoder is comprised of an SNR quality scalability decoder.

The terminal according to claim 28, wherein the SNR quality scalability decoder is configured to display the decoded base layer frame at a desired reduced resolution by applying low-pass filtering after decoding and by downsampling.

The terminal according to claim 28, wherein the SNR quality scalability enhancement layer decoder is configured to add the decoded residual error carried by the enhancement layer data to the decoded base layer frame after subtracting a positive offset.

The terminal of claim 27, wherein the decoder further comprises a spatial scalability decoder.

The terminal of claim 31, wherein the spatial scalability layer decoder is configured to:
upsample the decoded base resolution signal to the enhancement layer resolution;
decode the offset difference signal carried by the enhancement layer; and
subtract the offset from the decoded enhancement layer signal and add the result to the upsampled and decoded base resolution signal.
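
The decoder steps recited above can be sketched on a one-dimensional signal as follows. Everything beyond the three claimed steps is an assumption: the nearest-neighbour 2x upsampler, the offset of 128, and the function names are hypothetical. The encoder-side helper is included only to produce input for the decoder, and it forms the difference as original minus upsampled base, the sign convention implied by the decoder's final addition.

```python
def upsample2(signal):
    """Nearest-neighbour upsampling by 2 (assumed filter, not fixed by the claim)."""
    out = []
    for s in signal:
        out.extend([s, s])
    return out


def encode_spatial_enhancement(original, decoded_base, offset=128):
    """Assumed encoder counterpart: offset difference against the upsampled base."""
    up = upsample2(decoded_base)
    return [o - u + offset for o, u in zip(original, up)]


def decode_spatial_enhancement(decoded_base, offset_diff, offset=128):
    """Upsample the base, remove the offset from the difference, and add."""
    up = upsample2(decoded_base)
    return [u + (d - offset) for u, d in zip(up, offset_diff)]
```

In this lossless toy setting the decoder returns the original samples exactly; with a real lossy enhancement codec the result would only approximate them.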

The spatial scalability layer decoder comprises an H.264 AVC decoder with two predictive coding support, and the upsampled and decoded base layer frame is inserted as an additional reference frame. The terminal described in.

The terminal of claim 5, wherein the scalable video encoder is configured to encode an input signal with two or more spatial and/or quality resolutions that can be transmitted simultaneously.

The terminal according to claim 3, wherein the scalable coding structure can be dynamically changed in any scalability dimension according to network conditions or the preferences of a receiving endpoint.

A method for communicating between multiple endpoints via an electronic communication network and one or more servers, wherein the network provides channels of different quality of service and bandwidth linking the multiple endpoints, the channels including a designated HRC, the method comprising:
scalably encoding a video signal in a base layer and enhancement layer format;
multiplexing the layers of the video signal for transmission over the electronic communication network; and
transmitting at least the base layer, from among the base layer and enhancement layers of the video signal, via the designated HRC.

The method according to claim 36, wherein the step of multiplexing the layers of the video signal for transmission over the electronic communication network further comprises multiplexing an audio signal for transmission over the electronic communication network.

A method for communicating video signal pictures scalably encoded in a base layer and enhancement layer format between multiple endpoints over an electronic communication network, the method comprising:
selecting a picture type (I, P, or B) for the picture currently being encoded, and prediction reference pictures from the decoded pictures stored in a frame memory; and
creating temporal scalability layers by performing picture prediction using threads.

The method of claim 38, further comprising creating one continuous prediction chain path for the base layer.

The method of claim 38, wherein the threads are picture threads including a base layer thread that includes pictures that are a number of pictures apart, the method further comprising performing temporal prediction within the base layer thread using one or more preceding pictures of that thread; and
wherein a temporal enhancement layer thread includes the remaining pictures, the method further comprising performing prediction of each enhancement layer picture using one or more preceding base layer pictures or one or more preceding temporal enhancement layer pictures.

The method of claim 38, wherein the threads are picture threads including a base layer thread that includes pictures that are a number of pictures apart, the method further comprising performing temporal prediction using the frame immediately preceding each picture;
wherein a first temporal enhancement layer thread includes the frames midway between the frames of the base layer thread, the method further comprising performing prediction from the immediately preceding base layer frame or the immediately preceding first temporal enhancement layer thread picture; and
wherein a second temporal enhancement layer thread includes the remaining frames, the method further comprising performing temporal prediction using the immediately preceding second temporal enhancement layer thread picture, the immediately preceding first temporal enhancement layer thread picture, or the immediately preceding base layer thread picture.

The method of claim 38, further comprising encoding the base temporal layer frames with quantization finer than that used for the other temporal layers, whereby the base layer is encoded more accurately than the other layers.

The method of claim 38, further comprising creating at least one prediction chain that terminates at the enhancement layer.

The method of claim 38, wherein the step of scalably encoding a temporal scalability layer by performing picture prediction using a thread further comprises encoding an SNR quality scalability enhancement layer.

The method of claim 44, wherein the step of encoding an SNR quality scalability enhancement layer comprises applying a positive offset to the residual coding error of the base layer, obtained by subtracting the decoded base layer frame from the original frame, and encoding the difference in the same manner as the step of encoding the base layer.

The method of claim 44, wherein the step of encoding an SNR quality scalability enhancement layer comprises using a prediction path that is different from the prediction path for the base layer or a lower enhancement layer.

The method of claim 45, wherein the step of encoding an SNR quality scalability enhancement layer includes omitting the DC component of the discrete cosine transform (DCT) coefficients when encoding a picture for the SNR quality scalability enhancement layer.

The method of claim 45, wherein the step of encoding an SNR quality scalability enhancement layer comprises, when encoding a video frame for the SNR quality scalability enhancement layer, quantizing the DC and surrounding AC DCT coefficients at a coarser level than the remaining DCT coefficients.

The method of claim 44, wherein the step of encoding an SNR quality scalability enhancement layer further comprises displaying, at the receiving endpoint, the decoded base layer frame at a desired reduced resolution by applying low-pass filtering after decoding and by downsampling.

The method of claim 44, wherein the step of encoding an SNR quality scalability enhancement layer further comprises using an H.264 SVC FGS codec with threading.

The method of claim 50, further comprising using an H.264 SVC FGS codec configured to use a weighted average of the previous enhancement layer picture and the current base layer picture in motion-compensated prediction, wherein the weighting changes dynamically to include a value of zero, at which the prediction chain ends, thereby eliminating drift.

The method of claim 44, wherein the step of encoding the SNR quality scalability enhancement layer comprises encoding the difference between the DCT coefficients before and after quantization by requantizing the difference and applying entropy coding to the requantized difference.

The method of claim 38, wherein the step of encoding the temporal scalability layer further comprises encoding a spatial scalability layer by applying low-pass filtering to the original input signal and downsampling it, wherein the resulting lower resolution may be different from the intended display resolution, and encoding the downsampled signal in the same manner as the base layer.

54. The method of claim 53, wherein the step of encoding a spatial scalability layer comprises using a prediction path for an enhancement layer that is different from the prediction path for the base layer or a lower enhancement layer.

The method of claim 53, wherein said step of encoding a spatial scalability layer comprises:
upsampling the decoded low resolution signal to the original input signal resolution;
subtracting the original input signal from the upsampled and decoded low resolution signal to obtain a difference signal;
applying an offset to the difference signal; and
encoding the offset difference signal.

The method of claim 55, wherein said step of encoding a spatial scalability layer comprises quantizing the DC and surrounding AC DCT coefficients more coarsely than the remaining AC DCT coefficients.

The method of claim 53, wherein said step of encoding a spatial scalability layer comprises using bi-predictive coding when predicting a high-resolution video frame, wherein a first reference picture is a decoded past highest-resolution picture, and a second reference picture is obtained by first encoding and decoding the downsampled base layer signal and then upsampling it to the original resolution.

The method of claim 57, wherein said step of encoding a spatial scalability layer comprises using an H.264 AVC encoder with bi-predictive coding, wherein the upsampled and decoded base layer frame is inserted as an additional reference frame, and the temporal and spatial direct modes of motion vector prediction are used to increase compression efficiency.

The method of claim 38, comprising using a base layer encoder, and further comprising using at least one of an SNR quality layer encoder, a spatial scalability layer encoder, and a temporal enhancement layer encoder.

A method for communicating video signal pictures scalably encoded in a base layer and enhancement layer format between multiple endpoints and one or more servers via an electronic communication network, the method comprising the steps of using:
a scalable video decoder configured to scalably decode one or more video signals in a base layer and enhancement layer format; and
a packet demultiplexer configured to demultiplex the layers of the video signal encoded in the base layer and enhancement layer format after they are received over the electronic communication network via a network interface controller.

The method of claim 60, wherein the decoder comprises an SNR quality scalability decoder.

The method of claim 61, further comprising using the SNR quality scalability decoder to display the decoded base layer frame at a desired reduced resolution by applying low-pass filtering after decoding and by downsampling.

The method according to claim 61, further comprising using the SNR quality scalability decoder to add the decoded residual error carried by the enhancement layer data to the decoded base layer frame after subtracting a positive offset.

64. The method of claim 60, wherein the decoder further comprises a spatial scalability decoder.

Up-samples the decoded base resolution signal to the enhancement layer resolution;
Decoding the offset difference signal carried by the enhancement layer;
The method further comprising: subtracting an offset from the decoded enhancement layer signal and using the spatial scalability decoder to add the result to the upsampled and decoded base resolution signal. the method of.

The method according to claim 64, wherein the spatial scalability layer decoder comprises an H.264 AVC decoder with bi-predictive coding support, the method further comprising inserting an upsampled and decoded base layer frame as an additional reference frame.

The method of claim 38, wherein the step of scalably encoding a video signal comprises encoding the signal with two or more spatial and/or quality resolutions that can be transmitted simultaneously.

The method of claim 38, wherein the scalable coding structure can be dynamically changed in any scalability dimension depending on network conditions or preferences indicated by a receiving endpoint.

A method for video communication between a plurality of endpoints via an electronic communication network and one or more servers, wherein the network provides channels of different quality of service and bandwidth linking the plurality of endpoints and the servers, the channels including a designated high reliability channel (HRC) and a low reliability channel (LRC), the method comprising:
placing transmitting and receiving terminals at the endpoints;
configuring at least one transmitting terminal to create at least one scalably encoded video signal in a base layer and enhancement layer format for transmission to other terminals, and to transmit at least the base layer via the designated HRC;
configuring at least one receiving terminal to decode the scalably encoded video signal layers received via network channels including the designated HRC, and to reconstruct the video for local use by combining the decoded video signal layers; and
configuring the servers to arbitrate the transfer of the scalably encoded video signal layers transmitted by the transmitting terminal toward the receiving terminal via the electronic communication network channels leading to the receiving terminal.

The method of claim 69, wherein the step of configuring at least one transmitting terminal further comprises configuring the terminal to access at least one of: a raw video signal for encoding and transmission, a stored video signal for encoding and transmission, a synthesized video signal for encoding and transmission, and a pre-encoded video signal for transmission.

71. A computer readable medium comprising a set of instructions for performing the steps set forth in at least one of claims 36 to 70.