JP4921488B2

JP4921488B2 - System and method for videoconferencing using scalable video coding and compositing scalable video conferencing servers

Info

Publication number: JP4921488B2
Application number: JP2008547785A
Authority: JP
Inventors: Eleftheriadis, Alexandros; Shapiro, Ofer; Wiegand, Thomas; Chakareski, Jacob
Original assignee: Vidyo, Inc.
Priority date: 2005-12-22
Filing date: 2006-12-22
Publication date: 2012-04-25
Anticipated expiration: 2026-12-22
Also published as: CA2633366C; AU2006330457B2; EP1985116A2; WO2007076486A2; CN101341746B; JP2009521880A; AU2006330457A1; WO2007076486A3; CN101341746A; CA2633366A1; EP1985116A4

Description

(Cross-Reference to Related Applications)
This application claims the benefit of US Provisional Patent Application No. 60/753,343, filed December 22, 2005. In addition, this application is related to International Patent Application Nos. PCT/US06/28365, PCT/US06/028366, PCT/US06/028367, PCT/US06/027368, and PCT/US06/061815, and to US Provisional Patent Application Nos. 60/778,760, 60/787,031, 60/774,094, and 60/827,469. All of the foregoing priority and related applications are incorporated herein by reference in their entirety.

The present invention relates to multimedia technology and telecommunications. In particular, the invention relates to communicating or distributing audio and video data for person-to-person and multiparty conferencing applications. More specifically, the invention is directed to implementations of person-to-person or multiparty conferencing applications in which some participants may only be able to support reception of a video bitstream corresponding to a single picture, coded using scalable video coding techniques. The invention is also directed to implementing such systems over communication network connections that can provide different levels of quality of service (QoS), and in environments in which end users may access the conferencing application using devices and communication channels of differing capabilities.

Videoconferencing systems allow two or more remote participants/endpoints to communicate video and audio with each other in real time. When only two remote participants are involved, direct transmission of communications over a suitable electronic network between the two endpoints can be used. When more than two participants/endpoints are involved, a Multipoint Conferencing Unit (MCU), or bridge, is commonly used to connect all the participants/endpoints. The MCU mediates communications between the multiple participants/endpoints, which may be connected, for example, in a star configuration. Note that even when only two participants are involved, it may still be advantageous to use an MCU between them.

For a videoconference, the participants/endpoints or terminals are equipped with suitable encoding and decoding devices. An encoder formats the local audio and video output at a transmitting endpoint into a coded form suitable for signal transmission over the electronic network. A decoder, conversely, processes a received signal carrying encoded audio and video information into a decoded form suitable for audio playback or image display at a receiving endpoint.

Traditionally, an end user's own image is also displayed on his or her screen to provide feedback (e.g., to ensure proper positioning of the person within the video window).

In practical videoconferencing system implementations over communication networks, the quality of an interactive videoconference between remote participants is determined by end-to-end signal delays. End-to-end delays greater than 200 ms prevent realistic live or natural interaction between the conference participants. Such long end-to-end delays cause participants to unnaturally restrain themselves from actively participating or responding, in order to allow in-transit video and audio data from other participants to arrive at their endpoints.

The end-to-end signal delays include acquisition delays (e.g., the time it takes to fill a buffer in an A/D converter), coding delays, transmission delays (e.g., the time it takes to submit a packet of data to the network interface controller of an endpoint), and transport delays (the time it takes a packet to travel from endpoint to endpoint over the network). Additionally, signal processing time in mediating MCUs contributes to the total end-to-end delay in a given system.
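The delay components listed above simply add up, so a delay budget can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming illustrative millisecond values that do not come from this document; the `DelayBudget` class and its field names are hypothetical, and only the 200 ms limit is taken from the text.

```python
from dataclasses import dataclass

# Limit cited in the text: delays above 200 ms hinder natural interaction.
INTERACTIVE_LIMIT_MS = 200.0


@dataclass
class DelayBudget:
    """Hypothetical container for the delay components named in the text."""
    acquisition_ms: float        # e.g. filling the A/D converter buffer
    coding_ms: float             # encoder plus decoder processing
    transmission_ms: float       # handing packets to the network interface
    transport_ms: float          # travel time across the network
    mcu_processing_ms: float = 0.0  # extra processing in a mediating MCU

    def total_ms(self) -> float:
        """End-to-end delay is the sum of all components."""
        return (self.acquisition_ms + self.coding_ms + self.transmission_ms
                + self.transport_ms + self.mcu_processing_ms)

    def is_interactive(self) -> bool:
        """True when the total does not exceed the 200 ms limit."""
        return self.total_ms() <= INTERACTIVE_LIMIT_MS


# Illustrative numbers only (assumptions, not measurements).
direct = DelayBudget(33.0, 40.0, 5.0, 80.0)
via_mcu = DelayBudget(33.0, 40.0, 5.0, 80.0, mcu_processing_ms=60.0)
```

With these assumed figures, the same path stays within the limit when it is direct and exceeds it once a transcoding MCU adds its own processing time.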

A primary task of an MCU is to mix the incoming audio signals so that a single audio stream is transmitted to all participants, and to mix the video frames or pictures transmitted by the individual participants/endpoints into a common composite video frame stream that includes a picture of each participant. Note that the terms frame and picture are used interchangeably herein and that, as will be apparent to those skilled in the art, coding of interlaced frames as individual fields or as combined frames (field-based or frame-based picture coding) can also be incorporated. The MCUs deployed in conventional communication network systems offer only a single common resolution (e.g., CIF or QCIF resolution) for all of the individual pictures mixed into the common composite video frame distributed to all participants in a videoconferencing session. Thus, conventional communication network systems cannot readily provide customized videoconferencing functionality in which a participant can view other participants at different resolutions. Such customized functionality could, for example, let a participant view one specific participant (e.g., the one speaking) at CIF resolution and the other, silent participants at QCIF resolution. MCUs in a network can be configured to provide such customized functionality by repeating the video mixing operation as many times as there are participants in the videoconference. In such configurations, however, the MCU operations introduce considerable end-to-end delay. Further, the MCU must have sufficient digital signal processing capability to decode multiple audio streams, mix them, and re-encode them, and also to decode multiple video streams, composite them into a single frame (with appropriate scaling as needed), and re-encode them into a single stream. Videoconferencing solutions (such as the systems marketed by Polycom Inc., 4750 Willow Road, Pleasanton, CA 94588, and Tandberg, 200 Park Avenue, New York, NY 10166) must use dedicated hardware components to provide acceptable quality and performance levels.

Conventional video codecs whose bitstreams and decoding operations are standardized in ITU-T Recommendation H.261; ITU-T Recommendation H.262 | ISO/IEC 13818-2 (MPEG-2 Video) Main profile; ITU-T Recommendation H.263 baseline profile; ISO/IEC 11172-2 (MPEG-1 Video); ISO/IEC 14496-2 Simple profile or Advanced Simple profile; and ITU-T Recommendation H.264 | ISO/IEC 14496-10 (MPEG-4 AVC) Baseline, Main, or High profile are specified to provide a single bitstream at a given spatial resolution and bit rate. Consequently, when an encoded video signal is needed at a lower spatial resolution or a lower bit rate than the one at which it was originally encoded, the full-resolution signal must be received and decoded, possibly downscaled, and re-encoded at the desired spatial resolution and bit rate. This process of decoding, possibly downsampling, and re-encoding requires significant computational resources, and typically adds significant subjective distortion to the video signal and delay to the video transmission.

Further, the standard video codecs for video communications are based on "single-layer" coding techniques, which are inherently incapable of exploiting the differentiated QoS capabilities provided by modern communication networks. A further limitation of single-layer coding techniques for video communications is that even when an application requires or prefers a lower spatial resolution display, the full-resolution signal must be received and decoded, with downscaling performed at the receiving endpoint or MCU. This wastes bandwidth and computational resources.

In contrast to the aforementioned single-layer video codecs, "scalable" video codecs based on "multi-layer" coding techniques generate two or more bitstreams for a given source video signal: a base layer and one or more enhancement layers. The base layer may be a basic representation of the source signal at a minimum quality level. The minimum quality representation may be reduced in the quality (i.e., signal-to-noise ratio ("SNR")), spatial, or temporal resolution aspects of the given source video signal, or in a combination of these aspects. The one or more enhancement layers carry information for increasing the quality of the SNR, spatial, or temporal resolution aspects of the base layer. Scalable video codecs have been developed with heterogeneous network environments and/or heterogeneous receivers in mind.

Scalable coding has been part of standards such as the SNR Scalable, Spatially Scalable, and High profiles of ITU-T Recommendation H.262 | ISO/IEC 13818-2 (MPEG-2 Video). However, practical use of such "scalable" video codecs in videoconferencing applications has been hampered by the increased cost and complexity associated with scalable coding, and by the lack of widespread availability of high-bandwidth IP-based communication channels suitable for video.

Co-pending and commonly assigned International Patent Application No. PCT/US06/02836, incorporated by reference herein, describes practical scalable video coding techniques specifically addressing videoconferencing applications. Further, co-pending and commonly assigned International Patent Application No. PCT/US06/02835, incorporated by reference herein, describes a conferencing server architecture designed to exploit and benefit from the features of scalable video coding techniques for videoconferencing applications. Co-pending and commonly assigned International Patent Application No. PCT/US06/061815, incorporated by reference herein, describes techniques for providing error resilience, layer switching, and random access capabilities in such a conferencing server architecture.

Currently, an extension of the ITU-T Recommendation H.264 | ISO/IEC 14496-10 standard that offers a more efficient trade-off than previously standardized scalable video codecs is under consideration (Annex G, Scalable Video Coding - SVC). Further developments in video coding research and standardization concern error resilience and video mixing in MCUs, namely the concept of multiple slice groups for compositing multiple input videos into one output video. (See S. Wenger and M. Horowitz, "Scattered Slices: A New Error Resilience Tool for H.26L," JVT-B027, a document of the Joint Video Team (JVT) of ISO/IEC JTC 1/SC 29/WG 11 and ITU-T SG16/Q.6.) When all input video signals are coded using ITU-T Recommendation H.264 | ISO/IEC 14496-10, the various input signals can be placed in the MCU's output picture as separate slice groups, so decoding and re-encoding may not be required in the MCU. (See M. M. Hannuksela and Y. K. Wang, "Coding of Parameter Sets," JVT-C078, a document of the Joint Video Team (JVT) of ISO/IEC JTC 1/SC 29/WG 11 and ITU-T SG16/Q.6.)

Consideration is now being given to improving conferencing server or MCU architectures for videoconferencing applications. In particular, attention is directed to developing server architectures that composite one or more input video signals, together with possible server-generated data, into a single output video signal using coded-domain compositing techniques such as multiple slice groups. A desirable conferencing server architecture will support desirable videoconferencing features such as continuous presence, personal view or layout, rate matching, error resilience, and random entry, while avoiding the complexity and delay overhead of conventional MCUs.

Systems and methods for conducting videoconferences are provided. Each videoconference participant transmits a coded data bitstream to a conferencing bridge MCU or server. The coded data bitstream can be a single-layer bitstream, or a scalable video coding (SVC) data bitstream and/or a scalable audio coding (SAC) data bitstream from which multiple qualities can be derived. The MCU or server (referred to hereinafter as the "Compositing Scalable Video Coding Server" (CSVCS)) is configured to compose the input video signals from transmitting conference participants into a single composite output video signal that is forwarded to a receiving participant. The CSVCS is, in particular, configured to compose the output video signal pictures without decoding, rescaling, and re-encoding the input signals, thereby introducing little or no end-to-end delay. This "zero-delay" architecture of the CSVCS advantageously allows its use in cascaded configurations. The composited output bitstream of the CSVCS can be decoded by a single decoder.

In a videoconferencing application, each participant transmits a scalable data bitstream having multiple layers (e.g., a base layer and one or more enhancement layers, coded using SVC) to the CSVCS over a corresponding number of physical or virtual channels. Some participants may also transmit single-layer bitstreams. The CSVCS may select portions of the scalable bitstream from each participant according to requirements based on the properties and/or settings of a particular receiving participant. The selection may be based, for example, on the particular receiving participant's bandwidth and desired video resolution.
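The per-receiver selection described above can be pictured as choosing how many layers of one sender's stream to forward. The sketch below is an assumption about one simple policy, not the selection logic of the CSVCS itself: the `Layer` fields, the `select_layers` function, and the resolutions and bit rates in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    """Hypothetical description of one layer of a sender's scalable stream."""
    layer_id: int  # 0 = base layer, higher values = enhancement layers
    width: int     # picture width reached when decoding up to this layer
    height: int    # picture height reached when decoding up to this layer
    kbps: int      # bit rate added by this layer alone


def select_layers(layers: List[Layer], max_kbps: int,
                  max_width: int, max_height: int) -> List[Layer]:
    """Pick the base layer plus as many enhancement layers as the receiver's
    bandwidth and desired resolution allow.

    Enhancement layers depend on the layers below them, so selection stops at
    the first layer that does not fit. An empty list means that not even the
    base layer fits the receiver's limits.
    """
    chosen: List[Layer] = []
    used_kbps = 0
    for layer in sorted(layers, key=lambda item: item.layer_id):
        fits_rate = used_kbps + layer.kbps <= max_kbps
        fits_size = layer.width <= max_width and layer.height <= max_height
        if not (fits_rate and fits_size):
            break
        chosen.append(layer)
        used_kbps += layer.kbps
    return chosen


# Illustrative stream: QCIF base layer with CIF and 4CIF enhancement layers.
stream = [
    Layer(0, 176, 144, 64),
    Layer(1, 352, 288, 192),
    Layer(2, 704, 576, 512),
]
cif_receiver = select_layers(stream, max_kbps=300, max_width=352, max_height=288)
thumbnail_receiver = select_layers(stream, max_kbps=1000, max_width=176, max_height=144)
```

In this example the first receiver gets the base layer and one enhancement layer, while the second, which only wants a thumbnail, gets the base layer even though its bandwidth would allow more.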

The CSVCS composes the selected input scalable bitstream portions into one (or more) output video bitstreams that can be decoded by one (or more) decoders. When SVC is used for the output video bitstream, the compositing is achieved by assigning each input video signal to a slice of a different slice group of the output video signal, together with possible generation of supplemental layer data so that the output stream is a valid SVC bitstream. The CSVCS is configured to generate the composite output video signal with no signal processing or with minimal signal processing. The CSVCS may, for example, be configured to read the packet headers of the incoming data so that it can selectively multiplex the appropriate packets into the access units of the output bitstream to compose the output signal, and then to transmit the composed output signal, together with any generated layer data, to each of the participants.
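The header-only multiplexing described above can be sketched as a filter followed by a sort. This is a simplified model under stated assumptions: the `Packet` fields stand in for whatever the real packet headers carry, `compose_access_unit` is a hypothetical name, and the sketch omits the supplemental layer data and parameter-set handling that a valid SVC output stream would need.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet model; only the header fields are inspected."""
    sender: str
    frame_no: int
    layer_id: int
    payload: bytes  # coded slice data, forwarded untouched


def compose_access_unit(packets: List[Packet], frame_no: int,
                        wanted_layers: Dict[str, int],
                        slice_group_of: Dict[str, int]) -> List[Tuple[int, Packet]]:
    """Build one output access unit for a single receiver.

    wanted_layers maps each sender shown to this receiver to the highest
    layer to forward; slice_group_of maps the same senders to the slice group
    (picture region) they occupy in the output layout. Both maps are assumed
    to have the same keys. Payloads are never decoded or re-encoded.
    """
    selected: List[Tuple[int, Packet]] = []
    for packet in packets:
        if packet.frame_no != frame_no:
            continue  # belongs to another access unit
        highest = wanted_layers.get(packet.sender)
        if highest is None or packet.layer_id > highest:
            continue  # sender not displayed, or layer not requested
        selected.append((slice_group_of[packet.sender], packet))
    # Order by slice group, then by layer, so lower layers come first.
    selected.sort(key=lambda item: (item[0], item[1].layer_id))
    return selected


incoming = [
    Packet("B", 7, 0, b"b0"), Packet("A", 7, 1, b"a1"),
    Packet("A", 7, 0, b"a0"), Packet("B", 7, 1, b"b1"),
    Packet("C", 7, 0, b"c0"), Packet("A", 8, 0, b"next"),
]
unit = compose_access_unit(incoming, frame_no=7,
                           wanted_layers={"A": 1, "B": 0},
                           slice_group_of={"A": 0, "B": 1})
```

Here sender A is forwarded at two layers in slice group 0, sender B at its base layer in slice group 1, and sender C is dropped because this receiver's layout does not include it.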

In a videoconferencing context, the input video signal content may or may not be sufficient at a given instant in time to cover all areas of a picture in the output bitstream. The insufficiency may be due, for example, to different temporal resolutions of the input video signals, to shifts between the temporal sampling of the input video signals, and to incomplete filling of the output video signal. The CSVCS may be configured to remedy the problem of insufficient picture area coverage by generating a higher temporal resolution of the output video signal, in order to minimize end-to-end delay or other problems caused by late-arriving input video signals. For example, the CSVCS may be configured to insert pre-coded slices, retrieved from an accessible storage medium, for those parts of the output video signal for which input video signal content is not present or not available. The pre-coded slices may consist of headers and coded slice data that are computed, or pre-computed, by the CSVCS according to the particular layout of the output picture. Alternatively, the CSVCS may handle the input video signals at a higher temporal resolution by inserting coded picture data that instructs the receiving endpoint to simply copy a previously coded picture. Note that such coded picture data is extremely short, on the order of a few bytes.
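The gap-filling behavior described above amounts to a lookup with a fallback. The following minimal sketch assumes that each region of the output layout is identified by a slice group id and that a table of pre-coded slices already exists for that layout; the function name and the byte strings are placeholders, not real coded slice data.

```python
from typing import Dict, List


def fill_output_picture(layout: List[int],
                        arrived: Dict[int, bytes],
                        precoded: Dict[int, bytes]) -> Dict[int, bytes]:
    """Return coded data for every slice group in the output layout.

    Slice groups whose input arrived in time use that input unchanged; the
    others fall back to the pre-coded slice stored for that region. The
    precoded table is assumed to hold an entry for every slice group.
    """
    picture: Dict[int, bytes] = {}
    for group_id in layout:
        if group_id in arrived:
            picture[group_id] = arrived[group_id]
        else:
            picture[group_id] = precoded[group_id]
    return picture


# Placeholder payloads: slice group 1 has no input for this output picture.
stored = {0: b"fill-0", 1: b"fill-1", 2: b"fill-2"}
picture = fill_output_picture([0, 1, 2], {0: b"input-0", 2: b"input-2"}, stored)
```

Because the fallback slices are prepared ahead of time, covering a missing region costs a dictionary lookup rather than any encoding work at composition time.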

An exemplary embodiment of a videoconferencing system in accordance with the present invention may include communication network connections on which differentiated Quality of Service (QoS) is provided (i.e., which provide a high-reliability communication channel for some portion of the required total bandwidth), video codecs, a CSVCS, and end-user terminals. The video codec for a transmitting participant is either a single-layer video codec or a scalable video codec that offers scalability in terms of temporal, quality, or spatial resolution at different transmission bandwidth levels. The video codec for at least one of the receiving participants supports scalable video decoding. The end-user terminals used by the transmitting and receiving participants may be either dedicated hardware systems or general-purpose PCs capable of running multiple instances of a video decoder and at least one instance of a video encoder. An exemplary system implementation may combine the functionality of conventional MCUs and/or other conferencing servers (such as the SVCS described in PCT/US06/28366) with that of the CSVCS described herein. In such a combined system, the MCU, SVCS, and CSVCS functions may be used selectively, individually or in combination, to serve different portions or entities of a videoconferencing session.

The functionality of a CSVCS can complement that of an SVCS. A CSVCS may be configured to have some or all of the functions and advantages of an SVCS. The CSVCS differs from the SVCS, however, at least in that instead of sending multiple SVC streams to each endpoint, as the SVCS does, it encapsulates or composes the individual streams into a single output SVC stream in which each individual stream is assigned to a different slice group. The CSVCS can then be considered, for all purposes, an SVCS whose output stage additionally performs slice-group-based assignment, together with generation of any additional layer data that may be needed to ensure that the output bitstream is compliant. Note that all SVCS functions (e.g., rate matching, personalized layout, error resilience, random access and layer switching, rate control) can therefore be supported by a CSVCS, and that the number of packets transmitted from a CSVCS is nearly the same as the number that would be transmitted from an SVCS in the same conference setup.

Further features of the invention, its nature, and its various advantages will become more apparent from the following detailed description of the preferred embodiments and the accompanying drawings.

The present invention provides systems and methods for implementing videoconferencing systems that use scalable video coding with servers that provide compositing of pictures in the coded domain. The systems and methods deliver video and audio data that is encoded by transmitting videoconference participants using either single-layer coding or scalable coding techniques. Scalable video coding techniques encode the source data into a number of different bitstreams (e.g., base layer and enhancement layer bitstreams), which provide representations of the original signal at various temporal resolutions, quality resolutions (i.e., in terms of SNR), and, in the case of video, spatial resolutions. Receiving participants are able to decode bitstreams that are encoded using scalable video coding techniques and that include multiple slice group features for the various input signals.

Multiple servers may be present in the communication path between a transmitting participant or endpoint and a receiving participant or endpoint. In such a case, at least the last server in the path composes the incoming video pictures from the transmitting participants into a single composite output picture coded using scalable video coding techniques, and transmits the composite output picture to the receiving participant. Importantly, the composition process at or by the server does not require decoding and re-encoding of the picture data received from the transmitting participants, but it may require generation of additional layer data to ensure that the output bitstream conforms to the requirements of a scalable video decoder.

For reference, and to aid understanding of the invention, for the embodiments of the present invention described herein (hereinafter the "SVC embodiments"), the base layer bitstream is assumed to conform to ITU-T Recommendation H.264 | ISO/IEC 14496-10 (MPEG4-AVC), as specified in ITU-T and ISO/IEC JTC 1, "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG4-AVC). Further, the enhancement layer bitstreams are assumed to conform to the scalable extension of ITU-T Recommendation H.264 | ISO/IEC 14496-10 (MPEG4-AVC) (Annex G, Scalable Video Coding, hereinafter "SVC"). Use of SVC codecs may be beneficial, for example, when varying picture sizes of the input video signals are required to be present in the output video picture of the MCU. Note that the H.264 AVC and SVC standards are distinct: SVC is a separate Annex of H.264 that appears in the 2007 edition of H.264. For the embodiments described herein, H.264 AVC is used for the scalable codec base layer, while H.264 SVC is used for the scalable codec enhancement layers. For convenience of description, however, the scalable video codecs used for the base layer (H.264 AVC) and for the enhancement layers (H.264 SVC) may be collectively referred to herein as the "SVC" codec. Note further that although H.264 AVC is considered a single-layer codec, it provides scalability in the temporal dimension. It will also be understood that the use of the H.264 AVC and H.264 SVC codecs in the described embodiments is only exemplary, and that other codecs suitable for compositing pictures may be used instead in accordance with the principles of the invention.

FIG. 1 shows an exemplary system 100 for compositing pictures, which can be implemented in an electronic or computer network environment for multipoint and point-to-point conferencing applications. System 100 uses one or more network servers (e.g., a compositing scalable video conference server (CSVCS) 110) to coordinate the delivery of customized data to conference participants or clients 120, 130, and 140. CSVCS 110 can, for example, coordinate the delivery of a video stream generated by endpoint 140 for transmission to the other conference participants. In system 100, video stream 150 is first suitably coded or scaled down into multiple data components or layers using SVC techniques. The data layers can have different characteristics or features (e.g., spatial resolution, frame rate, picture quality, signal-to-noise ratio (SNR)). These characteristics or features can be selected appropriately in consideration of, for example, varying individual user requirements and infrastructure specifications in the electronic network environment (e.g., CPU capacity, display size, user preferences, and bit rates).

CSVCS 110 can have scalable video signal processing functions similar to those of the scalable video conference server (SVCS) and the scalable audio conference server (SACS) described in International Patent Application No. PCT/US06/028366. However, CSVCS 110 is further configured, in particular, to use the H.264 AVC and H.264 SVC codecs to composite multiple input video signals into one output video signal using multiple slice groups.

In system 100, each of clients 120, 130, and 140 can use a terminal suitable for interactive conferencing. A terminal can include human interface input/output devices (e.g., a camera, a microphone, a video display, and a speaker) and other signal processing components such as an encoder, a decoder, a multiplexer (MUX), and a demultiplexer (DEMUX).

Further, as described in co-pending International Patent Application No. PCT/US06/028366, in an exemplary terminal the camera and microphone are designed to capture the participant's video and audio signals, respectively, for transmission to the other conference participants. Conversely, the video display and speaker are designed to display and play back, respectively, the video and audio signals received from the other participants. The video display can also be configured to optionally show the participant's own video. The camera and microphone in a terminal can be coupled to analog-to-digital converters (AD/C), which are in turn coupled to their respective encoders. The encoders compress the local digital signals to minimize the bit rate needed to transmit them. The encoder output data can be "packetized" into RTP packets (e.g., by a packet MUX) for transmission over an IP-based network. The packet MUX can perform conventional multiplexing using the RTP protocol and can also perform any QoS-related protocol processing that is required. For example, as described in co-pending International Patent Application No. PCT/US06/061815, QoS support can be provided by positive and/or negative acknowledgments, together with marking of the packets that are essential for decoding at least the lowest temporal level of the base layer, so that those packets are delivered reliably. Each of the terminal's data streams can be transmitted on its own virtual channel, that is, its own port number in IP terminology.

In implementing the SVC embodiments of the invention, system 100 exploits the properties of multiple slice groups in compositing output pictures by using the AVC or SVC codec for the input bitstreams to the CSVCS and SVC for the output video bitstream from CSVCS 110. The audio signals in system 100, however, can be coded independently of the output picture compositing using any suitable technique known in the art, for example the techniques described in ITU-T Recommendation G.711 or ISO/IEC 11172-3 (MPEG-1 Audio).

FIG. 2 shows an exemplary output video picture 200 provided by CSVCS 110, which is a composite of multiple slice groups (e.g., slice groups 1, 2, 3, 4). The partitions or boundaries between the slice groups are shown in FIG. 2 by dashed lines. Slice groups 1, 2, 3, and 4 can be syntax structures of ITU-T Recommendation H.264 | ISO/IEC 14496-10. The assignment of particular slice groups to a picture can be specified in the bitstream, picture by picture, in the picture parameter set (PPS) of the ITU-T Recommendation H.264 | ISO/IEC 14496-10 bitstream. The PPS can be carried in-band, as part of the bitstream, or out-of-band. Carrying the PPS in-band requires that the PPS be multiplexed into the access units of the bitstream. Conversely, carrying the PPS out-of-band can require either that a separate transmission channel be used for PPS transmission or that the PPS be installed in the decoder before the decoder is used in a transmission scenario. Up to 256 different PPSs can be used. Which PPS is to be used for a given picture can be signaled in the slice header by a numeric reference.
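As a rough illustration of how a slice group partition of the kind drawn in FIG. 2 can be described, the following Python sketch builds a macroblock-to-slice-group map for a picture divided into four equal quadrants. The function name, the quadrant layout, and the 0-based group numbering are assumptions made for illustration only; an actual bitstream conveys the partition through the PPS slice group syntax, not through a Python list.

```python
from collections import Counter

MB_SIZE = 16  # an H.264 macroblock covers 16x16 luma samples


def quadrant_slice_group_map(width_px, height_px):
    """Return a row-major list that assigns each macroblock to one of four
    quadrant slice groups (0 = top-left, 1 = top-right, 2 = bottom-left,
    3 = bottom-right).

    Hypothetical helper: the quadrant layout is an assumed example.
    """
    width_mbs = width_px // MB_SIZE
    height_mbs = height_px // MB_SIZE
    half_w, half_h = width_mbs // 2, height_mbs // 2
    mb_map = []
    for y in range(height_mbs):
        for x in range(width_mbs):
            group = (2 if y >= half_h else 0) + (1 if x >= half_w else 0)
            mb_map.append(group)
    return mb_map


# A CIF picture (352x288) holds 22x18 = 396 macroblocks, 99 per quadrant.
cif_map = quadrant_slice_group_map(352, 288)
print(Counter(cif_map))
```

An explicit per-macroblock map is used here because it is the easiest form to inspect; rectangular layouts could equally be described by their corner coordinates.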

FIG. 3 shows an exemplary assignment of input video signals or pictures to the slice groups of output video picture 200 (FIG. 2) generated by CSVCS 110. The assignment of the input video signals can be achieved in the compressed domain by modifying the slice headers and assigning the slices to slice groups of the output video. In the assignment shown in FIG. 3, for example, input video signal 0 is assigned to slice group 0, input video signal 1 to slice group 1, input video signal 2 to slice group 2, and input video signals 3 and 4 are both assigned to slice group 3. The assignment can be performed by mapping the input video signals to slices of the slice groups in the output picture. This mapping yields both portions and regions that are assigned within a particular slice group (FIG. 3) and unassigned portions and regions 310.

According to ITU-T Recommendation H.264 | ISO/IEC 14496-10, the entire decoded picture (e.g., output video picture 200) must be described by the coded slice data contained in the bitstream. Because assigning input video signals to the slices of the slice groups can leave both assigned and unassigned regions, CSVCS 110 is configured to create coded slice data for the unassigned regions while compositing a picture.

In implementing the SVC embodiments of the invention, the coded slice data can include skip macroblock data or intra-coded macroblock data. The latter can be needed to create content for the unassigned regions of the output picture. The intra-coded data can have any suitable content. The content can describe, for example, a picture signal that can be transmitted at a low bit rate, such as a uniform gray or black texture. Alternatively or additionally, the content can describe additions of MCU control features such as user information, graphical annotations, and conference control functions.
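The handling of unassigned regions can be sketched as follows: find the macroblocks of the output picture that no input picture covers, then decide what kind of filler data to generate for them. The rectangle-based placement, the function names, and the rule "intra filler for the first picture, skip filler afterwards" are assumptions for illustration. The rule reflects the fact that a skip macroblock copies from an earlier picture, so it cannot be used when no earlier picture exists.

```python
def unassigned_macroblocks(width_mbs, height_mbs, placements):
    """Return the (x, y) macroblock positions not covered by any input.

    placements: list of (x0, y0, w, h) rectangles in macroblock units,
    one per input picture placed in the output picture (hypothetical
    representation).
    """
    covered = set()
    for x0, y0, w, h in placements:
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                covered.add((x, y))
    return [(x, y)
            for y in range(height_mbs)
            for x in range(width_mbs)
            if (x, y) not in covered]


def filler_kind(is_first_picture):
    """Assumed policy: intra-coded filler (e.g. uniform gray) when there is
    no earlier picture to copy from, skip macroblocks otherwise."""
    return "intra" if is_first_picture else "skip"


# CIF output (22x18 macroblocks) with three QCIF inputs (11x9 each);
# the bottom-right quadrant is left unassigned.
placements = [(0, 0, 11, 9), (11, 0, 11, 9), (0, 9, 11, 9)]
free = unassigned_macroblocks(22, 18, placements)
print(len(free), filler_kind(is_first_picture=True))
```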

In system 100, conference control functions can be activated in response to simple signaling or requests by a client/participant (e.g., signaling in which the client/participant points to particular coordinates or a region on the video display screen). To this end, CSVCS 110 is configured to translate the signal into the action represented by the particular coordinates or region on the video display screen (e.g., an image region that is displayed as, and acts as, a button for initiating some action). The client signaling can be implemented, for example, using HTTP techniques, in which the CSVCS provides an HTTP interface for receiving such signals, much as a web server does.
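The translation from signaled coordinates to a conference control action amounts to a hit test against the button regions drawn into the composite picture. The sketch below shows only that lookup; the button names, their positions, and the function name are hypothetical, and the HTTP transport that would deliver the coordinates is omitted.

```python
# Hypothetical button regions in the composite picture, in pixels:
# name -> (x0, y0, width, height)
BUTTONS = {
    "mute": (0, 272, 64, 16),
    "leave": (64, 272, 64, 16),
}


def action_at(x, y, buttons=BUTTONS):
    """Return the name of the button containing pixel (x, y), or None."""
    for name, (x0, y0, w, h) in buttons.items():
        if x0 <= x < x0 + w and y0 <= y < y0 + h:
            return name
    return None


print(action_at(10, 280))    # inside the assumed "mute" region
print(action_at(200, 100))   # outside every button region
```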

Further, CSVCS 110 is configured to hold multiple versions of the coded slice data bits on a storage medium accessible to it and/or to generate such coded slice data bits on the fly, with minimal complexity, according to the conference context in which it is operating.

System 100 can advantageously be configured to minimize end-to-end delay in video conferencing applications. For example, in the operation of system 100, the input video signals to CSVCS 110 can have different temporal resolutions, or can have a shift between the temporal sampling instants of their pictures. The arrival times at CSVCS 110 of the input video signals that form an output video signal can therefore vary. CSVCS 110 can be configured to handle the varying arrival times by generating an output picture triggered by the arrival time of an input video signal. Doing so yields a higher temporal resolution of the output video signal and can minimize the end-to-end delay and other problems caused by late-arriving input video signals. In addition, CSVCS 110 can be configured to insert pre-coded slices from an accessible storage medium for those portions of the video signal for which no content is present.
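A minimal sketch of arrival-triggered compositing follows. Each time a picture arrives from any input, an output picture is emitted in which that input's region carries the new data and every other region carries filler (pre-coded skip data). The event representation and the labels "new" and "skip" are assumptions for illustration.

```python
def compose_on_arrival(arrivals, num_inputs):
    """Emit one output picture per distinct arrival time.

    arrivals: iterable of (time_ms, input_index) events.
    Returns a list of (time_ms, components), where components[i] is "new"
    if input i delivered a picture at that time and "skip" otherwise.
    """
    by_time = {}
    for time_ms, source in sorted(arrivals):
        by_time.setdefault(time_ms, set()).add(source)
    outputs = []
    for time_ms, sources in by_time.items():
        components = ["new" if i in sources else "skip"
                      for i in range(num_inputs)]
        outputs.append((time_ms, components))
    return outputs


# Two inputs at roughly 15 pictures per second, offset by half a period.
events = [(0, 0), (33, 1), (67, 0), (100, 1)]
for picture in compose_on_arrival(events, num_inputs=2):
    print(picture)
```

In this example the two 15 Hz inputs produce four output pictures in 100 ms, which illustrates why the output can have a higher temporal resolution than any single input.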

In one video conferencing implementation of the invention, skip pictures (i.e., copies of all picture content from the previous frame) or low-bit-rate coded slices can be used to represent output picture content that does not change. In such an implementation, a receiving participant can access the correct reference pictures (i.e., the pictures originally intended by the transmitting participant's encoder to be used as reference pictures) by operating its terminal's decoder with the ref_pic_list_reordering syntax structure of ITU-T Recommendation H.264 | ISO/IEC 14496-10. Further, CSVCS 110 can be suitably configured to modify the reference picture list reordering. Similar processing or procedures can be used for any other temporal layering structure that is employed.

In another video conferencing implementation of the invention, the input video signals can be coded at an increased temporal resolution. The increase can be achieved by transmitting additional pictures that are copies of the previously coded picture (i.e., skip pictures). Independent of the picture resolution, the number of bytes for a skipped CIF picture is 2 to 3 bytes for the picture/slice header and 2 to 3 bytes for signaling the skip for the macroblocks. Note that this bit rate is negligible. The coded representations of the additional pictures can be stored on a storage medium accessible to the transmitting participant, or can be generated on the fly with minimal complexity and inserted into the bitstream. In implementing the SVC embodiments of the invention, this increase in the number of macroblocks transmitted per second need not adversely affect the processing power at the receiving endpoint, because special measures can be implemented to handle skip slices efficiently. Further, the H.264 MaxStaticMBPS processing rate parameter (called MaxStaticMBPS in ITU-T Recommendation H.241) can be used to adjust the ITU-T Recommendation H.264 | ISO/IEC 14496-10 level signaling. When the input video signals have a higher temporal resolution, CSVCS 110 can operate at that higher temporal resolution. CSVCS 110 can further be configured to include the incoming pictures from the input video signals according to a given schedule, and to use non-reference pictures inserted as skip pictures to compensate for arrival jitter. The compensation can be achieved by replacing a skip picture with the late-arriving coded picture. In such an implementation, the transmitting participant can make use of the correct reference pictures (i.e., the reference pictures originally intended to be used by the transmitting participant's encoder) by operating its encoder with the ref_pic_list_reordering syntax structure of ITU-T Recommendation H.264 | ISO/IEC 14496-10.
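The figure of 2 to 3 bytes for the skip signaling can be checked with a short calculation. The sketch below assumes CAVLC entropy coding, in which a run of skipped macroblocks is signaled by a single mb_skip_run value coded as an unsigned Exp-Golomb code ue(v), and it assumes one slice covering the whole picture. Slice header bytes and trailing bits are not modeled, and the function names are illustrative.

```python
def ue_bits(value):
    """Length in bits of the unsigned Exp-Golomb code ue(v) for `value`."""
    return 2 * ((value + 1).bit_length() - 1) + 1


def skip_run_bytes(width_px, height_px):
    """Bytes needed to signal that every macroblock of a picture is skipped,
    assuming a single mb_skip_run value for the whole picture."""
    macroblocks = (width_px // 16) * (height_px // 16)
    return (ue_bits(macroblocks) + 7) // 8  # round up to whole bytes


for name, (w, h) in {"QCIF": (176, 144), "CIF": (352, 288),
                     "4CIF": (704, 576)}.items():
    print(name, skip_run_bytes(w, h))
```

Because the code length grows only logarithmically with the number of macroblocks, the cost stays within a few bytes across these picture sizes, which is consistent with the statement that the overhead is essentially independent of resolution.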

In a further multipoint video conferencing implementation of the invention, when different participants in system 100 require different bit rates and different spatial and temporal resolutions, a transmitting participant can create video signals at multiple temporal resolutions. FIG. 4 shows an exemplary layered threading temporal prediction structure 400 for a video signal having pictures L0, L1, and L2 at multiple temporal resolutions. Note that the pictures labeled L2 in FIG. 4 are not used as reference pictures for inter prediction. The pictures labeled L0, and those labeled L0 and L1, however, form prediction chains. If one of these pictures (L0, L1) is not available for reference at a receiving participant's decoder, spatio-temporal error propagation can lead to subjective visual distortion. In the SVC embodiments of the invention, a picture labeled L2 that is sent to CSVCS 110 as an input signal can be marked "not-used-for-reference." When the same L2 picture is transmitted by the CSVCS as a component of a composite output picture, it must be marked "used-for-reference" if the other components of the composite picture are marked used-for-reference. This contrasts with its use in the SVC-based video conferencing systems described in International Patent Application Nos. PCT/US06/28365 and PCT/US06/28366, in which L2 pictures need not be marked used-for-reference. The difference arises because ITU-T Recommendation H.264 | ISO/IEC 14496-10 does not allow a picture to be a composite of reference and non-reference slices; it allows only one kind or the other. Under ITU-T Recommendation H.264 | ISO/IEC 14496-10, if the multiple input video signals to CSVCS 110 at the same time instant contain both reference and non-reference slices, they cannot be mixed in the same output picture. Therefore, in the operation of system 100, to mix a non-reference L2 picture into the output stream, CSVCS 110 labels and uses picture L2 as a reference picture. Picture L2 is coded as an ordinary coded picture requiring an amount of bits similar to that of picture L0 or L1, and can be inserted into the output picture sent to a receiving participant who requests the particular (L2) resolution. For output pictures sent to other receiving participants who do not request the pictures labeled L2, CSVCS 110 can be configured to replace the bits received for the L2 picture from the corresponding input video signal with bits corresponding to a skip picture. In this multipoint video conferencing scenario, the transmitting participant can make use of the correct reference pictures for pictures L0 and L2 (i.e., the pictures originally intended by the transmitting participant's encoder to be used as references) by operating its encoder with the ref_pic_list_reordering syntax structure of ITU-T Recommendation H.264 | ISO/IEC 14496-10. This process can further be extended to L1 pictures and, as with an SVCS, can be used for rate matching and statistical multiplexing purposes.
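The two L2 rules described above can be sketched together: components of one composite picture must share the same reference marking, and an L2 component is swapped for skip data when the receiver has not requested L2. The data class, its field names, and the function are hypothetical; only nal_ref_idc is an actual H.264 NAL unit header field, where 0 means the picture is not used for reference.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Component:
    source: int        # index of the input video signal
    layer: str         # temporal layer label: "L0", "L1" or "L2"
    nal_ref_idc: int   # 0 = not used for reference
    payload: str       # "coded" or "skip"


def compose(components, receiver_wants_l2):
    """Apply the assumed CSVCS rules to the components of one output picture."""
    out = []
    for c in components:
        # Receivers that did not request L2 get skip data in its place.
        if c.layer == "L2" and not receiver_wants_l2:
            c = replace(c, payload="skip")
        out.append(c)
    # Reference and non-reference slices cannot be mixed in one picture:
    # if any component is a reference, mark all of them as references.
    if any(c.nal_ref_idc != 0 for c in out):
        out = [replace(c, nal_ref_idc=max(c.nal_ref_idc, 1)) for c in out]
    return out


inputs = [Component(0, "L0", 1, "coded"), Component(1, "L2", 0, "coded")]
for c in compose(inputs, receiver_wants_l2=False):
    print(c)
```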

FIG. 5 shows an exemplary layering structure 500 suitable for spatially scalable prediction, alternatively for SNR scalable prediction, or for a mix of these predictions, which can be used in the operation of system 100. In structure 500, the base layer for prediction is labeled L0. Two enhancement layers are labeled S0 and Q0. S0 does not depend on Q0, and vice versa. There can, however, be other layers that depend on S0 or Q0 through prediction. In an implementation of the SVC embodiments of the invention, L0 can be a QCIF picture and Q0 can be a 3/2 QCIF picture or a CIF picture. In an exemplary multi-party video conferencing scenario, only one receiving participant may request 3/2 QCIF pictures, while all other participants request CIF or QCIF pictures. In this scenario, the transmitting participant can generate 3/2 QCIF pictures in addition to QCIF and CIF pictures, for the sake of overall system efficiency in transmission. Further, in this scenario, CSVCS 110 can be suitably configured to forward the bits needed to decode the signal at each receiving participant's resolution. In addition, for improved CSVCS 110 operation, the transmitting participant can label the portions of the bitstream that are not designated or used for prediction with a discardable flag, as described, for example, in International Patent Application No. PCT/US06/28365.
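Forwarding only the bits a receiver needs amounts to walking the prediction dependencies from the requested layer back to the base layer. The sketch below encodes the dependencies stated for structure 500 (S0 and Q0 each predicted from L0, independent of each other) and adds one hypothetical layer, S1, to show a layer that depends on S0. The table and function names are illustrative assumptions.

```python
# Layer -> layer it is predicted from (None for the base layer).
# "S1" is a hypothetical extra layer, not taken from FIG. 5.
DEPENDS_ON = {"L0": None, "S0": "L0", "Q0": "L0", "S1": "S0"}


def layers_to_forward(target_layer, depends_on=DEPENDS_ON):
    """Return the layers needed to decode `target_layer`, base layer first."""
    chain = []
    layer = target_layer
    while layer is not None:
        chain.append(layer)
        layer = depends_on[layer]
    return list(reversed(chain))


for target in ("L0", "S0", "Q0", "S1"):
    print(target, layers_to_forward(target))
```

A receiver that decodes S0 never needs Q0, so a server following this rule would not forward Q0 to it.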

FIG. 6 shows a further layered picture coding structure 600 that combines the temporal layering structure (FIG. 4) and the spatially scalable layering structure (FIG. 5). The combined structure can be used in the operation of system 100. In that case, system 100 is configured so that the conferencing entities (i.e., the transmitting participants, each operating a scalable video encoder; CSVCS 110; and the receiving participants, each operating a scalable video decoder) maintain bidirectional control channels among themselves. The control channels from a transmitting participant to CSVCS 110 and from CSVCS 110 to a receiving participant are referred to herein as forward control channels. Conversely, the control channels from a receiving participant to CSVCS 110 and from CSVCS 110 to a transmitting participant are referred to herein as reverse control channels. In the operation of the system, a capability exchange can be performed over the control channels before actual communication among the conferencing entities. The capability exchange can include signaling of the range of spatial and temporal video resolutions supported by each transmitting participant. The range of a transmitting participant's capabilities is conveyed to each receiving participant, who can then select or limit its requests for video features from the sender accordingly.

Through the reverse control channel, a receiving participant can request a spatial video resolution different from the one currently being sent. Likewise, a receiving participant entering a video conferencing session can request video at a particular spatial resolution. In implementing the SVC embodiments of the invention, CSVCS 110 is configured to respond to the receiving participant's request by changing the slice group boundaries for the output pictures sent to that participant. Depending on the spatial resolutions supported by the transmitting participant's scalable video encoder, CSVCS 110 can notify that encoder, over its reverse control channel, whether it needs to support or generate another spatial resolution to satisfy the receiving participant's request.

Note here that International Patent Application No. PCT/US06/28366 describes a scalable video conference server (SVCS) designed to process coding structures such as those described in International Patent Application No. PCT/US06/028365. The SVCS described in the former application has a variety of features designed for multipoint conferencing, based on its ability to manipulate video quality, resolution, and bit rate using scalable video coding. That SVCS assumes that the conference participants' endpoints deploy several decoders in order to present the end user with multiple participant views ("split-screen display"). In some conferencing situations, however, it can be advantageous or necessary to run only a single decoder in an endpoint. For such situations, the SVCS can be further configured or modified to have and apply the compositing functions of the CSVCS described herein. In operation, the modified SVCS can use some or all of the functions of the unmodified SVCS and then use the functions of CSVCS 110.

To aid understanding of the functions of the CSVCS or modified SVCS, it is useful to consider here examples of how SVCS functions can be provided by a CSVCS, with reference to International Patent Application Nos. PCT/US06/28365, PCT/US06/028366, PCT/US06/028367, PCT/US06/027368, and PCT/US06/061815, which are incorporated herein by reference.

First, with reference to International Patent Application No. PCT/US06/028366, note that the same principle of protecting at least the base layer data that the referenced application describes for SVCS operation can be applied directly to CSVCS operation on the network connections between a transmitting endpoint and a CSVCS, between a CSVCS and a receiving endpoint, and between cascaded CSVCSs. Such quality-of-service support can be provided by a CSVCS using means and techniques similar or identical to those used by an SVCS, such as FEC, ARQ (positive/negative acknowledgments), and proactive retransmission. When artificial layers are created by a CSVCS, they can be transmitted over the high-reliability or low-reliability channel in the same way as regular layer data (i.e., coded pictures received from one or more transmitting endpoints). In a manner similar to an SVCS, a CSVCS can respond to changing network conditions (e.g., congestion) by selectively removing enhancement layer data from the composite output video stream. The statistical multiplexing techniques used by an SVCS can also be used by a CSVCS, so that the temporal alignment of pictures in the composite output video stream is performed in such a way that only a subset of the component pictures received from the transmitting endpoints is allowed to significantly exceed its long-term average size. A CSVCS can also feature audio capabilities using scalably coded audio streams, in a manner similar to an SVCS. For audio, there is no equivalent of the slice group concept present in video, which corresponds to "spatial multiplexing." The parallel to the SVCS audio functions is conventional mixing of the audio streams. This audio mixing, however, can be regarded as a further output stage of the SVCS audio operation, so that, for example, algorithms concerned with reducing or eliminating audio clipping effects can still be used by a CSVCS as well. Finally, a CSVCS can also perform network-related functions such as network address translation and proxying, in a manner similar to an SVCS.
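The congestion response mentioned above, selectively removing enhancement layer data, can be sketched as a simple budget check. The layer names, the bit rates, and the policy of always keeping the base layer (it is the protected layer) are assumptions for illustration; the source does not specify a selection algorithm.

```python
def select_layers(layers, budget_kbps):
    """Keep the base layer and as many enhancement layers as fit the budget.

    layers: list of (name, kbps), base layer first, followed by enhancement
    layers in dependency order. The base layer is always kept, even if it
    alone exceeds the budget (assumed policy).
    """
    kept = [layers[0][0]]
    used = layers[0][1]
    for name, kbps in layers[1:]:
        if used + kbps > budget_kbps:
            break  # later layers depend on this one, so stop here
        kept.append(name)
        used += kbps
    return kept


# Hypothetical bit rates for a base layer and two enhancement layers.
stream = [("base", 64), ("enh1", 32), ("enh2", 128)]
print(select_layers(stream, budget_kbps=100))
print(select_layers(stream, budget_kbps=300))
```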

Note that an SVCS can be deployed together with a CSVCS in a cascade that links one or more transmitting endpoints with receiving endpoints. If a receiving endpoint requires a composite output picture, it is advantageous to place the CSVCS as the last server in the cascade and to place the SVCSs at the other, higher positions in the cascade. Note further that the trunking design described in International Patent Application No. PCT/US06/028367 can be applied to a CSVCS/SVCS cascade in the same way as to an SVCS cascade.

In addition, the jitter reduction techniques for SVCS systems described in International Patent Application No. PCT/US06/027368 can be applied directly to a CSVCS, with any enhancement layer data that is not transmitted replaced by suitable artificial layer data in accordance with the principles of the present invention.

To further aid understanding of the functionality of the CSVCS, or modified SVCS, it is useful to consider here, with reference to International Patent Application No. PCT/US06/061815, further examples of how SVCS functionality is provided by a CSVCS.

The error resilience, random access, and layer switching techniques described in International Patent Application No. PCT/US06/061815 in the context of an SVCS also apply directly to a CSVCS system. When applying these techniques, note that the connection between a transmitting endpoint and a CSVCS can be handled in the same way as the connection between a transmitting endpoint and an SVCS, because the characteristic difference between an SVCS and a CSVCS lies in the format of the output video signal, not in the nature of the connection. For the connection between a CSVCS and a receiving endpoint, the same error resilience and random access protection techniques can be applied to the CSVCS output packets by treating the data of each slice group in the CSVCS context as equivalent to the picture data of a transmitting participant in the SVCS context, and by observing that, first, only the packet header data may differ between the two cases and, second, additional artificial layer data may be generated by the CSVCS. For example, picture data can be marked for reliable transmission in a CSVCS environment in the same way as in an SVCS environment (e.g., using RTP header extensions or RNACKs carried in RTCP feedback). The concept of an R picture in an SVCS environment translates into that of an R slice group in a CSVCS environment. Caching of R pictures, the use of periodic intra macroblocks in the encoder of the transmitting endpoint, and fast-forward decoding at the receiving endpoint likewise apply within the context of individual slice groups in a CSVCS environment. The layer switching techniques that are useful in an SVCS environment can be used in the same way. For example, the concept of server-based intra frames for error recovery or for supporting new participants can be applied to slice groups in a CSVCS environment. As with an SVCS, the CSVCS must then decode part of the video data arriving from the transmitting participants, in particular at least the lowest temporal level of the base layer, and must be equipped to re-encode the decoded picture data as intra. If multi-loop coding capability is available at the receiving endpoints, layer switching is considerably simplified, as with an SVCS, because the server does not need to supply intra data.

Finally, the rate control techniques described in U.S. Provisional Patent Application Nos. 60/778,760 and 60/787,031, the stream thinning techniques described in U.S. Provisional Patent Application No. 60/774,094, and the multicast SVCS techniques described in U.S. Provisional Patent Application No. 60/827,469 are also directly applicable to a CSVCS. For example, the technique described in Provisional Patent Application No. 60/787,031, in which S2 pictures are concealed at the decoder using suitably scaled coded information of the base layer (modes, motion vectors, etc.), can be applied to the data within a particular slice group in a CSVCS environment. Importantly, the same concealment effect can be achieved at the CSVCS by replacing the S2 picture with coded data that instructs the decoder to use the base layer information, and inserting that data in its place in the composite output picture. The advantage of this approach is that the receiving endpoint needs no special support, so any SVC-compliant decoder will operate correctly.

The above examples are illustrative only and are not intended to be exhaustive or limiting. It will be understood that any SVCS operation can be performed by a CSVCS, provided that the process of generating the composite output video signal is handled appropriately in accordance with the principles of the present invention.

Referring again to FIG. 1, note further that, in the operation of system 100 and CSVCS 110, the individual bitstreams associated with individual participants that are present in a composite bitstream can easily be extracted from it. CSVCS 110 may be configured to extract these individual bitstreams from a composite bitstream and to reinsert them into a different composite bitstream. This configuration allows cascaded CSVCSs 110 to fully remultiplex the constituent streams according to the preferences of the participants or of downstream servers. A CSVCS 110 with such remultiplexing capability can therefore fully support the cascading and distributed operation features of extended videoconferencing systems, as described, for example, in International Patent Application No. PCT/US06/28366.

System 100 can further be configured, in accordance with the present invention, to convey signal source identification information or other useful information (e.g., directory information, on-screen help) to individual participants and/or slice groups, so that the source identification or other information can be displayed on the participants' display screens. This configuration allows participants to identify the sources of the streams contained in a composite picture. The identification information may include identifying text strings or pre-composed slice data displayed beside the slice group corresponding to an individual participant's video signal. For example, it may include a text string identifying a participant by name (e.g., "John Smith") or by location (e.g., "Dallas, Room A"). In the composite picture, the identification information or other conveyed information may be overlaid on the pixels of each participant's picture, or may be displayed in the unassigned image area surrounding the image areas assigned to the individual participants (e.g., unassigned area 310, FIG. 3). The identification information may be transmitted out of band or in band, as private data.

The description of the SVC embodiment of the present invention below concerns the specific mechanisms for composition using slice groups, as well as the generation of additional layer data when this is needed to ensure that the output bitstream conforms to a scalable video decoder.

To assign the incoming bitstreams to slice groups of the composite picture, the CSVCS uses a map that describes the layout of the slice groups in the composite picture. Specifically, this map, denoted MapOfMbsToSliceGroups below, associates the macroblocks that make up the composite picture of the output bitstream with the slice groups that identify the incoming bitstreams.

Referring to FIG. 7, assume that three streams arrive at the server, with resolutions QCIF, CIF, and CIF respectively, and that a composite video signal with a picture size of 4CIF is to be created from them. A possible map MapOfMbsToSliceGroups (map 700) is shown in FIG. 7. In map 700, slice group 705, with index 0, corresponds to the QCIF stream, and slice groups 1 and 2 (710 and 720, respectively) correspond to the CIF streams. The unassigned area 730 of the picture also has a slice group index (3 in this case).

Note that the map MapOfMbsToSliceGroups (e.g., map 700) is not unique; there may be several ways to lay out the slice groups in the composite picture. A particular layout may be obtained in response to a specific user request, may be computed automatically by the CSVCS, or may be obtained by any other suitable technique. Similarly, the numbering of the slice groups may be obtained by any suitable technique. In one technique, for example, the incoming bitstreams are indexed, and the corresponding slice groups are then placed in the composite picture in order of increasing index, in raster-scan order, from left to right and top to bottom.

For proper decoding, it may be necessary to transmit the map MapOfMbsToSliceGroups to the participants receiving the composite video signal. This can be done by incorporating MapOfMbsToSliceGroups into the picture parameter set of the composite signal, using the slice group identification syntax specified in subclauses 7.3.2.2 and 7.4.2.2 of H.264.

Specifically, MapOfMbsToSliceGroups can be incorporated into the picture parameter set of the composite video signal with the following settings:

num_slice_groups_minus1 = NumAssignedAreas;

slice_group_map_type = 6;
// (explicit assignment of MBs to slice groups)

pic_size_in_map_units_minus1 = NumMbs - 1;

for (i = 0; i <= pic_size_in_map_units_minus1; i++)
    slice_group_id[i] = MapOfMbsToSliceGroups[i];

Here, for the example assignment of FIG. 7, NumAssignedAreas is 3 and NumMbs is 4 × 396 = 1584 (four times the CIF size), so that pic_size_in_map_units_minus1 is 1583. Note that slice group map type 2 (a set of rectangles plus background) can also be used here instead of type 6 (arbitrary assignment).
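The settings can be illustrated with a minimal Python sketch that builds an explicit map for a 4CIF composite (44 × 36 macroblocks) and derives the slice group fields of the picture parameter set from it. The positions of the QCIF stream (top-left) and of stream 1 (starting at macroblock 22) follow the indices quoted in the description; the position of stream 2 (bottom-left) and the helper names `build_map` and `pps_fields` are assumptions made purely for illustration.

```python
# Picture sizes in macroblocks (one macroblock covers 16x16 luma samples).
QCIF = (11, 9)        # 176x144
CIF = (22, 18)        # 352x288
FOUR_CIF = (44, 36)   # 704x576

UNASSIGNED = 3  # slice group index of the unassigned area in this example


def build_map(placements, pic=FOUR_CIF, background=UNASSIGNED):
    """Return MapOfMbsToSliceGroups as a flat list in raster-scan order.

    placements: list of (slice_group, (col, row), (width, height)), in MBs.
    """
    width, height = pic
    mb_map = [background] * (width * height)
    for group, (col0, row0), (w, h) in placements:
        for row in range(row0, row0 + h):
            for col in range(col0, col0 + w):
                mb_map[row * width + col] = group
    return mb_map


def pps_fields(mb_map, num_assigned_areas):
    """Slice group fields of the picture parameter set (hypothetical helper)."""
    return {
        "num_slice_groups_minus1": num_assigned_areas,
        "slice_group_map_type": 6,  # explicit assignment of MBs to slice groups
        "pic_size_in_map_units_minus1": len(mb_map) - 1,
        "slice_group_id": list(mb_map),
    }


# Assumed layout: QCIF top-left, stream 1 top-right, stream 2 bottom-left.
layout = [
    (0, (0, 0), QCIF),
    (1, (22, 0), CIF),
    (2, (0, 18), CIF),
]
mb_map = build_map(layout)
pps = pps_fields(mb_map, num_assigned_areas=3)
print(pps["pic_size_in_map_units_minus1"])  # 1583
print(mb_map.index(UNASSIGNED))             # 11, first unassigned macroblock
print(mb_map.index(1))                      # 22, first macroblock of stream 1
```

With this layout the first unassigned macroblock is 11 and stream 1 starts at macroblock 22, which agrees with the values used elsewhere in the description.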

To correctly assign slices from the incoming bitstreams to the corresponding slice groups of the output bitstream, given the slice header syntax specified by the SVC standard, the CSVCS needs to create an additional map. This map gives the correspondence between the macroblock (MB) indices of each individual stream and the MB indices of the composite signal. For example, macroblock index 0 of stream 1 (710 in FIG. 7) corresponds to MB index 22 in the composite picture. Denoting this two-dimensional map by MapMbIndex, we have MapMbIndex[1][0] = 22 for the example given above.
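As a rough illustration of how such a correspondence map could be computed, the Python sketch below derives MapMbIndex from the position of each stream in the composite picture. It assumes that map units are macroblocks in raster-scan order; the function name `map_mb_index`, the `streams` table, and the position of stream 2 are hypothetical.

```python
def map_mb_index(k, origin, stream_width, composite_width=44):
    """Composite-picture MB index of macroblock k of an input stream.

    origin: (col, row) of the stream's top-left macroblock in the composite.
    stream_width: width of the input picture in macroblocks.
    composite_width: width of the composite picture in macroblocks (4CIF: 44).
    """
    col0, row0 = origin
    row, col = divmod(k, stream_width)
    return (row0 + row) * composite_width + (col0 + col)


# Assumed placement of the three streams of the example (in macroblocks).
streams = {
    0: {"origin": (0, 0), "width": 11, "height": 9},     # QCIF
    1: {"origin": (22, 0), "width": 22, "height": 18},   # CIF
    2: {"origin": (0, 18), "width": 22, "height": 18},   # CIF, position assumed
}

MapMbIndex = {
    n: [map_mb_index(k, s["origin"], s["width"])
        for k in range(s["width"] * s["height"])]
    for n, s in streams.items()
}

print(MapMbIndex[1][0])   # 22
print(MapMbIndex[1][22])  # 66, first macroblock of the second row of stream 1
```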

The procedure for assigning slices to slice groups is as follows.

For a slice from stream n (e.g., n = 0, 1, 2 in the example of FIG. 7), perform the following steps.

(a) Parse the slice header bitstream to find the index of the first MB in the slice (first_mb_in_slice). Call this number k.

(b) Use MapMbIndex to determine the corresponding index/position of that MB in the composite picture, namely MapMbIndex[n][k].

(c) Remove the emulation_prevention_three_byte syntax elements from the NAL unit of the slice, in accordance with subclause 7.3.1 of H.264.

(d) Replace the existing first_mb_in_slice syntax element with the number MapMbIndex[n][k].

(e) Reinsert the emulation_prevention_three_byte syntax elements into the NAL unit, in accordance with subclause 7.3.1 of H.264.

Steps (a) through (e) are repeated for every slice of every incoming stream that is to be included in the composite output picture.
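Steps (a) through (e) can be sketched in Python as follows. This is a simplified illustration, not a complete implementation: it assumes a one-byte NAL unit header (SVC NAL units carry a longer header, hence the `header_len` parameter), relies on first_mb_in_slice being the first ue(v) Exp-Golomb element of the slice header, and re-aligns only the RBSP trailing bits, which is sufficient for CAVLC-coded slices. For CABAC, the alignment bits that precede the slice data would also have to be regenerated. All function names are hypothetical.

```python
def strip_emulation_prevention(data: bytes) -> bytes:
    """Step (c): drop each 0x03 that follows two consecutive zero bytes."""
    out, zeros = bytearray(), 0
    for b in data:
        if zeros >= 2 and b == 3:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def add_emulation_prevention(data: bytes) -> bytes:
    """Step (e): insert 0x03 after two zero bytes if the next byte is <= 3."""
    out, zeros = bytearray(), 0
    for b in data:
        if zeros >= 2 and b <= 3:
            out.append(3)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def encode_ue(value: int) -> str:
    """Unsigned Exp-Golomb code of value, as a string of '0'/'1'."""
    code = bin(value + 1)[2:]
    return "0" * (len(code) - 1) + code


def decode_ue(bits: str, pos: int = 0):
    """Return (value, position after the code) for the ue(v) code at pos."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def rewrite_first_mb(nal: bytes, new_first_mb: int, header_len: int = 1) -> bytes:
    """Steps (c) to (e): replace first_mb_in_slice in a slice NAL unit."""
    header = nal[:header_len]
    payload = strip_emulation_prevention(nal[header_len:])
    bits = "".join(f"{b:08b}" for b in payload)
    _, after = decode_ue(bits)                 # step (a): old first_mb_in_slice
    body = bits[after:].rstrip("0")[:-1]       # drop padding and the stop bit
    bits = encode_ue(new_first_mb) + body + "1"
    bits += "0" * (-len(bits) % 8)             # byte-align again
    rbsp = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return header + add_emulation_prevention(rbsp)


# Toy slice: first_mb_in_slice = 0, followed by four payload bits "1011".
toy = bytes([0x65, 0b11011100])
out = rewrite_first_mb(toy, 22)   # step (b) would supply MapMbIndex[n][k]
print(out.hex())                  # 650bdc
```

The new index is usually longer than the old one in Exp-Golomb form, so the remainder of the slice shifts by a few bits; this is why the sketch works on a bit string rather than patching bytes in place.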

Still referring to FIG. 7, for the area 730 of the composite picture that is unassigned (i.e., not assigned to any of the incoming streams), the CSVCS proceeds as follows.

For the first composite picture, the following steps are performed.

(a) Create a slice that will contain the compressed MB bits for this area. For a given limited set of picture sizes and CSVCS configuration options, this slice may be stored in advance; otherwise it may be computed online.

(b) Set the slice type (in the slice header) to 2 (I slice).

(c) The index of the first MB of this slice (set in the slice header) should correspond to the position of the first unassigned MB in the composite picture (11 in the above example).

(d) Fill the unassigned area with pixel values that are preferably all equal, for efficient coding. The value is preferably gray; that is, the sample values should equal 128 so that the Intra_16x16_DC prediction mode can be used efficiently for the top-left corner MB.

(e) Compress all MBs as Intra16x16 and set the mb_type parameter in the corresponding MB headers to this mode. Specifically, depending on the position of the macroblock, the mode (mb_type) should be selected from the following:

(i) I_16x16_0_0_0 (vertical prediction from the MB above)
(ii) I_16x16_1_0_0 (horizontal prediction from the MB to the left)
(iii) I_16x16_2_0_0 (DC prediction, when no neighbor is available)

If CAVLC is used, preference is given to the mb_type values I_16x16_0_0_0 or I_16x16_1_0_0. If CABAC is used, preference is given to I_16x16_2_0_0; this mb_type value is then the same for all macroblocks in the slice, so CABAC can code it efficiently.
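One possible reading of the mb_type selection rule in step (e) is sketched below in Python. The ordering of the preference under CAVLC (vertical first, then horizontal, then DC) and the rule that a neighbor counts as available only if it lies in the same unassigned area are assumptions of this sketch; the function name `filler_mb_types` is hypothetical.

```python
VERTICAL = "I_16x16_0_0_0"     # prediction from the MB above
HORIZONTAL = "I_16x16_1_0_0"   # prediction from the MB to the left
DC = "I_16x16_2_0_0"           # DC prediction


def filler_mb_types(region, entropy_coder="CAVLC"):
    """Choose an mb_type for every macroblock of the unassigned area.

    region: set of (col, row) macroblock positions forming the area.
    entropy_coder: "CAVLC" or "CABAC".
    Returns a dict mapping (col, row) to the chosen mb_type.
    """
    types = {}
    for col, row in sorted(region, key=lambda p: (p[1], p[0])):  # raster scan
        if entropy_coder == "CABAC":
            types[(col, row)] = DC            # same value for every macroblock
        elif (col, row - 1) in region:
            types[(col, row)] = VERTICAL
        elif (col - 1, row) in region:
            types[(col, row)] = HORIZONTAL
        else:
            types[(col, row)] = DC            # no neighbor available
    return types


# Small part of an unassigned area: columns 11 to 13, rows 0 and 1.
region = {(c, r) for c in range(11, 14) for r in range(2)}
cavlc = filler_mb_types(region, "CAVLC")
print(cavlc[(11, 0)])  # I_16x16_2_0_0
print(cavlc[(12, 0)])  # I_16x16_1_0_0
print(cavlc[(11, 1)])  # I_16x16_0_0_0
```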

Still referring to FIG. 7, for subsequent pictures the following steps are performed for the unassigned area 730 of the composite picture.

(b) Set the slice type (in the slice header) to 0 (P slice).

(c) The index of the first MB in this slice (first_mb_in_slice) should correspond to the position of the first unassigned MB in the composite picture (11 in the example of FIG. 7).

(d) Set all macroblock types mb_type to P_Skip, by setting mb_skip_run (for CAVLC) or by setting mb_skip_flag equal to 1 (for CABAC).

Note that the composite output picture must have the same values of the temporal_id and dependency_id parameters in the NAL unit header across all slices and slice groups.

The temporal_id is assigned as follows.

(a) If the incoming bitstreams are temporally synchronized with respect to their temporal structure, the output picture is assigned the same temporal_id value as the corresponding input pictures. This is the preferred mode of operation. The output video is then handled in the same way as the input video as far as temporal layering and error resilience processing are concerned.

(b) Otherwise (the incoming bitstreams are not temporally synchronized), temporal_id must be assigned to the output pictures in a way that permits all the inter prediction structures used in the various incoming bitstreams. In general (and in practice), this results in assigning the same layer number (temporal_id = 0) to all pictures of the output bitstream.
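The two cases can be condensed into a short Python sketch. The function name `output_temporal_id` is hypothetical, and treating streams that are flagged as synchronized but carry differing temporal_id values as unsynchronized is an assumption of this sketch.

```python
def output_temporal_id(input_temporal_ids, synchronized):
    """temporal_id for one composite output picture.

    input_temporal_ids: temporal_id of the contributing picture of each stream.
    synchronized: True if the streams share the same temporal structure.
    """
    ids = set(input_temporal_ids)
    if synchronized and len(ids) == 1:
        return ids.pop()   # case (a): reuse the common input value
    return 0               # case (b): a single temporal layer for all pictures


print(output_temporal_id([2, 2, 2], synchronized=True))    # 2
print(output_temporal_id([0, 1, 2], synchronized=False))   # 0
```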

The CSVCS can nevertheless keep track of the temporal dependency structures of the various incoming bitstreams. Because slices (and hence slice groups) are transmitted in separate packets, error resilience mechanisms, including packet-based retransmission, forward error correction, and in general any technique designed for an SVCS, can be applied to the slices, and thus to the slice groups, in a CSVCS system.

In a CSVCS system, dependency_id is assigned as follows.

(a) If the incoming bitstreams are synchronized such that, for every output picture in every layer, the same dependency_id value is present in the input pictures, that dependency_id value, or a shifted value, is used.

(b) Otherwise (the dependency_id values differ), the dependency_id values of the incoming bitstreams are adjusted so that, for each layer of the composite output picture, they are equal across the slice groups. This may require increasing the dependency_id values of some input signals and adding extra base layers.

The procedure can be understood by continuing with the example of FIG. 7. In that example, two CIF signals (slice groups 1 710 and 2 720) and one QCIF input signal (slice group 0 705) are composited into a 4CIF output picture. Assume that each of the CIF signals is coded with spatial scalability and that a base layer with QCIF resolution is provided for each. The base layer of the output picture is then a CIF picture that contains the two QCIF base layers (dependency_id = 0) of the two CIF enhancement layer input signals (slice groups 1 710 and 2 720, dependency_id = 1). Assume further that the QCIF input signal (slice group 0 705) has no base layer. Its dependency_id value is then 0, and this value must be increased to 1 if the signal is to be used in the same layer of the composite output picture as the two CIF input signals (slice groups 1 710 and 2 720). An additional base layer, for example a QQCIF (quarter QCIF) layer, must therefore be created by the CSVCS for the base layer of the composite output picture. The pictures in this generated layer can be completely empty; that is, they contain only P_Skip macroblocks and are not used for inter-layer prediction. The layer is created and added to the composite output picture solely so that an SVC-compliant decoder can decode the composite output picture properly.
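The adjustment in this example can be sketched in Python as follows. The sketch assumes that the highest layer of every input is aligned with the highest layer of the composite output, so that an input with fewer layers has its dependency_id values shifted upward and receives artificial base layers below them; the function name `align_dependency_ids` is hypothetical.

```python
def align_dependency_ids(layers_per_input):
    """Plan the dependency_id adjustment for each input stream.

    layers_per_input: dict mapping stream index to its number of spatial
    layers (an input with k layers uses dependency_id 0 .. k-1).
    Returns, per stream, the dependency_id values its layers receive in the
    composite output and the dependency_id values of the artificial layers
    that must be generated for it.
    """
    top = max(layers_per_input.values())
    plan = {}
    for stream, count in layers_per_input.items():
        shift = top - count
        plan[stream] = {
            "output_ids": list(range(shift, top)),
            "artificial_ids": list(range(shift)),
        }
    return plan


# Example of FIG. 7: the QCIF input has one layer, the CIF inputs have two.
plan = align_dependency_ids({0: 1, 1: 2, 2: 2})
print(plan[0])  # {'output_ids': [1], 'artificial_ids': [0]}
print(plan[1])  # {'output_ids': [0, 1], 'artificial_ids': []}
```

For stream 0 the single QCIF layer moves from dependency_id 0 to 1, and one artificial layer (the QQCIF layer of the example) is generated at dependency_id 0.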

When spatial scalability is used, the same ratio of spatial resolutions must be used for the slice groups corresponding to the input signals. Depending on the spatial resolution ratios, the following steps are performed.

(a) If a single resolution ratio is present in the input signals (e.g., input A: QCIF, CIF, 4CIF, and input B: QQVGA, QVGA, VGA, where the ratio is 2), the ratios between spatial resolutions always match. These resolutions can then be mixed in all spatial layers of the composite output picture.

(b) Otherwise (several spatial resolution ratios are present in the input signals), intermediate layers can be inserted to ensure that the spatial resolution ratio is the same for all layers of the composite output picture.

For example, assume that the spatial ratios 1.5 and 2 are both present in the input signals to be composited. More precisely, referring to FIG. 7, assume that the CIF input signal of slice group 1 710 has a base layer with 2/3 CIF resolution, that CIF slice group 2 720 has a QCIF base layer, and that QCIF slice group 0 has a QQCIF base layer. The CSVCS must be configured to operate with three spatial layers, with corresponding dependency_id values 0, 1, and 2. Intermediate artificial ("dummy") layers need to be generated for these input signals, which the CSVCS inserts into the composite output picture. This is shown in FIG. 8, which uses the same composite picture layout as FIG. 7 but also shows the lower layer pictures with the corresponding layer data of the component incoming video signals. For the CIF input signal 832 of slice group 2, an artificial intermediate layer 822 with 2/3 CIF resolution must be created, while for the QCIF input signal 830 of slice group 0, an artificial intermediate layer 820 with 2/3 QCIF resolution must be created. Finally, for the CIF input signal 831 of slice group 1, an artificial base layer 811 must be created at QCIF resolution. An efficient way to code these artificial layers is to code all macroblocks in P_Skip mode and not to use the layers for inter-layer prediction. The exception is the macroblocks of the first picture, which can contain intra-coded gray values that, as described earlier, can be represented very efficiently.

The description that follows concerns the synchronization of the incoming pictures received from the transmitting endpoints with respect to the composite output signal transmitted to one or more receiving endpoints.

Note that the CSVCS needs to flag all outgoing composite pictures as reference pictures in the output bitstream, because it is very likely that at least one of the incoming frames that form part of a composite output picture is used as a reference picture in its own stream. Furthermore, because the picture data from the one or more transmitting endpoints arrives at the CSVCS asynchronously, the same picture may have different frame numbers in its incoming bitstream and in the composite output bitstream. This can cause inconsistencies when the composite picture is decoded at a receiving participant, because the correct references to previous pictures may not be properly established in each slice group.

The CSVCS therefore needs to address two problems: first, creating composite pictures when the frames of the different incoming streams arrive at the CSVCS asynchronously in time; and second, ensuring that the pictures containing the slice groups maintain the correct references for prediction (with respect to the transmitted composite signal).

Picture synchronization may be performed by either of the following two techniques.

1. Buffer the incoming pictures using a window corresponding to the picture arrival times for a given sampling frequency at the CSVCS that is greater than or equal to the sampling frequency of the incoming stream with the highest sampling frequency.

2. Buffer the incoming pictures using a window corresponding to the sampling times at the CSVCS, which have period ΔT, where ΔT is the inverse of the frame rate (FPS) of the composite signal. To create the new composite picture that must be sent out at each time sample, examine the new content that arrived at the CSVCS within the last W time units. The window width W can be chosen to be, for example, 1/FPS.

The following algorithm illustrates an exemplary CSVCS operation for picture synchronization.

frame_num = 0
for t = ΔT, 2ΔT, ...
    for each incoming video stream n
        if (new slice data for stream n arrived within (t-W, t])
            assign this slice data to the corresponding slice group
            apply ref_pic_list_reordering() to each slice in the group
            update the maps MapOrigInd and MapCompInd for this stream
        else
            skip the slice data in the corresponding slice group (using generic data)
        set frame_num in the slice header of all slices in the group
    send this composite picture
    update the frame counter: frame_num++

The statements

    apply ref_pic_list_reordering() to each slice in the group
    update the maps MapOrigInd and MapCompInd for this stream

relate to the problem of maintaining correct reference picture data in the composite output picture, which is described next.

The ref_pic_list_reordering() syntax provided in the slice header and the maps MapOrigInd and MapCompInd are used to build the appropriate reference picture list whenever new content arrives at the server. In particular, the CSVCS must always track how the original reference picture indices of a slice group (input video stream) map to the composite output picture indices. Specifically, whenever new slice data for a stream arrives at the CSVCS, the server places its original index at the front of the map MapOrigInd and its composite picture index at the front of the map MapCompInd, shifting the existing entries one position to the right. Furthermore, once the length of these maps exceeds a certain limit, the server simply discards the last entry of each map whenever a new entry is added at the front. The maps therefore operate as finite-capacity stacks.

The CSVCS maintains one pair of such maps for each input stream. The maps can therefore be represented as two-dimensional arrays, in which the first index is the stream index (n = 0, 1, or 2 in the example of FIG. 7) and the second index ranges between zero and some predetermined number (MaxNumRefFrame), which specifies how far back in the past frames are to be tracked for an input stream.
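The finite-capacity maps described above can be sketched in Python with `collections.deque`, whose `maxlen` argument discards the entry at the far end when a new one is pushed onto a full deque. This is only an illustration: the class `ReferenceMaps`, its method names, and the capacity of 4 are assumptions, and only the map names follow the text.

```python
from collections import deque

MAX_NUM_REF_FRAME = 4  # assumed capacity for this sketch

class ReferenceMaps:
    """One pair of bounded maps per input stream (hypothetical helper class)."""

    def __init__(self, num_streams, capacity=MAX_NUM_REF_FRAME):
        self.map_orig_ind = [deque(maxlen=capacity) for _ in range(num_streams)]
        self.map_comp_ind = [deque(maxlen=capacity) for _ in range(num_streams)]

    def register(self, n, orig_frame_num, frame_num):
        # The newest entry goes to index 0; a full deque drops its last entry.
        self.map_orig_ind[n].appendleft(orig_frame_num)
        self.map_comp_ind[n].appendleft(frame_num)

    def composite_index_of(self, n, orig_frame_num):
        """Translate an original frame number of stream n to its composite frame number."""
        position = list(self.map_orig_ind[n]).index(orig_frame_num)
        return self.map_comp_ind[n][position]

# Assumed example: stream 1 delivers five reference pictures.
maps = ReferenceMaps(num_streams=3)
for orig, comp in [(10, 0), (11, 2), (12, 3), (13, 5), (14, 6)]:
    maps.register(1, orig, comp)

print(list(maps.map_orig_ind[1]), list(maps.map_comp_ind[1]))
```

In this example the oldest pair (10, 0) has been discarded because the assumed capacity is four entries, and a lookup for original frame 12 returns composite frame 3.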

Assume that new picture slice data for stream n has arrived and has been placed in the appropriate slice group of the composite picture. For each slice in that group, the CSVCS performs the following operations on the slice header data.

// Check whether reordering has already been requested
if ( ref_pic_list_reordering_flag_l0 == 1 ) do
    // This flag can be read from the slice header
    index = 0; CurrPic = frame_num;
    read first reordering_of_pic_nums_idc from the slice header
    while ( reordering_of_pic_nums_idc != 3 ) do
        if ( reordering_of_pic_nums_idc == 0 || reordering_of_pic_nums_idc == 1 ) do
            // Short-term reference picture
            read abs_diff_pic_num_minus1 from the slice header
            if ( reordering_of_pic_nums_idc == 0 )
                PredOrigPic = MapOrigInd[n][index] - ( abs_diff_pic_num_minus1 + 1 )
            else
                PredOrigPic = MapOrigInd[n][index] + ( abs_diff_pic_num_minus1 + 1 )
            compIndex = find index ( MapOrigInd[n][:] == PredOrigPic )
            PredCompPic = MapCompInd[n][compIndex];
            if ( CurrPic > PredCompPic )
                abs_diff_pic_num_minus1 = CurrPic - PredCompPic - 1;
                write reordering_of_pic_nums_idc = 0 in the slice header;
                // replaces the existing reordering_of_pic_nums_idc value
            else
                abs_diff_pic_num_minus1 = PredCompPic - CurrPic - 1;
                write reordering_of_pic_nums_idc = 1 in the slice header;
                // replaces the existing reordering_of_pic_nums_idc value
            write abs_diff_pic_num_minus1 in the slice header;
            index++; // move to the next entry
            CurrPic = PredCompPic;
        else if ( reordering_of_pic_nums_idc == 2 ) do
            read long_term_pic_num from the slice header
            index_long_term = find ( MapOrigInd[n][:] == long_term_pic_num )
            write MapCompInd[n][index_long_term] in the slice header
        read next reordering_of_pic_nums_idc from the slice header
    end (while ( reordering_of_pic_nums_idc != 3 ))
else
    // ( ref_pic_list_reordering_flag_l0 == 0 ): no previous reordering requested
    set ref_pic_list_reordering_flag_l0 (= 1) in the slice header
    CurrPic = frame_num;
    for index = 0, ..., MaxNumRefFrame-1
        if ( CurrPic > MapCompInd[n][index] )
            abs_diff_pic_num_minus1 = CurrPic - MapCompInd[n][index] - 1;
            write reordering_of_pic_nums_idc = 0 in the slice header;
        else
            abs_diff_pic_num_minus1 = MapCompInd[n][index] - CurrPic - 1;
            write reordering_of_pic_nums_idc = 1 in the slice header;
        write abs_diff_pic_num_minus1 in the slice header;
        CurrPic = MapCompInd[n][index];
    write reordering_of_pic_nums_idc = 3;
end (of the if-else check on the existing ref_pic_list_reordering_flag_l0 flag)
Note that the operations described here assume that only P slices are present. For B slices, a similar procedure applies, using the ref_pic_list_reordering() syntax provided in the slice header (with ref_pic_list_reordering_flag_l1 set in the slice header). Note also that the reference picture indices are stored from the most recent one to arrive at the server (index = 0) to the oldest one tracked (index = MaxNumRefFrame-1).
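The branch taken when no reordering was previously requested can be sketched in Python as follows. This is a simplified illustration, not the server's actual code: both function names are hypothetical, `comp_indices` stands in for MapCompInd[n][:] with the newest entry first, the entries are assumed to differ from the current frame number, and the modular wrap-around of picture numbers in H.264 is ignored.

```python
def build_reordering_commands(frame_num, comp_indices):
    """Emit (reordering_of_pic_nums_idc, abs_diff_pic_num_minus1) pairs.

    idc 0 means "subtract", idc 1 means "add", and idc 3 ends the list.
    """
    commands = []
    curr_pic = frame_num
    for pred_comp_pic in comp_indices:
        if curr_pic > pred_comp_pic:
            commands.append((0, curr_pic - pred_comp_pic - 1))
        else:
            commands.append((1, pred_comp_pic - curr_pic - 1))
        curr_pic = pred_comp_pic  # the next difference is relative to this picture
    commands.append((3, None))
    return commands

def resolve_commands(frame_num, commands):
    """Decoder-side view: turn the commands back into picture numbers."""
    pictures, pred = [], frame_num
    for idc, diff in commands:
        if idc == 3:
            break
        pred = pred - (diff + 1) if idc == 0 else pred + (diff + 1)
        pictures.append(pred)
    return pictures

# Assumed example: composite frame 7 with four tracked composite references.
commands = build_reordering_commands(7, [6, 5, 3, 2])
print(commands)
print(resolve_commands(7, commands))
```

Resolving the commands reproduces the composite indices that were passed in, which is the property the rewritten slice header needs for the receiving decoder to select the intended references.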

After new picture data arrives from a transmitting participant's video stream, the CSVCS needs to register its index (if it is a reference picture) in the maps MapOrigInd and MapCompInd so that the picture can be used in subsequent operations. Specifically, the following operations are performed. First, the CSVCS extracts the original frame number ("orig_frame_num") from any slice header of the new picture data for stream n. MapOrigInd and MapCompInd are then updated as follows (stack insertion):

for index = MaxNumRefFrame-1, ..., 1
    MapOrigInd[n][index] = MapOrigInd[n][index-1]
    MapCompInd[n][index] = MapCompInd[n][index-1]
MapOrigInd[n][0] = orig_frame_num;
MapCompInd[n][0] = frame_num;
If the temporal coding dependency structures of the input video signals received from the transmitting endpoints are compatible, the CSVCS can align them perfectly even when their frame rates differ. For example, assume that the threaded picture coding structure of International Patent Application No. PCT/US06/028365 is used, and that pictures from two participants are to be composited: the first with three layers L0, L1, and L2 at a total of 30 frames per second, and the second with two layers L0 and L1 at a total of 15 frames per second. The CSVCS can create an artificial temporal layer L2' for the second participant and then construct the composite output pictures so that the first participant's L0, L1, and L2 pictures are composited into the same output pictures as the second participant's L0, L1, and L2' pictures, respectively. This preserves the threading pattern within the composite output video picture.
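The alignment in this example can be sketched in Python. The repeating layer pattern L0, L2, L1, L2 used below is an assumption made for the illustration (the referenced application defines the actual threaded structure), and the function name is hypothetical. Output slots in which the second participant has no real picture are filled with the artificial layer, written L2' here.

```python
# Assumed threading pattern for the 30 fps, three-layer participant.
PATTERN_3_LAYERS = ["L0", "L2", "L1", "L2"]

def align_lower_rate_stream(num_output_pictures):
    """Pair the first participant's layer with the second participant's layer.

    The second participant sends only L0 and L1 at half the frame rate, so
    its real pictures coincide with the first participant's L0 and L1 slots.
    """
    schedule = []
    for k in range(num_output_pictures):
        first = PATTERN_3_LAYERS[k % len(PATTERN_3_LAYERS)]
        second = first if first in ("L0", "L1") else "L2'"
        schedule.append((k, first, second))
    return schedule

for entry in align_lower_rate_stream(4):
    print(entry)
```

Under this assumed pattern every composite picture carries the same temporal layer for both participants, which is what allows the threading pattern to survive in the composite output.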

The CSVCS can also perform spatial resolution switching, upsampling, and shifting of input signals within the composite output video signal.

Upsizing (by one layer) is accomplished by sending intra macroblocks in I slices for all layers, that is, for the corresponding slice group. All-intra data is required because the dependency_id values must be adjusted as described above, and because motion compensation across different dependency_id values is not permitted in SVC-compliant decoders. The corresponding slice group then covers a larger area of the composite output picture, and other slice groups in the composite output picture may need to be shifted accordingly. The intra data can be computed by the CSVCS itself, in which case it must decode at least the lowest temporal level of the base layer, or it can be produced by the endpoint at the request of the CSVCS. Downsizing is performed in the same way as upsizing.

Upsampling of a particular video signal received from a transmitting endpoint can be performed by inserting an additional enhancement layer generated at the CSVCS, in which every macroblock is encoded so that its content is simply copied from the corresponding lower-layer macroblock. Including an additional layer in a participant's video signal may require reorganizing the overall scalability structure of the composite output picture using the techniques described herein.

Shifting of an input signal is preferably done in multiples of macroblocks. A receiver can shift a picture by means of a user interface request (for example, a mouse drag). The CSVCS accommodates the shift by adjusting the motion vectors accordingly (adding or subtracting integer multiples of 16 sample positions). Note that motion vectors are usually coded differentially, so in this case it is most likely that only the value of the first motion vector needs to be changed.
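The remark about differential coding can be illustrated with a deliberately simplified Python model. The model assumes that each motion vector is coded as a difference from the previous one, whereas actual H.264 prediction is formed from neighbouring blocks. It also assumes that vectors are stored in quarter-sample units, so a shift of one macroblock equals 16 x 4 = 64 units. The function names are hypothetical.

```python
MB_SIZE = 16     # luma samples per macroblock side
QUARTER_PEL = 4  # assumed storage unit: quarter-sample motion vectors

def shift_first_vector(mv_diffs, mb_dx, mb_dy):
    """Shift by whole macroblocks by editing only the first coded vector."""
    dx = mb_dx * MB_SIZE * QUARTER_PEL
    dy = mb_dy * MB_SIZE * QUARTER_PEL
    first_x, first_y = mv_diffs[0]
    return [(first_x + dx, first_y + dy)] + list(mv_diffs[1:])

def reconstruct(mv_diffs):
    """Undo the simplified differential coding: vector = previous + difference."""
    vectors, x, y = [], 0, 0
    for ddx, ddy in mv_diffs:
        x, y = x + ddx, y + ddy
        vectors.append((x, y))
    return vectors

# Assumed example: three coded differences, shifted one macroblock to the right.
coded = [(4, -8), (2, 0), (-1, 3)]
print(reconstruct(coded))
print(reconstruct(shift_first_vector(coded, 1, 0)))
```

In this model, changing only the first coded value moves every reconstructed vector by the same 64 units, which mirrors the observation that only the first motion vector is likely to need rewriting.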

While what are believed to be the preferred embodiments of the present invention have been described, those skilled in the art will recognize that further changes and modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as fall within the spirit of the invention.

It will also be understood that the systems and methods of the present invention can be implemented using any suitable combination of hardware and software. The software (i.e., instructions) for implementing and operating the aforementioned systems and methods can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, online downloadable media, and other available media.

A schematic diagram of a videoconferencing system in which a compositing scalable video conferencing server (CSVCS) is configured to deliver scalable video and audio data from endpoint transmitters to client receivers, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary partitioning of an output video picture into slice groups, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary assignment of input videos to the various slice groups of an output video picture, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary layered picture coding structure for temporal layers, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary layered picture coding structure for SNR or spatial enhancement layers, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary layered picture coding structure for a base layer, temporal enhancement layers, and SNR or spatial enhancement layers, with different prediction paths for the base and enhancement layers, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary partitioning of an output video picture into slice groups in a slice-group-based compositing process, in accordance with the principles of the present invention.

A block diagram illustrating an exemplary structure for constructing artificial layers when compositing the output video signal transmitted from the CSVCS, in which different spatial scalability ratios are combined, in accordance with the principles of the present invention.

Claims

A multi-endpoint video signal conferencing system for conducting a videoconference between a plurality of endpoints over a communication network, the system comprising:
a conference bridge linked to at least one receiving endpoint and to at least one transmitting endpoint, each by at least one communication channel,
wherein the at least one transmitting endpoint transmits a digital video signal encoded using an encoding format selected from the group consisting of a single-layer encoding format and a scalable video coding format,
wherein the at least one receiving endpoint is capable of decoding at least one digital video stream encoded in a scalable video coding format,
wherein the conference bridge is configured to combine the input video signals received from the at least one transmitting endpoint into a composite signal that is a single composite encoded digital video output signal, the combining being performed by (i) allocating a portion of the region of the composite signal to the at least one transmitting endpoint intended to be included in the composite signal; (ii) discarding any input video signal data received from the at least one transmitting endpoint that corresponds to one or more of: a resolution higher than that intended for the picture to be composited, data that does not need to be decoded at the resolution intended for the picture to be composited, and a transmitting endpoint not included in the composite signal; and (iii) modifying the header information of the video signals received from the at least one transmitting endpoint that were not discarded, and
wherein the conference bridge is further configured to forward the single composite encoded digital video output signal to the at least one receiving endpoint.

The system of claim 1, wherein the at least one receiving endpoint is capable of decoding video encoded in the H.264 SVC scalable video coding format, wherein the allocation of a portion of the region of the composite output picture to a transmitting endpoint intended to be included in the composite signal is performed by defining a slice group map in a picture parameter set of the composite signal, each transmitting endpoint corresponding to one slice group, and
wherein the allocation of a portion of the region of the composite signal to the transmitting endpoints is communicated to the at least one receiving endpoint by transmitting the picture parameter set to the at least one receiving endpoint.

The system of claim 2, wherein the picture parameter set is configured to be conveyed in-band or out-of-band to the one or more receiving endpoints.

The system of claim 2, wherein the composite signal is further configured to be flagged as used-for-reference when at least one of the input pictures received from the transmitting endpoints included in the composite signal is flagged as used-for-reference, and to be flagged as not-used-for-reference when all of the input pictures received from the transmitting endpoints included in the composite signal are flagged as not-used-for-reference, and
wherein, when the composite signal is flagged as used-for-reference, reference frame reordering commands are inserted into the slices of pictures subsequently received from the transmitting endpoints, prior to their transmission to the at least one receiving endpoint, to ensure proper operation of the reference picture buffer at the one or more receiving endpoints.

The system of claim 2, wherein the NAL extension header for SVC of the NAL units of the composite signal is set such that the same dependency_id value, corresponding to the highest scalable coding layer present in the composite signal, is used for the NAL units of the composite output picture and, likewise, the next lower dependency_id value is used for the NAL units of the next lower layer, and
wherein the temporal_level is set such that, if the incoming pictures from the at least one transmitting endpoint are composited so that their temporal levels are synchronized, the same temporal_level value is used for the NAL units corresponding to the highest scalable coding layer and the next lower temporal_level value is used for the next lower layer, and, if the incoming pictures from the at least one transmitting endpoint are not composited so that their temporal levels are synchronized, the value 0 is used for all NAL units of the composite output picture.

The system of claim 1, wherein the assignment by the conference bridge of a particular portion of the region of the composite output video picture to the video signal of a particular transmitting endpoint is predefined.

The system of claim 1, wherein the assignment of a particular portion of the region of the composite signal to the video signal of a particular transmitting endpoint is performed dynamically by the conference bridge based on:
a request from the receiving endpoint for a specific spatial resolution;
a request from the receiving endpoint for a specific spatial position in the composite signal; or
a combination thereof.

The system of claim 1, wherein the assignment of a particular portion of the region of the composite signal to the video signal of a particular transmitting endpoint is performed by the conference bridge taking into account the decoding capability or resolution preferences of the at least one receiving endpoint.

The system of claim 1, wherein the conference bridge is further configured to accommodate the case in which a new picture of an input video signal does not arrive in time for transmission, by:
sending pre-encoded slice data instructing the at least one receiving endpoint to repeat the data from the previous picture; and
inserting reference picture list reordering commands into the picture headers of subsequent pictures of the input video signal before they are sent to the at least one receiving endpoint, to ensure that proper reference picture selection is performed for the subsequent pictures.

The system of claim 1, wherein the conference bridge is further configured to decode at least the lowest temporal level of the lowest spatial and quality resolution of the video signal received from the at least one transmitting endpoint, and wherein the conference bridge is further configured, if the composite picture configuration for an existing receiving endpoint needs to be changed, to generate an intra coding for the video signal of the affected transmitting endpoint and to send the intra coding to the receiving endpoint in place of the corresponding coded picture data received from the transmitting endpoint.

The system of claim 1, comprising a plurality of conference bridges in a cascade configuration, wherein at least one conference bridge that is not the last in the cascade configuration is configured to
forward the composite coded pictures received from the previous conference bridge in the cascade configuration to another conference bridge, or to decompose the composite coded pictures received from the previous conference bridge in the cascade configuration and recombine them using a different layout before forwarding them to another conference bridge.

The system of claim 1, wherein the conference bridge is configured to receive a request from the receiving endpoint for the desired spatial resolution of the output video signal.

The system of claim 1, wherein the conference bridge is configured to include source identification information or other information for display via one of an in-band bitstream and an out-of-band bitstream.

The system of claim 1, wherein the conference bridge is configured to convey source identification information or other information by overwriting one of (1) pixels of the portion of the region of the composite signal allocated to each participant's signal and (2) pixels of a portion of the region of the composite signal that is not assigned to any of the transmitting participants' video signals.

The conferencing system of claim 1, wherein the conference bridge is further configured to respond to bandwidth conditions by at least one of:
statistically multiplexing video signals from a plurality of transmitting endpoints, and synchronizing the transmission of the video signals received from the transmitting endpoints so as to stagger larger-than-average video pictures in the composite signal.

The conferencing system of claim 1, wherein the conference bridge is further configured to
change the bit rate of the composite output signal by replacing encoded picture data received from the at least one transmitting endpoint with encoded data that instructs the at least one receiving endpoint to copy the corresponding pixel data from a previous picture, and to transmit the replacement encoded data, so that the output bit rate can match desired characteristics.

The conferencing system of claim 1, wherein the conference bridge is further configured to generate an input video signal by generating artificial layer data for at least one of the transmitting endpoints' video signals.

A method for conducting a videoconference between a plurality of endpoints over a communication network, the method comprising:
using a conference bridge linked to at least one receiving endpoint and to at least one transmitting endpoint, each by at least one communication channel;
transmitting encoded digital video from the at least one transmitting endpoint in an encoding format selected from the group consisting of a single-layer encoding format and a scalable video coding format;
combining, at the conference bridge, the input video signals received from the transmitting endpoints and encoded in a scalable video coding format into a composite signal that is a single composite encoded digital video output signal; and
forwarding the single composite encoded digital video output signal to at least one receiving endpoint,
wherein combining into the composite signal comprises:
assigning a portion of the region of the composite signal to each transmitting endpoint intended to be included in the composite signal;
discarding input video signal data received from the transmitting endpoints that corresponds to one of: a resolution higher than that intended for the composited picture, data that does not need to be decoded at the resolution intended for the composited picture, and a transmitting endpoint not included in the composite signal; and
modifying the remaining data of the input encoded video signals by changing header information so as to form proper data of the composite output video signal.

The method of claim 18, wherein the at least one receiving endpoint is capable of decoding video encoded in the H.264 SVC scalable video coding format, and wherein the step of assigning a portion of the region of the composite output picture to each transmitting endpoint intended to be included in the composite signal comprises:
defining a slice group map in a picture parameter set of the composite output signal, wherein each transmitting endpoint corresponds to one slice group; and
transmitting the picture parameter set to the at least one receiving endpoint, so as to communicate to the at least one receiving endpoint the allocation of portions of the region of the composite signal to particular transmitting endpoints.

The method of claim 18, further comprising conveying the picture parameter set in-band or out-of-band to at least one receiving endpoint.

The method of claim 18, further comprising flagging the composite signal as used-for-reference when at least one of the input pictures received from the transmitting endpoints included in the composite signal is flagged as used-for-reference, and flagging the composite signal as not-used-for-reference when all of the input pictures received from the transmitting endpoints included in the composite output picture are flagged as not-used-for-reference,
wherein, when the composite signal is flagged as used-for-reference, reference frame reordering commands are inserted into the slices of pictures subsequently received from the transmitting endpoints, prior to their transmission to the receiving endpoint, to ensure proper operation of the reference picture buffer at the at least one receiving endpoint.

The method of claim 18, further comprising setting the NAL extension header for SVC of the NAL units of the composite output picture such that the same dependency_id value, corresponding to the highest scalable coding layer present in the composite signal, is used for the NAL units of the composite signal and, likewise, the next lower dependency_id value is used for the NAL units of the next lower layer, and
setting the temporal_level such that, if the incoming pictures from the at least one transmitting endpoint are composited so that their temporal levels are synchronized, the same temporal_level value is used for the NAL units corresponding to the highest scalable coding layer and the next lower temporal_level value is used for the next lower layer, and, if the incoming pictures from the at least one transmitting endpoint are not composited so that their temporal levels are synchronized, the value 0 is used for all NAL units of the composite signal.

The method of claim 18, wherein the assignment by the conference bridge of a particular portion of the region of the composite signal to the video signal of a particular transmitting endpoint is predefined.

The method of claim 18, wherein the step of assigning a portion of the region of the composite signal to the video signal of a particular transmitting endpoint is performed dynamically by the conference bridge based on:
a request from the receiving endpoint for a specific spatial resolution;
a request from the receiving endpoint for a specific spatial position in the composite output picture; or
a combination thereof.

The method of claim 18, further comprising taking into account the decoding capability or resolution preferences of the at least one receiving endpoint during the step of assigning portions of the region of the composite signal to the video signal of each transmitting endpoint.

The method of claim 18, further comprising, if a new picture of an input video signal does not arrive at the conference bridge in time for transmission, responding by:
sending pre-encoded slice data instructing the at least one receiving endpoint to repeat the data from the previous picture; and
inserting reference picture list reordering commands into the picture headers of subsequent pictures of the input video signal before they are sent to the at least one receiving endpoint, to ensure that proper reference picture selection is performed for the subsequent pictures.

The method of claim 18, wherein the conference bridge is further configured to decode at least the lowest temporal level of the lowest spatial and quality resolution of the video signal received from the at least one transmitting endpoint, the method further comprising:
generating, in the conference bridge, an intra-coded picture for the video signal of the affected transmitting endpoint if the composite picture configuration for an existing receiving endpoint needs to be changed; and
transmitting the intra-coded picture to the receiving endpoint instead of the corresponding encoded picture data received from the transmitting endpoint.

The method of claim 18, wherein the communication network comprises a plurality of conference bridges in a cascade configuration, the method further comprising, at at least one conference bridge that is not the last in the cascade configuration, selectively:
forwarding composite coded pictures received from a previous conference bridge in the cascade configuration to other conference bridges without processing them; or decomposing the composite coded pictures received from a previous conference bridge in the cascade configuration and recombining them in a different layout before forwarding them to other conference bridges.

The method of claim 18, further comprising receiving, at the conference bridge, a request from a receiving endpoint for a desired spatial resolution of an output video signal.

The method of claim 18, further comprising conveying source identification information and other information in one of an in-band bitstream and an out-of-band bitstream sent by the conference bridge.

The method of claim 18, further comprising, in the conference bridge, overwriting with the source identification information or other conveyed information one of: (1) the pixels of the portion of the composite signal region assigned to each participant's signal; and (2) the pixels of a portion of the composite signal region that is not assigned to any of the transmitting participants' video signals.

The method of claim 18, further comprising using the conference bridge to respond to bandwidth conditions by at least one of:
statistically multiplexing the video signals from multiple transmitting endpoints; and
compositing and transmitting the video signals received from the transmitting endpoints so as to shift larger-than-average video pictures in the composite output video signal.

The method of claim 18, wherein the step of responding to bandwidth conditions using the conference bridge further comprises:
changing the bit rate of the composite output signal by replacing the encoded picture data received from the at least one transmitting endpoint with encoded data that instructs the at least one receiving endpoint to duplicate the corresponding pixel data from a previous picture; and
transmitting the replacement encoded data so that the output bit rate matches a desired characteristic.
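One way to picture the replacement step of the claim above is the following sketch. It assumes that the bridge knows the encoded size of each sender's current picture, that a pre-encoded "repeat previous picture" payload has a small fixed size, and that the largest pictures are replaced first until the composite fits a byte budget. The largest-first policy, `SKIP_SIZE`, and `fit_to_budget` are assumptions made for illustration and are not taken from the claims.

```python
from typing import Dict, List, Tuple

SKIP_SIZE = 16  # hypothetical size, in bytes, of the pre-encoded repeat payload


def fit_to_budget(picture_sizes: Dict[str, int], budget: int,
                  skip_size: int = SKIP_SIZE) -> Tuple[Dict[str, int], List[str]]:
    """Replace the largest pictures with repeat payloads until under budget.

    Returns the per-sender output sizes and the list of replaced senders.
    """
    sizes = dict(picture_sizes)
    replaced: List[str] = []
    # Visit senders from largest to smallest encoded picture.
    for sender in sorted(sizes, key=sizes.get, reverse=True):
        if sum(sizes.values()) <= budget:
            break
        if sizes[sender] > skip_size:
            sizes[sender] = skip_size  # send the repeat payload instead
            replaced.append(sender)
    return sizes, replaced
```

With pictures of 1000, 400, and 300 bytes and a budget of 800 bytes, only the 1000-byte picture is replaced in this sketch, since the remaining total then fits the budget.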

The method of claim 18, wherein the conference bridge is further configured to generate an input video signal by generating artificial layer data for at least one of the video signals of the transmitting endpoints.

A computer-readable medium having a program for causing a computer to execute the method according to any one of claims 18 to 34.